#fleet
Gavin
10/18/2021, 3:37 PM
Testing the waters out here: has anyone seen performance issues on 4.4.0 and greater? We saw increased MySQL load. We have “Fixed” the issue by clearing down the label_membership history.
3:38 PM
This is our DB CPU once we did the upgrade; sadly it wasn’t caught
3:40 PM
This is the CPU utilisation. The first drop is taking the server offline and running migrations to go to 4.4.2; the second is the tail-off after deleting the label_membership history
3:40 PM
Mystery Incorporated
10/18/2021, 3:50 PM
This might be related to my packs not getting assigned if it's a labels thing, I'm also on 4.4.2
3:52 PM
select * from lable_membership;
ERROR 1146 (42S02): Table 'fleet.lable_membership' doesn't exist
3:53 PM
oh i spelled it wrong lol
Jocelyn Bothe
10/18/2021, 3:53 PM
I haven't had a chance to upgrade past 4.3.1 yet, but from the sound of this, maybe I should hold off for a bit...
Tomas Touceda
10/18/2021, 3:54 PM
@Gavin are you using policies?
Gavin
10/18/2021, 3:54 PM
We were
3:54 PM
We nuked them also from the DB
3:55 PM
But that was last week; still trying to figure out whether the policies led to this or not. It was only resolved by clearing down label_membership.
Tomas Touceda
10/18/2021, 3:55 PM
right, so there was a db issue in 4.4.0 around policies that caused increased db usage, but that should be fixed in 4.4.2
3:55 PM
do you have any more data as to what queries were running causing that amount of CPU usage?
Gavin
10/18/2021, 3:55 PM
Honestly I don’t; everything looked normal.
Tomas Touceda
10/18/2021, 3:56 PM
ok, so are things back to normal currently on 4.4.2? or did the issue persist?
Gavin
10/18/2021, 3:57 PM
After a clear down everything is resuming.
Tomas Touceda
10/18/2021, 3:57 PM
I see
Gavin
10/18/2021, 3:57 PM
Actually, the only real difference: before the clear down, SHOW PROCESSLIST had around 2k active processes from fleet.
3:57 PM
Now excluding sleeps it’s 1
Tomas Touceda
10/18/2021, 3:59 PM
I see, well I'm very sorry you had to go in and delete things by hand, that's not ideal in any case. Let me know if you see this behavior again and we can debug further
Gavin
10/18/2021, 3:59 PM
I am going to slowly re-add things like vulnerability and software reporting, a bit each day over the next week.
3:59 PM
These things happen; best to raise it and see if it’s known or observed, so we can fix it for other people wanting to upgrade.
3:59 PM
Even if it’s just knowing we need to do an additional DB migration.
Tomas Touceda
10/18/2021, 4:00 PM
definitely. We haven't seen any other reports like this
4:00 PM
what's your setup for migrations?
Gavin
10/18/2021, 4:00 PM
Stop Fleet, take a snapshot, run migrations using a K8s job, then proceed.
4:01 PM
One item I am thinking of now is how multiple instances and scale-out work.
Tomas Touceda
10/18/2021, 4:02 PM
that migration dance sounds good. What are you thinking regarding scaling?
Gavin
10/18/2021, 4:02 PM
Now that more functionality is being added, how is quorum decided for operations? To prevent, for example, vuln data being downloaded by every instance: the docs say it only happens on a single host, but what if every instance tries to do an operation at the same time?
4:02 PM
Aka leader election.
Tomas Touceda
10/18/2021, 4:04 PM
we have a locking mechanism for different resources, so the first instance to get the lock wins, and the instance that holds a lock can change over time
4:04 PM
locks last an hour or so, depending on the case
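The lock pattern Tomas describes can be sketched roughly like this; a minimal illustration only, using SQLite in place of MySQL, with assumed table and column names rather than Fleet's actual schema:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE locks (
        name TEXT PRIMARY KEY,    -- resource being locked, e.g. 'vulnerabilities'
        owner TEXT NOT NULL,      -- instance currently holding the lock
        expires_at REAL NOT NULL  -- unix timestamp when the lock lapses
    )
""")

def acquire_lock(name, owner, duration=3600):
    """First instance to insert wins; an expired lock may be taken over,
    so the holder of a lock can change over time."""
    now = time.time()
    cur = conn.execute(
        """
        INSERT INTO locks (name, owner, expires_at) VALUES (?, ?, ?)
        ON CONFLICT(name) DO UPDATE SET
            owner = excluded.owner,
            expires_at = excluded.expires_at
        WHERE locks.expires_at < ?
        """,
        (name, owner, now + duration, now),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted/updated means we got the lock

# First instance to ask gets the lock; the others loop, wait, and retry later.
assert acquire_lock("vulnerabilities", "fleet-1") is True
assert acquire_lock("vulnerabilities", "fleet-2") is False
```

The upsert-with-WHERE makes acquisition a single atomic statement, so two instances racing for the same resource can't both win.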
Gavin
10/18/2021, 4:07 PM
Hmm, is the lock in the DB?
Tomas Touceda
10/18/2021, 4:07 PM
it is
Gavin
10/18/2021, 4:09 PM
I have not looked at the code, but what is the behaviour if an instance tries to get a lock and the DB times out or fails to return a result? I assume it’s not fail-open on the assumption that no lock exists, and that it’s more of a boolean lock: not in use / in use
Tomas Touceda
10/18/2021, 4:10 PM
if it fails to acquire a lock, it assumes it's not locked, so it loops, waits, and retries later
Gavin
10/18/2021, 4:13 PM
Just an FYI: I have quickly dropped the vuln table, increased the instance count to 3, and seen no spike in DB CPU outside of when the table was being populated; active ops is still down in single figures.
4:13 PM
Will monitor and then re-add policies tomorrow.
Tomas Touceda
10/18/2021, 4:14 PM
great, thank you for this thorough work!
Gavin
10/18/2021, 4:51 PM
@Tomas Touceda Spoke too soon, I have circa 2000 open processes of
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=? AND TRUE
4:54 PM
Note it is all three replicas making this query.
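A side note on the query above: nothing in it ties host_software back to software (there is no hs.software_id = s.id condition), so each host's lookup appears to scan every software/CPE pair regardless of what the host actually has installed. If that reading is right, here is a toy illustration of how a missing join condition multiplies rows (SQLite used for illustration, with made-up row counts):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE host_software (host_id INT, software_id INT);
    CREATE TABLE software (id INT);
    CREATE TABLE software_cpe (id INT, software_id INT);
""")
# One host with 2 software items installed, out of 1000 known software rows,
# each of which has a matching CPE row.
conn.executemany("INSERT INTO software VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO software_cpe VALUES (?, ?)", [(i, i) for i in range(1000)])
conn.executemany("INSERT INTO host_software VALUES (1, ?)", [(1,), (2,)])

# Without the hs-to-s condition, the result is
# (host software rows) x (all software/CPE pairs):
unconstrained = conn.execute("""
    SELECT COUNT(*) FROM host_software hs
    JOIN software s
    JOIN software_cpe scp ON s.id = scp.software_id
    WHERE hs.host_id = 1
""").fetchone()[0]

# With the missing condition added, only the host's own software is scanned:
constrained = conn.execute("""
    SELECT COUNT(*) FROM host_software hs
    JOIN software s ON hs.software_id = s.id
    JOIN software_cpe scp ON s.id = scp.software_id
    WHERE hs.host_id = 1
""").fetchone()[0]

print(unconstrained, constrained)  # 2000 2
```

Scaled up to ~48k software rows and ~1000 hosts checking in, that kind of multiplication would go a long way toward explaining sustained DB CPU.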
Tomas Touceda
10/18/2021, 4:54 PM
hm, that happens when we are checking the stored software before saving more
4:54 PM
it's not only for vulnerability processing
4:54 PM
could you tell me a bit more about your infra? how many hosts, fleet instances, database size/type/version?
Gavin
10/18/2021, 4:55 PM
1000 hosts, 3 fleet instances; DB is GCP managed MySQL 5.x, 4 CPU cores, 16 GB RAM.
4:56 PM
select count(*) from software;
+----------+
| count(*) |
+----------+
|    48578 |
+----------+
1 row in set (9.12 sec)
4:56 PM
select count(*) from software_cpe;
+----------+
| count(*) |
+----------+
|     1070 |
+----------+
1 row in set (0.07 sec)
4:57 PM
select count(*) from software_cve;
+----------+
| count(*) |
+----------+
|   122201 |
+----------+
1 row in set (5.38 sec)
4:58 PM
Actually this is easier
SELECT table_name, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'fleet';
+------------------------------------+------------+
| table_name                         | TABLE_ROWS |
+------------------------------------+------------+
| activities                         |         58 |
| app_config_json                    |          0 |
| app_configs                        |          1 |
| carve_blocks                       |          0 |
| carve_metadata                     |          0 |
| distributed_query_campaign_targets |        623 |
| distributed_query_campaigns        |        571 |
| email_changes                      |         15 |
| enroll_secrets                     |          1 |
| host_additional                    |       1281 |
| host_software                      |     480577 |
| host_users                         |      37346 |
| hosts                              |       1363 |
| invite_teams                       |          0 |
| invites                            |          2 |
| label_membership                   |       2171 |
| labels                             |          7 |
| locks                              |          0 |
| migration_status_data              |         11 |
| migration_status_tables            |         98 |
| network_interfaces                 |       9598 |
| osquery_options                    |          3 |
| pack_targets                       |         12 |
| packs                              |         16 |
| password_reset_requests            |          0 |
| policies                           |          0 |
| policy_membership                  |       NULL |
| policy_membership_history          |          0 |
| queries                            |        969 |
| scheduled_queries                  |        369 |
| scheduled_query_stats              |     296851 |
| sessions                           |         51 |
| software                           |      48615 |
| software_cpe                       |        987 |
| software_cve                       |     122111 |
| statistics                         |          0 |
| teams                              |          0 |
| user_teams                         |          0 |
| users                              |          8 |
+------------------------------------+------------+
Tomas Touceda
10/18/2021, 5:02 PM
hm, that's an interesting number of CVEs found. But it shouldn't be a problem. Could you tell me the config you are using for the fleet instance itself?
Gavin
10/18/2021, 5:06 PM
The k8s config or the TLS server opts?
Tomas Touceda
10/18/2021, 5:08 PM
it would be the options you use to execute fleet serve itself
Gavin
10/18/2021, 5:10 PM
containers:
        - name: fleet-webserver
          image: fleetdm/fleet:v4.4.2
          command: [fleet, serve]
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: global-star-csec-tls
              mountPath: /secrets/fleet-tls
              readOnly: true
            - name: log-storage
              mountPath: /var/log/
          env:
            - name: FLEET_BETA_SOFTWARE_INVENTORY
              value: '1'
            - name: FLEET_VULNERABILITIES_DATABASES_PATH
              value: /var/log/
            - name: FLEET_MYSQL_ADDRESS
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: address
            - name: FLEET_MYSQL_DATABASE
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: database
            - name: FLEET_MYSQL_USERNAME
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: username
            - name: FLEET_MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: password
            - name: FLEET_REDIS_ADDRESS
              value: 10.0.0.3:6379
            - name: FLEET_AUTH_JWT_KEY
              valueFrom:
                secretKeyRef:
                  name: fleet-server-auth-key
                  key: fleet-server-auth-key
            - name: FLEET_SERVER_ADDRESS
              value: 0.0.0.0:8080
            - name: FLEET_SERVER_CERT
              value: /secrets/fleet-tls/tls.crt
            - name: FLEET_SERVER_KEY
              value: /secrets/fleet-tls/tls.key
            - name: FLEET_LOGGING_JSON
              value: 'true'
            - name: FLEET_OSQUERY_STATUS_LOG_PLUGIN
              value: filesystem
            - name: FLEET_FILESYSTEM_STATUS_LOG_FILE
              value: /var/log/osqueryd.status.log
            - name: FLEET_FILESYSTEM_RESULT_LOG_FILE
              value: /var/log/osqueryd.results.log
5:12 PM
Hmm, interesting find within software_cve: I have cases where there are several thousand rows for the same CVE
select * from software_cve where cve='CVE-2021-37841';
+---------+--------+----------------+---------------------+---------------------+
| id      | cpe_id | cve            | created_at          | updated_at          |
+---------+--------+----------------+---------------------+---------------------+
|     334 |   NULL | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     337 |    120 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     341 |    123 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     343 |    122 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     345 |    125 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     346 |    124 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     347 |    126 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     348 |   NULL | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|    4835 |   NULL | CVE-2021-37841 | 2021-09-22 14:54:49 | 2021-09-22 14:54:49 |
|    4848 |   NULL | CVE-2021-37841 | 2021-09-22 14:54:49 | 2021-09-22 14:54:49 |
|    9332 |   NULL | CVE-2021-37841 | 2021-09-22 15:54:53 | 2021-09-22 15:54:53 |
|    9349 |   NULL | CVE-2021-37841 | 2021-09-22 15:54:53 | 2021-09-22 15:54:53 |
|   13833 |   NULL | CVE-2021-37841 | 2021-09-22 16:54:49 | 2021-09-22 16:54:49 |
|   13847 |   NULL | CVE-2021-37841 | 2021-09-22 16:54:49 | 2021-09-22 16:54:49 |
5:13 PM
select count(*) from software_cve where cve='CVE-2021-37841';
+----------+
| count(*) |
+----------+
|     1007 |
+----------+
5:14 PM
| CVE-2021-32751   |       1005 |
| CVE-2021-32761   |          6 |
| CVE-2021-32777   |          1 |
| CVE-2021-32778   |          1 |
| CVE-2021-32779   |          1 |
| CVE-2021-32781   |          1 |
| CVE-2021-3282    |          1 |
| CVE-2021-32923   |          5 |
| CVE-2021-33194   |       4014 |
| CVE-2021-33195   |       4014 |
| CVE-2021-33196   |       4014 |
| CVE-2021-33197   |       4015 |
| CVE-2021-33198   |       4014 |
| CVE-2021-3331    |          2 |
| CVE-2021-33361   |          1 |
| CVE-2021-33362   |          1 |
Tomas Touceda
10/18/2021, 5:19 PM
hm, cpe_id should always be defined
5:21 PM
hm, the unique constraint doesn't take into account NULLs, this could've been caused by the clearing of the tables by hand
5:21 PM
maybe while the vulnerability processing was running
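The NULL behaviour Tomas mentions is standard SQL: a unique index treats NULLs as distinct values, so rows with a NULL cpe_id can repeat freely. A small demonstration, assuming a unique index over (cpe_id, cve) as the conversation implies (SQLite shown; MySQL unique indexes handle NULLs the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE software_cve (
        id INTEGER PRIMARY KEY,
        cpe_id INTEGER,
        cve TEXT,
        UNIQUE (cpe_id, cve)  -- assumed constraint for illustration
    )
""")

# A duplicate with a real cpe_id is rejected by the unique constraint...
conn.execute("INSERT INTO software_cve (cpe_id, cve) VALUES (120, 'CVE-2021-37841')")
try:
    conn.execute("INSERT INTO software_cve (cpe_id, cve) VALUES (120, 'CVE-2021-37841')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# ...but NULL cpe_ids sail straight past it, e.g. one more per processing run.
for _ in range(3):
    conn.execute("INSERT INTO software_cve (cpe_id, cve) VALUES (NULL, 'CVE-2021-37841')")

count = conn.execute(
    "SELECT COUNT(*) FROM software_cve WHERE cve = 'CVE-2021-37841'"
).fetchone()[0]
print(duplicate_rejected, count)  # True 4
```

That would match the pattern in the output above: a handful of rows with real cpe_ids, then NULL-cpe_id duplicates accumulating on every hourly run.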
5:23 PM
for these cases, you can check the software_cpe with those ids (120, 123, etc...):
|     337 |    120 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     341 |    123 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     343 |    122 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     345 |    125 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     346 |    124 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     347 |    126 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
Gavin
10/18/2021, 5:25 PM
As such
mysql> select * from software_cpe where id = 120;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 120 |    45494907 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.1.0:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (1.90 sec)

mysql> select * from software_cpe where id = 123;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 123 |    45473432 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.3.3:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.23 sec)

mysql> select * from software_cpe where id = 122;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 122 |    45474569 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.3.1:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.18 sec)

mysql> select * from software_cpe where id = 125;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 125 |    45491834 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.5.1:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.05 sec)

mysql> select * from software_cpe where id = 124;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 124 |    45494278 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.4.0:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.04 sec)
Tomas Touceda
10/18/2021, 5:26 PM
right, so it's different versions of docker that all seem to have the same CVE
5:28 PM
so has this been a peak in db usage? or has it been sustained?
Gavin
10/18/2021, 5:29 PM
Sustained.
Tomas Touceda
10/18/2021, 5:31 PM
you might want to play around with osquery.max_jitter_percent; it defaults to 10%, but you could push that to 50% and spread out the hosts checking in
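For the k8s manifest shown earlier, that would presumably be one more env entry; a sketch, assuming osquery.max_jitter_percent follows Fleet's usual FLEET_SECTION_KEY env-var mapping (worth verifying against the config docs for your version):

```yaml
# Assumed env-var name for osquery.max_jitter_percent; verify before use.
- name: FLEET_OSQUERY_MAX_JITTER_PERCENT
  value: '50'
```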
Gavin
10/18/2021, 5:33 PM
Shall do.
Tomas Touceda
10/18/2021, 5:33 PM
maybe disable vulnerability processing for now to reduce the noise
5:34 PM
once the usage is reasonable, you can enable again
5:34 PM
and you can also increase that interval, if it's too intense, you can do it less often
5:35 PM
how are the fleet instances doing resource-wise? what type of instance are you using for fleet serve?
Gavin
10/18/2021, 5:36 PM
They’re on a k8s cluster with 16gb available to them, but they’re not constrained in any way.
5:37 PM
sitting at <2gb memory used and no CPU contention.
10:14 PM
Looping back to this: clearing the CPE and CVE tables and doing another run seems to have resolved the issue; also no NULLs etc.
Tomas Touceda
10/19/2021, 10:18 PM
great! won't call it a victory just yet, just in case 🙂
Gavin
10/19/2021, 10:22 PM
Yep, will loop back in a few more days to confirm all good
Tomas Touceda
10/20/2021, 5:18 PM
@Gavin fyi, we're going to be cutting a 4.4.3 that fixes the issue that was causing intense CPU usage in the db
5:27 PM
@Jocelyn Bothe do you have software inventory enabled? I recall you didn't some time ago
Jocelyn Bothe
10/20/2021, 5:28 PM
nope, we use other agents for that
5:28 PM
depending on how our contract negotiations go though...
Tomas Touceda
10/20/2021, 5:29 PM
ok, then you won't be impacted by this, but I suppose it's good to wait for 4.4.3