# fleet
g
Feeling out the waters here: has anyone seen performance issues on 4.4.0 and greater? We saw increased MySQL load; we have “fixed” the issue by clearing down the `label_membership` history.
This is our DB CPU once we did the upgrade; sadly it wasn’t caught.
😳 1
This is the CPU utilisation. The first drop is taking the server offline and running migrations to go to 4.4.2; the second is the phase down after deleting the `label_membership` history.
m
This might be related to my packs not getting assigned if it's a labels thing, I'm also on 4.4.2
```
select * from lable_membership;
ERROR 1146 (42S02): Table 'fleet.lable_membership' doesn't exist
```
oh i spelled it wrong lol
j
I haven't had a chance to upgrade past 4.3.1 yet, but from the sound of this, maybe I should hold off for a bit...
t
@Gavin are you using policies?
g
We were
We nuked them also from the DB
But that was last week; still trying to figure out whether the policies led to this or not, since it was only resolved by clearing down `label_membership`.
t
right, so there was a DB issue in 4.4.0 around policies that caused increased DB usage, but that should be fixed in 4.4.2
do you have any more data as to what queries were running causing that amount of CPU usage?
g
Honestly I don’t; everything looked normal.
t
ok, so are things back to normal currently on 4.4.2? or did the issue persist?
g
After a clear down everything is resuming.
t
I see
g
Actually, the only real difference: before the clear down, `SHOW PROCESSLIST` showed around 2k active processes from Fleet.
Now, excluding sleeps, it’s 1.
t
I see, well I'm very sorry you had to go in and delete things by hand, that's not ideal in any case. Let me know if you see this behavior again and we can debug further
g
I am going to slowly re-add things like vulnerability and software reporting, a bit each day over the next week.
👍 1
These things happen; best to raise it and see if it’s known or observed, so we can fix it for other people wanting to upgrade.
Even if it’s just knowing we need to do an additional DB migration.
t
definitely. We haven't seen any other reports like this
what's your setup for migrations?
g
Stop Fleet, take a snapshot, run migrations using a K8s job, then proceed.
One item I am thinking of now is how multiple instances and scale-out work.
t
that migration dance sounds good. What are you thinking regarding scaling?
g
Now that more functionality is being added, how is quorum decided for operations, to prevent for example vuln data being downloaded by every instance? The docs say it’s only on a single host, but what if every instance tries to do an operation at the same time?
Aka leader election.
t
we have a locking mechanism for different resources, so the first instance to get the lock wins, and the instance that holds a lock can change over time
locks last an hour or so, depending on the case
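The first-writer-wins pattern described here can be sketched roughly like this, using SQLite in place of MySQL and hypothetical column names (Fleet’s actual `locks` table schema may differ):

```python
import sqlite3
import time
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY, owner TEXT, expires_at REAL)")

def acquire_lock(conn, name, owner, ttl=3600):
    """First instance to insert the row wins; an expired lock can be claimed by a new owner."""
    now = time.time()
    # Clear out an expired lock so ownership can change over time.
    conn.execute("DELETE FROM locks WHERE name = ? AND expires_at < ?", (name, now))
    try:
        conn.execute(
            "INSERT INTO locks (name, owner, expires_at) VALUES (?, ?, ?)",
            (name, owner, now + ttl),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Someone else holds the lock: don't do the work; loop, wait, retry later.
        return False

a, b = str(uuid.uuid4()), str(uuid.uuid4())
print(acquire_lock(conn, "vulnerabilities", a))  # True: first instance gets the lock
print(acquire_lock(conn, "vulnerabilities", b))  # False: lock already held
```

In MySQL the same effect is typically achieved with a unique key on the lock name, so the insert either succeeds for exactly one instance or fails for the rest.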
g
Hmm is the lock in the DB ?
t
it is
g
I have not looked at the code, but what is the behaviour if an instance tries to get a lock and the DB times out or fails to return a result? I assume it’s not fail-open on the assumption that no lock exists, and it’s more of a boolean lock, not-in-use / in-use.
t
if it fails to acquire a lock, it assumes it doesn’t hold the lock (fail closed), so it loops, waits, and retries later
g
Just an FYI: I have quickly dropped the vuln table, increased the instance count to 3, and seen no spike in DB CPU outside of when the table was being populated, and active ops is still down in single figures.
Will monitor and then re-add policies tomorrow.
t
great, thank you for this thorough work!
g
@Tomas Touceda Spoke too soon, I have circa 2000 open processes of
```sql
SELECT DISTINCT s.id, scv.cve
		FROM host_software hs
		JOIN hosts h ON (hs.host_id=h.id)
		JOIN software s
		JOIN software_cpe scp ON (s.id=scp.software_id)
		JOIN software_cve scv ON (scp.id=scv.cpe_id)
		WHERE hs.host_id=? AND TRUE
```
Note it is all three replicas making this query.
t
hm, that happens when we are checking the stored software before saving more
it's not only for vulnerability processing
could you tell me a bit more about your infra? how many hosts, fleet instances, database size/type/version?
g
1000 hosts, 3 Fleet instances; DB is GCP managed MySQL 5.x, 4 CPU cores, 16GB RAM.
```
select count(*) from software;
+----------+
| count(*) |
+----------+
|    48578 |
+----------+
1 row in set (9.12 sec)
```
```
select count(*) from software_cpe;
+----------+
| count(*) |
+----------+
|     1070 |
+----------+
1 row in set (0.07 sec)
```
```
select count(*) from software_cve;
+----------+
| count(*) |
+----------+
|   122201 |
+----------+
1 row in set (5.38 sec)
```
Actually this is easier
```
SELECT table_name, TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'fleet';
+------------------------------------+------------+
| table_name                         | TABLE_ROWS |
+------------------------------------+------------+
| activities                         |         58 |
| app_config_json                    |          0 |
| app_configs                        |          1 |
| carve_blocks                       |          0 |
| carve_metadata                     |          0 |
| distributed_query_campaign_targets |        623 |
| distributed_query_campaigns        |        571 |
| email_changes                      |         15 |
| enroll_secrets                     |          1 |
| host_additional                    |       1281 |
| host_software                      |     480577 |
| host_users                         |      37346 |
| hosts                              |       1363 |
| invite_teams                       |          0 |
| invites                            |          2 |
| label_membership                   |       2171 |
| labels                             |          7 |
| locks                              |          0 |
| migration_status_data              |         11 |
| migration_status_tables            |         98 |
| network_interfaces                 |       9598 |
| osquery_options                    |          3 |
| pack_targets                       |         12 |
| packs                              |         16 |
| password_reset_requests            |          0 |
| policies                           |          0 |
| policy_membership                  |       NULL |
| policy_membership_history          |          0 |
| queries                            |        969 |
| scheduled_queries                  |        369 |
| scheduled_query_stats              |     296851 |
| sessions                           |         51 |
| software                           |      48615 |
| software_cpe                       |        987 |
| software_cve                       |     122111 |
| statistics                         |          0 |
| teams                              |          0 |
| user_teams                         |          0 |
| users                              |          8 |
+------------------------------------+------------+
```
t
hm, that's an interesting number of CVEs found. But it shouldn't be a problem, could you tell me the config you are using for the fleet instance itself?
g
The k8s or the TLS server opts ?
t
it would be the options you use to execute `fleet serve` itself
g
```yaml
containers:
        - name: fleet-webserver
          image: fleetdm/fleet:v4.4.2
          command: [fleet, serve]
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: global-star-csec-tls
              mountPath: /secrets/fleet-tls
              readOnly: true
            - name: log-storage
              mountPath: /var/log/
          env:
            - name: FLEET_BETA_SOFTWARE_INVENTORY
              value: '1'
            - name: FLEET_VULNERABILITIES_DATABASES_PATH
              value: /var/log/
            - name: FLEET_MYSQL_ADDRESS
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: address
            - name: FLEET_MYSQL_DATABASE
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: database
            - name: FLEET_MYSQL_USERNAME
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: username
            - name: FLEET_MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: fleet-mysql
                  key: password
            - name: FLEET_REDIS_ADDRESS
              value: 10.0.0.3:6379
            - name: FLEET_AUTH_JWT_KEY
              valueFrom:
                secretKeyRef:
                  name: fleet-server-auth-key
                  key: fleet-server-auth-key
            - name: FLEET_SERVER_ADDRESS
              value: 0.0.0.0:8080
            - name: FLEET_SERVER_CERT
              value: /secrets/fleet-tls/tls.crt
            - name: FLEET_SERVER_KEY
              value: /secrets/fleet-tls/tls.key
            - name: FLEET_LOGGING_JSON
              value: 'true'
            - name: FLEET_OSQUERY_STATUS_LOG_PLUGIN
              value: filesystem
            - name: FLEET_FILESYSTEM_STATUS_LOG_FILE
              value: /var/log/osqueryd.status.log
            - name: FLEET_FILESYSTEM_RESULT_LOG_FILE
              value: /var/log/osqueryd.results.log
```
Hmm, interesting find within `software_cve`: I have cases where there are several thousand instances of the same CVE
```
select * from software_cve where cve='CVE-2021-37841';
+---------+--------+----------------+---------------------+---------------------+
| id      | cpe_id | cve            | created_at          | updated_at          |
+---------+--------+----------------+---------------------+---------------------+
|     334 |   NULL | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     337 |    120 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     341 |    123 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     343 |    122 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     345 |    125 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     346 |    124 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     347 |    126 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     348 |   NULL | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|    4835 |   NULL | CVE-2021-37841 | 2021-09-22 14:54:49 | 2021-09-22 14:54:49 |
|    4848 |   NULL | CVE-2021-37841 | 2021-09-22 14:54:49 | 2021-09-22 14:54:49 |
|    9332 |   NULL | CVE-2021-37841 | 2021-09-22 15:54:53 | 2021-09-22 15:54:53 |
|    9349 |   NULL | CVE-2021-37841 | 2021-09-22 15:54:53 | 2021-09-22 15:54:53 |
|   13833 |   NULL | CVE-2021-37841 | 2021-09-22 16:54:49 | 2021-09-22 16:54:49 |
|   13847 |   NULL | CVE-2021-37841 | 2021-09-22 16:54:49 | 2021-09-22 16:54:49 |
```
```
select count(*) from software_cve where cve='CVE-2021-37841';
+----------+
| count(*) |
+----------+
|     1007 |
+----------+
```
```
| CVE-2021-32751   |       1005 |
| CVE-2021-32761   |          6 |
| CVE-2021-32777   |          1 |
| CVE-2021-32778   |          1 |
| CVE-2021-32779   |          1 |
| CVE-2021-32781   |          1 |
| CVE-2021-3282    |          1 |
| CVE-2021-32923   |          5 |
| CVE-2021-33194   |       4014 |
| CVE-2021-33195   |       4014 |
| CVE-2021-33196   |       4014 |
| CVE-2021-33197   |       4015 |
| CVE-2021-33198   |       4014 |
| CVE-2021-3331    |          2 |
| CVE-2021-33361   |          1 |
| CVE-2021-33362   |          1 |
```
t
hm, cpe_id should always be defined
hm, the unique constraint doesn't take NULLs into account; this could've been caused by clearing the tables by hand
maybe while the vulnerability processing was running
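The NULL behaviour mentioned above is standard SQL: a UNIQUE constraint permits any number of rows with NULL in a constrained column. A quick illustration with a simplified, hypothetical version of the schema (SQLite here, but MySQL's InnoDB treats NULLs in unique indexes the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical simplified schema: uniqueness on (cpe_id, cve).
conn.execute(
    "CREATE TABLE software_cve ("
    "id INTEGER PRIMARY KEY, cpe_id INTEGER, cve TEXT, UNIQUE (cpe_id, cve))"
)

# Two identical rows with a NULL cpe_id both insert successfully...
conn.execute("INSERT INTO software_cve (cpe_id, cve) VALUES (NULL, 'CVE-2021-37841')")
conn.execute("INSERT INTO software_cve (cpe_id, cve) VALUES (NULL, 'CVE-2021-37841')")

# ...while a duplicate with a non-NULL cpe_id is rejected by the constraint.
conn.execute("INSERT INTO software_cve (cpe_id, cve) VALUES (120, 'CVE-2021-37841')")
try:
    conn.execute("INSERT INTO software_cve (cpe_id, cve) VALUES (120, 'CVE-2021-37841')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

n = conn.execute(
    "SELECT COUNT(*) FROM software_cve WHERE cve = 'CVE-2021-37841'"
).fetchone()[0]
print(n)  # 3: two NULL-cpe_id duplicates plus one real row
```

This matches the pattern in the output above: each vulnerability-processing run could insert a fresh NULL-cpe_id row for the same CVE without tripping the constraint.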
for these cases, you can check the `software_cpe` rows with those ids (120, 123, etc.):
```
|     337 |    120 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     341 |    123 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     343 |    122 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     345 |    125 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     346 |    124 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
|     347 |    126 | CVE-2021-37841 | 2021-09-22 13:55:06 | 2021-09-22 13:55:06 |
```
g
As such
```
mysql> select * from software_cpe where id = 120;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 120 |    45494907 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.1.0:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (1.90 sec)

mysql> select * from software_cpe where id = 123;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 123 |    45473432 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.3.3:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.23 sec)

mysql> select * from software_cpe where id = 122;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 122 |    45474569 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.3.1:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.18 sec)

mysql> select * from software_cpe where id = 125;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 125 |    45491834 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.5.1:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.05 sec)

mysql> select * from software_cpe where id = 124;
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| id  | software_id | created_at          | updated_at          | cpe                                                |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
| 124 |    45494278 | 2021-09-22 13:54:19 | 2021-09-22 13:54:19 | cpe:2.3:a:docker:desktop:3.4.0:*:*:*:*:windows:*:* |
+-----+-------------+---------------------+---------------------+----------------------------------------------------+
1 row in set (0.04 sec)
```
t
right, so it's different versions of Docker that all seem to have the same CVE
so has this been a peak in DB usage, or has it been sustained?
g
Sustained.
t
you might want to play around with `osquery.max_jitter_percent`; it defaults to 10%, but you could push that to 50% and spread out the hosts checking in
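For intuition, jitter randomises each host's check-in interval around the base value so the hosts don't all hit the server at once. A rough sketch of the effect (the exact formula osquery/Fleet applies may differ):

```python
import random

def jittered_interval(base_seconds, max_jitter_percent):
    """Sketch: offset the base interval by a random amount up to
    +/- max_jitter_percent. Illustrative only; the real implementation
    in osquery/Fleet may compute this differently."""
    jitter = base_seconds * max_jitter_percent / 100.0
    return base_seconds + random.uniform(-jitter, jitter)

random.seed(0)
base = 3600  # e.g. an hourly check-in
samples = [jittered_interval(base, 50) for _ in range(1000)]
# At 50% jitter, check-ins spread across roughly 1800-5400 seconds
# instead of all 1000 hosts landing at exactly 3600.
print(min(samples), max(samples))
```

The wider the jitter, the flatter the load curve on the database, at the cost of less predictable check-in times per host.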
g
Shall do.
t
maybe disable vulnerability processing for now to reduce the noise
once the usage is reasonable, you can enable again
and you can also increase that interval; if it's too intense, you can run it less often
how are the fleet instances doing resource-wise? what type of instance are you using for fleet serve?
g
They’re on a k8s cluster with 16GB available to them, but they’re not constrained in any way;
sitting at <2GB memory used and no CPU contention.
Looping back to this: clearing down the CPE and CVE tables and doing another run seems to have resolved the issue, also no NULLs etc.
t
great! won't call it a victory just yet, just in case 🙂
g
Yip, will loop back in a few more days to confirm all good
👍 1
t
@Gavin fyi, we're going to be cutting a 4.4.3 that fixes the issue that was causing intense CPU usage in the db
👍 1
ty 1
@Jocelyn Bothe do you have software inventory enabled? I recall you didn't some time ago
j
nope, we use other agents for that
depending on how our contract negotiations go though...
t
ok, then you won't be impacted by this, but I suppose it's good to wait for 4.4.3