Title
#kolide
s

Sal

09/30/2020, 3:44 PM
Can anyone provide some assistance or guidance on MYSQL deadlock/locks. We have 60 fleet host connecting to a MYSQL percona cluster of 4 hosts. For the past two days we've been getting deadlocks/lock on MYSQL that seem to increase overtime and cause the ALL the fleet hosts to eventually exhaust memory causing a denial of service. Fleet becomes unresponsive and subsequently host fail to report in. We are running fleet version 3.0.0-3-g0619581b. Our DBA folks seem to think it's an application issue with fleet hitting the hosts table.
DBA team response: 
Essentially this is not db server tuning issue, but an app tuning issue. I understand this needs to be related to the vendor, however as you can see from the Humio links inserts & updates are clashing on the 'hosts' table. The correct approach is to minimize these deadlock situations arising in the first place. The engine will detect and roll back 1 transaction - eg in detail:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2020-09-30 09:46:45 0x7de11cce9700
*** (1) TRANSACTION:
TRANSACTION 35762332832, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 1157846, OS thread handle 139400941860608, query id 14939859382 <http://prod-10-32-7-83.pw1.bcc.somehost.com|prod-10-32-7-83.pw1.bcc.somehost.com> 10.32.7.83 tlsproddb_user updating
UPDATE hosts SET
			seen_time = '2020-09-30 13:46:45.647691'
		WHERE node_key='6u8qphrXq6FhlWXYZfYWHT2xwBHX+SsN'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 771 page no 7 n bits 88 index PRIMARY of table `tlsproddb`.`hosts` trx id 35762332832 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 35762332830, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 1157730, OS thread handle 138405804414720, query id 14939859380 <http://prod-10-32-1-12.pw1.bcc.somehost.com|prod-10-32-1-12.pw1.bcc.somehost.com> 10.32.1.12 tlsproddb_user update
INSERT INTO hosts (
			detail_update_time,
			label_update_time,
			osquery_host_id,
			seen_time,
			node_key,
			enroll_secret_name
		) VALUES ('1970-01-02 00:00:00', '1970-01-02 00:00:00', 'DB938000-FD74-11E7-8000-000000000000', '2020-09-30 13:46:45.647665', 'WqAC/7dPeDzkh90b5yb4TIIvvJrgd4ut', 'default')
		ON DUPLICATE KEY UPDATE
			node_key = VALUES(node_key),
			deleted = FALSE
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 771 page no 7 n bits 88 index PRIMARY of table `tlsproddb`.`hosts` trx id 35762332830 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 771 page no 3489 n bits 248 index idx_host_unique_nodekey of table `tlsproddb`.`hosts` trx id 35762332830 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (1)
Any help on this matter would be greatly appreciated.
zwass

zwass

10/01/2020, 7:02 PM
Is this connected to your earlier report https://osquery.slack.com/archives/C1XCLA5DZ/p1601375347041700? Do you have a number of new hosts enrolling at this time? What is the total number of enrolled hosts?
a

Adam

10/02/2020, 7:17 AM
the number of hosts enrolled has remained fairly static, which is roughly around 50k
zwass

zwass

10/02/2020, 7:03 PM
Can you run
select @@global.transaction_isolation, @@global.tx_isolation, @@global.autocommit;
in your db?
7:49 PM
Do you see errors like
ts=2020-10-02T19:48:16.182951Z component=http err="authentication error: invalid node key: g95Bpj/mvBjjmYoJqEbCwXK3rt9UJiX4"
near to the
method=EnrollAgent
lines in the Fleet stderr logs?
s

Sal

10/02/2020, 7:50 PM
Yea, i believe so.
7:51 PM
just prior to these messages.
zwass

zwass

10/02/2020, 7:51 PM
Is there a chance you have hosts with the same host_identifier from osquery? Perhaps because of a cloned VM image or something like that?
7:54 PM
I'm going to add host_identifier to the logs from the EnrollAgent endpoint, which will help us establish this ^
s

Sal

10/02/2020, 7:54 PM
we did resolve that in the past. not sure how we can find that out as fleet doesn't pass that info along from waht i recall.
7:55 PM
i was trying to log stderr to out splunk so i can review.
7:57 PM
@zwass
+--------------------------------+-----------------------+---------------------+
| @@global.transaction_isolation | @@global.tx_isolation | @@global.autocommit | +--------------------------------+-----------------------+---------------------+ | READ-COMMITTED | READ-COMMITTED | 1 | +--------------------------------+-----------------------+---------------------+ 1 row in set, 1 warning (0.00 sec)
s

Sal

10/02/2020, 8:16 PM
we're moving our fleet process out of the docker container so we can collect the fleet log via stderr/stdout via syslog. It is difficult for us to get it via "docker -f logs prosess"
zwass

zwass

10/02/2020, 8:24 PM
Sounds good. Fleet server doesn't care what environment it is run in as long as it has access to the MySQL and Redis servers.
8:23 PM
I've put up the PR https://github.com/kolide/fleet/pull/2318 that I think may also help with this scenario. I'll be cutting a new release soon and it will be helpful for you to upgrade and verify with that new log whether hosts are clobbering each other's enrollment (which will need to be addressed if it is a root cause), as well as trying to resolve the deadlock issues.
1:18 AM
@Sal What values are you setting for
mysql_max_open_conns
,
mysql_max_idle_conns
, and
mysql_conn_max_lifetime
?
1:29 AM
Additionally, what is the configured timeout on the MySQL server side for connections?
s

Sal

10/08/2020, 1:09 PM
@zwass Here are our setting.
kolide_mysql_max_open_conns: "120"
kolide_mysql_max_idle_conns: "120"

mysql> SHOW VARIABLES LIKE 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 3600  |
+---------------+-------+
1 row in set (0.01 sec)
Max Lifetime is default and or not available in Fleet 3.0
4:30 PM
@zwass what are you thoughts using https://github.com/gliderlabs/logspout to capture stderr from the fleet docker container? Unless you have some other options to collect fleet log from a docker container.
zwass

zwass

10/08/2020, 4:43 PM
I wasn't familiar with logspout, but I took a quick look. Do you need the additional functionality of it vs. standard docker logging drivers (https://docs.docker.com/config/containers/logging/configure/)? It is briefly discussed in https://github.com/gliderlabs/logspout/issues/383.
4:49 PM
Also
mysql_conn_max_lifetime
goes very far back in Fleet. Based on the commentary in https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/ it may no longer be necessary to set with the changes with the update to the MySQL driver in https://github.com/kolide/fleet/pull/2321. That will go out in the next release which I will be cutting shortly.
s

Sal

10/08/2020, 5:01 PM
Thanks @zwass we've tried the json drive and journald with no luck. As for the max_lifetime that would be whatever the default timer is then.
zwass

zwass

10/08/2020, 5:02 PM
For now try setting it to 30m or so. Then upgrade to the 3.2.0 I cut soon.
9:55 PM
When you have a chance please upgrade to 3.2.0 (now released) and let's see (1) whether the host_identifier field in the EnrollAgent log lines is repeating and (2) if the memory issue is still present.