Can anyone provide some assistance or guidance on MYSQL dead osquery #kolide

Can anyone provide some assistance or guidance on ...

Sal

09/30/2020, 3:44 PM

Can anyone provide some assistance or guidance on MYSQL deadlock/locks. We have 60 fleet host connecting to a MYSQL percona cluster of 4 hosts. For the past two days we've been getting deadlocks/lock on MYSQL that seem to increase overtime and cause the ALL the fleet hosts to eventually exhaust memory causing a denial of service. Fleet becomes unresponsive and subsequently host fail to report in. We are running fleet version 3.0.0-3-g0619581b. Our DBA folks seem to think it's an application issue with fleet hitting the hosts table.

Copy code

DBA team response: 
Essentially this is not db server tuning issue, but an app tuning issue. I understand this needs to be related to the vendor, however as you can see from the Humio links inserts & updates are clashing on the 'hosts' table. The correct approach is to minimize these deadlock situations arising in the first place. The engine will detect and roll back 1 transaction - eg in detail:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2020-09-30 09:46:45 0x7de11cce9700
*** (1) TRANSACTION:
TRANSACTION 35762332832, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 1157846, OS thread handle 139400941860608, query id 14939859382 <http://prod-10-32-7-83.pw1.bcc.somehost.com|prod-10-32-7-83.pw1.bcc.somehost.com> 10.32.7.83 tlsproddb_user updating
UPDATE hosts SET
			seen_time = '2020-09-30 13:46:45.647691'
		WHERE node_key='6u8qphrXq6FhlWXYZfYWHT2xwBHX+SsN'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 771 page no 7 n bits 88 index PRIMARY of table `tlsproddb`.`hosts` trx id 35762332832 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 35762332830, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 1157730, OS thread handle 138405804414720, query id 14939859380 <http://prod-10-32-1-12.pw1.bcc.somehost.com|prod-10-32-1-12.pw1.bcc.somehost.com> 10.32.1.12 tlsproddb_user update
INSERT INTO hosts (
			detail_update_time,
			label_update_time,
			osquery_host_id,
			seen_time,
			node_key,
			enroll_secret_name
		) VALUES ('1970-01-02 00:00:00', '1970-01-02 00:00:00', 'DB938000-FD74-11E7-8000-000000000000', '2020-09-30 13:46:45.647665', 'WqAC/7dPeDzkh90b5yb4TIIvvJrgd4ut', 'default')
		ON DUPLICATE KEY UPDATE
			node_key = VALUES(node_key),
			deleted = FALSE
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 771 page no 7 n bits 88 index PRIMARY of table `tlsproddb`.`hosts` trx id 35762332830 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 771 page no 3489 n bits 248 index idx_host_unique_nodekey of table `tlsproddb`.`hosts` trx id 35762332830 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (1)

Any help on this matter would be greatly appreciated.

zwass

10/01/2020, 7:02 PM

Is this connected to your earlier report https://osquery.slack.com/archives/C1XCLA5DZ/p1601375347041700? Do you have a number of new hosts enrolling at this time? What is the total number of enrolled hosts?

Adam

10/02/2020, 7:17 AM

the number of hosts enrolled has remained fairly static, which is roughly around 50k

zwass

10/02/2020, 7:03 PM

Can you run

select @@global.transaction_isolation, @@global.tx_isolation, @@global.autocommit;

in your db?

zwass

10/02/2020, 7:49 PM

Do you see errors like

ts=2020-10-02T19:48:16.182951Z component=http err="authentication error: invalid node key: g95Bpj/mvBjjmYoJqEbCwXK3rt9UJiX4"

near to the

method=EnrollAgent

lines in the Fleet stderr logs?

Sal

10/02/2020, 7:50 PM

Yea, i believe so.

Sal

10/02/2020, 7:51 PM

just prior to these messages.

zwass

10/02/2020, 7:51 PM

Is there a chance you have hosts with the same host_identifier from osquery? Perhaps because of a cloned VM image or something like that?

zwass

10/02/2020, 7:54 PM

I'm going to add host_identifier to the logs from the EnrollAgent endpoint, which will help us establish this ^

Sal

10/02/2020, 7:54 PM

we did resolve that in the past. not sure how we can find that out as fleet doesn't pass that info along from waht i recall.

Sal

10/02/2020, 7:55 PM

i was trying to log stderr to out splunk so i can review.

Sal

10/02/2020, 7:57 PM

@zwass

+--------------------------------+-----------------------+---------------------+

zwass

10/02/2020, 8:10 PM

https://github.com/kolide/fleet/pull/2313

👍 1

Sal

10/02/2020, 8:16 PM

we're moving our fleet process out of the docker container so we can collect the fleet log via stderr/stdout via syslog. It is difficult for us to get it via "docker -f logs prosess"

zwass

10/02/2020, 8:24 PM

Sounds good. Fleet server doesn't care what environment it is run in as long as it has access to the MySQL and Redis servers.

zwass

10/06/2020, 8:23 PM

I've put up the PR https://github.com/kolide/fleet/pull/2318 that I think may also help with this scenario. I'll be cutting a new release soon and it will be helpful for you to upgrade and verify with that new log whether hosts are clobbering each other's enrollment (which will need to be addressed if it is a root cause), as well as trying to resolve the deadlock issues.

👍 1

zwass

10/08/2020, 1:18 AM

@Sal What values are you setting for

mysql_max_open_conns

mysql_max_idle_conns

, and

mysql_conn_max_lifetime

zwass

10/08/2020, 1:29 AM

Additionally, what is the configured timeout on the MySQL server side for connections?

Sal

10/08/2020, 1:09 PM

@zwass Here are our setting.

Copy code

kolide_mysql_max_open_conns: "120"
kolide_mysql_max_idle_conns: "120"

mysql> SHOW VARIABLES LIKE 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 3600  |
+---------------+-------+
1 row in set (0.01 sec)

Max Lifetime is default and or not available in Fleet 3.0

Sal

10/08/2020, 4:30 PM

@zwass what are you thoughts using https://github.com/gliderlabs/logspout to capture stderr from the fleet docker container? Unless you have some other options to collect fleet log from a docker container.

zwass

10/08/2020, 4:43 PM

I wasn't familiar with logspout, but I took a quick look. Do you need the additional functionality of it vs. standard docker logging drivers (https://docs.docker.com/config/containers/logging/configure/)? It is briefly discussed in https://github.com/gliderlabs/logspout/issues/383.

zwass

10/08/2020, 4:49 PM

Also

mysql_conn_max_lifetime

goes very far back in Fleet. Based on the commentary in https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/ it may no longer be necessary to set with the changes with the update to the MySQL driver in https://github.com/kolide/fleet/pull/2321. That will go out in the next release which I will be cutting shortly.

Sal

10/08/2020, 5:01 PM

Thanks @zwass we've tried the json drive and journald with no luck. As for the max_lifetime that would be whatever the default timer is then.

zwass

10/08/2020, 5:02 PM

For now try setting it to 30m or so. Then upgrade to the 3.2.0 I cut soon.

zwass

10/08/2020, 9:55 PM

When you have a chance please upgrade to 3.2.0 (now released) and let's see (1) whether the host_identifier field in the EnrollAgent log lines is repeating and (2) if the memory issue is still present.

👍 1

2 Views

Open in Slack

Previous Next