we tried upgrading to v4 today and it did not go w...
# fleet
j
we tried upgrading to v4 today and it did not go well after upgrade, our mysql db keeps crashing on this query
Copy code
Query (2b66089b0a90): INSERT IGNORE INTO scheduled_query_stats (     scheduled_query_id,     host_id,     average_memory,     denylisted,     executions,     schedule_interval,     last_executed,     output_size,     system_time,     user_time,     wall_time    )    VALUES ((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?
t
hi Jocelyn, do you have logs you can share to see what the error is?
j
that is from the mysql error log
Copy code
2021-07-21T20:00:48.271896Z 0 [Note] Server socket created on IP: '::'.
2021-07-21T20:00:48.298284Z 0 [Warning] 'user' entry 'mysql.sys@localhost' ignored in --skip-name-resolve mode.
2021-07-21T20:00:48.298300Z 0 [Warning] 'user' entry 'rdsadmin@localhost' ignored in --skip-name-resolve mode.
2021-07-21T20:00:48.360779Z 0 [Note] Event Scheduler: Loaded 1 event
2021-07-21T20:00:48.360853Z 1 [Note] Event Scheduler: scheduler thread started with id 1
2021-07-21T20:00:48.467787Z 0 [Note] /rdsdbbin/oscar/bin/mysqld: ready for connections.
Version: '5.7.12'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Server (GPL)
20:00:51 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=84
max_threads=3000
thread_count=39
connection_count=38
connection_count=36
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1601274 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.


Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (2b85dd1b1a90): INSERT IGNORE INTO scheduled_query_stats (     scheduled_query_id,     host_id,     average_memory,     denylisted,     executions,     schedule_interval,     last_executed,     output_size,     system_time,     user_time,     wall_time    )    VALUES ((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?
Connection ID (thread ID): 124
Status: NOT_KILLED

The manual page at <http://dev.mysql.com/doc/mysql/en/crashing.html> contains
information that should help you find out what is causing the crash.
Writing a core file
it would crash on that as soon as it started
we have over 20k hosts, so I suspect when it's trying to do the initial data population for scheduled_query_stats it's too much for the db to handle
t
would you be able to try with mysql 5.7.35? I don't recall any crashes like this one before. I would be surprised if a bug in fleet could cause that
do you have CPU/memory metrics for the database?
j
we're using an AWS Aurora mysql global db
cpu, mem, and connections were all fine
it was literally just dying on that query over and over
the scheduled_query_stats table doesn't exist in 3.10, it looks like when 4.0 starts it tries to populate data into that new table?
so what is scheduled_query_stats for, and can we disable it somehow, for testing?
t
it's updated at different points as the hosts do requests against fleet for different things
j
why start storing stats in the db?
t
I would have to get back to you on that, as I wasn't involved in that
j
fair enough
I'd just love to get v4 working
t
It looks like the LTS version of aurora is 2.07, and based on the logs you are in at most 2.03.1
would you be able to test upgrading to 2.07?
j
we're on
Copy code
Engine version
5.7.mysql_aurora.2.07.2
Copy code
Instance class
db.r5.2xlarge
vCPU
8
RAM
64 GB
t
huh, weird, the log shows
Version: '5.7.12'  socket: '/tmp/mysql.sock'  port: 3306
which according to the logs
For Aurora MySQL 2.x, all versions 2.03.1 and lower are represented by the engine version 5.7.12.
let me do some digging and will get back to you
so it seems others have encountered similar issues, and they solved it by upgrading to 2.10 https://osquery.slack.com/archives/C01DXJL16D8/p1623355865176100?thread_ts=1623178672.165300&amp;cid=C01DXJL16D8
j
Copy code
mysql> select aurora_version(), @@aurora_version;
+------------------+------------------+
| aurora_version() | @@aurora_version |
+------------------+------------------+
| 2.07.2           | 2.07.2           |
+------------------+------------------+
1 row in set (0.00 sec)
sweet, we'll give that a try tomorrow
n
@Jocelyn Bothe thank you for your patience and help troubleshooting the upgrade. Adding context to your earlier question:
why start storing stats in the db?
Fleet 3.12.0 introduced scheduled query information (“Name”, “Description”, “Frequency”, and “Last run”) to the Host details page. This way you can verify if, and when, a scheduled query ran successfully against a specific host without having to check osquery status logs or a configured log destination. In the coming months, the Fleet team plans to surface more scheduled query performance info in the Fleet UI. This UI will likely use
scheduled_query_stats
.
j
thanks for the info!
Copy code
During our investigation, we identified the source of these failovers due to following sql statement:
INSERT IGNORE INTO scheduled_query_stats (...) VALUES (SELECT ... JOIN ...)(...)
This is a known Mysql Community edition bug and is already fixed in Aurora with the 2.09.1 release.  We recommend you to upgrade your cluster to 2.09.1 at your earliest convenience.
In the meantime, if possible could you avoid running this query to prevent further crashes.  You may instead modify your sql statement to insert single row at a time to avoid this issue.
might be worth adding a note to your upgrade guide to mention the new requirement for mysql version
💯 1