# fleet
d
Hey everyone. Has anyone had issues when upgrading between 3.x versions? We've just upgraded from 3.9 to 3.12 in two environments, and both failed after running prepare db. The DB goes read-only and MySQL crashes.
Version: '5.7.12' socket: '/tmp/mysql.sock' port: 3306 MySQL Community Server (GPL)
21:20:21 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=3
max_threads=1000
thread_count=4
connection_count=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 544680 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (2b4de31a4a90): INSERT IGNORE INTO scheduled_query_stats ( scheduled_query_id, host_id, average_memory, denylisted, executions, schedule_interval, last_executed, output_size, system_time, user_time, wall_time ) VALUES ((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?)
Connection ID (thread ID): 3
Status: NOT_KILLED
The manual page at <http://dev.mysql.com/doc/mysql/en/crashing.html> contains
information that should help you find out what is causing the crash.
Writing a core file
Is it not possible to go from 3.9 to 3.12 because of the MySQL changes in 3.11?
b
so we tested stepping from 3.9 to 3.10, then 3.11, then 3.12, but we still hit the issue: 3.12 itself causes the DB to fall over
what we see is the following on a fleetd client:
2021-06-08T22:34:09.235157+00:00 <redacted> fleet[5699]: {"component":"http","err":"authentication error: finding host","level":"info","path":"/api/v1/osquery/log","ts":"2021-06-08T22:34:09.234995728Z"}
2021-06-08T22:34:09.271781+00:00 <redacted> fleet[5699]: {"component":"service","err":"authentication error: finding host","ip_addr":"127.0.0.1:35368","level":"info","method":"AuthenticateHost","took":"1.9962ms","ts":"2021-06-08T22:34:09.271652715Z","x_for_ip_addr":"<redacted>"}
....
2021-06-08T22:34:12.483990+00:00 <redacted> fleet[5699]: {"component":"service","err":"failed to save labels: insert label query executions: Error 1290: The MySQL server is running with the --read-only option so it cannot execute this statement","ip_addr":"127.0.0.1:35864","level":"info","method":"SubmitDistributedQueryResults","took":"35.482963ms","ts":"2021-06-08T22:34:12.483832798Z","x_for_ip_addr":"<redacted>"}
...
2021-06-08T22:34:18.760844+00:00 <redacted> fleet[5699]: [mysql] 2021/06/08 22:34:18 packets.go:36: read tcp <redacted>:44738->10.242.0.34:3306: read: connection reset by peer
2021-06-08T22:34:18.760936+00:00 <redacted> fleet[5699]: [mysql] 2021/06/08 22:34:18 packets.go:36: read tcp <redacted>:45138->10.242.0.34:3306: read: connection reset by peer
we get spammed with the first set of logs pasted above (the API auth errors), then the DB goes into read-only mode as it crashes, then it goes completely offline
we're using MySQL 5.7.x on AWS RDS (Aurora)
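To see which errors dominate before the DB drops out, the log lines can be tallied by their `err` field. This is a minimal sketch, assuming each line is a syslog prefix followed by one JSON object, as in the lines pasted above. The function name `tally_errors` and the trimmed sample lines are illustrative and not part of Fleet.

```python
import json
from collections import Counter


def tally_errors(lines):
    """Count each distinct "err" value found in syslog-wrapped JSON log lines."""
    counts = Counter()
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # e.g. the "[mysql] ... packets.go" lines carry no JSON
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip anything that is not a single JSON object
        if isinstance(record, dict) and record.get("err"):
            counts[record["err"]] += 1
    return counts


# Trimmed copies of the log lines above (fields shortened for the example).
sample = [
    '2021-06-08T22:34:09.235157+00:00 <redacted> fleet[5699]: '
    '{"component":"http","err":"authentication error: finding host","level":"info"}',
    '2021-06-08T22:34:09.271781+00:00 <redacted> fleet[5699]: '
    '{"component":"service","err":"authentication error: finding host","level":"info"}',
    '2021-06-08T22:34:12.483990+00:00 <redacted> fleet[5699]: '
    '{"component":"service","err":"failed to save labels: insert label query executions: '
    'Error 1290: The MySQL server is running with the --read-only option so it cannot '
    'execute this statement","level":"info"}',
    '2021-06-08T22:34:18.760844+00:00 <redacted> fleet[5699]: [mysql] 2021/06/08 22:34:18 '
    'packets.go:36: read tcp <redacted>:44738->10.242.0.34:3306: read: connection reset by peer',
]

for err, n in tally_errors(sample).most_common():
    print(n, err)
```

Run against the full log, this makes it easier to tell whether the auth errors start before or after the first read-only error.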
d
Thanks @buddwm. @zwass, have you ever seen anything like this?
Seems like some DB tables were dropped between versions. Shouldn't the prepare db command account for this?
I'm surprised there haven't been any comments from the community here. @buddwm, I think we should just open a GitHub issue for this.
ty 1
b
can someone give us an answer as to what this means?
fleet[23769]: {"component":"http","err":"authentication error: finding host","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2021-06-07T19:02:36.696209002Z"}
We see a lot of these messages after upgrading and are wondering whether they are related to the DB falling over
z
The MySQL server crashing like that is not something I've ever seen before, and we haven't heard of similar issues with other deployments.
It looks like 5.7.12 is about 5 years old... is that the latest version AWS offers? I'm not aware of us using any new features that would be unsupported (let alone crash the server), but possibly there are bugs that have been fixed since then.
b
we don't see this in a fresh deployment of 3.12
only in an upgrade scenario
we turned on general logging to try to understand what causes the DB to fall over, and I'm seeing queries like this:
Query (2b4de31a4a90): INSERT IGNORE INTO scheduled_query_stats ( scheduled_query_id, host_id, average_memory, denylisted, executions, schedule_interval, last_executed, output_size, system_time, user_time, wall_time ) VALUES ((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?),((SELECT sq.id FROM scheduled_queries sq JOIN packs p ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?),?,?,?,?,?,?,?,?,?,?)
Connection ID (thread ID): 3
does that look normal?
z
Yes, that looks normal.
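For anyone wondering why the same `VALUES` group repeats five times in that statement: it is the usual shape of a multi-row (batched) insert, with one parenthesized group per row. The sketch below only illustrates that pattern and is not Fleet's actual code; the helper `build_batch_insert` is hypothetical, and the table, columns, and subquery are copied from the logged query.

```python
def build_batch_insert(table, columns, n_rows, first_value_sql="?"):
    """Build a multi-row INSERT IGNORE with one placeholder group per row.

    first_value_sql is the SQL used for the first column of every row,
    e.g. a subquery that resolves an ID; the other columns get "?".
    """
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    row = "(" + ",".join([first_value_sql] + ["?"] * (len(columns) - 1)) + ")"
    return (
        f"INSERT IGNORE INTO {table} ( {', '.join(columns)} ) VALUES "
        + ",".join([row] * n_rows)
    )


columns = [
    "scheduled_query_id", "host_id", "average_memory", "denylisted",
    "executions", "schedule_interval", "last_executed", "output_size",
    "system_time", "user_time", "wall_time",
]
subquery = (
    "(SELECT sq.id FROM scheduled_queries sq JOIN packs p "
    "ON (sq.pack_id = p.id) WHERE p.name = ? AND sq.name = ?)"
)

sql = build_batch_insert("scheduled_query_stats", columns, 5, subquery)
print(sql)
```

Each row carries 12 placeholders (2 in the subquery plus 10 column values), so a 5-row batch like the logged one binds 60 parameters in a single statement.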
b
ok - just making sure, still combing through this. we might try the upgrade on MySQL 8.x RDS to see if we hit the same behavior
i'll report results once we have them, thanks @zwass
z
MySQL 8 should work. Most of our customers and open-source users are on 5.7. I see that RDS supports up to 5.7.33 and only back to 5.7.16. This, along with the age of 5.7.12, makes me think a newer version of 5.7 would do the trick as well.
👍 1
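A quick way to check a server version against a supported range like the one quoted above is to compare the versions as integer tuples. This is a small sketch; the helper names are made up for illustration, and the range bounds (5.7.16 to 5.7.33) are simply the ones mentioned in the message, not values queried from AWS.

```python
def parse_version(version):
    """Turn a dotted version such as '5.7.12' into a tuple of ints: (5, 7, 12)."""
    return tuple(int(part) for part in version.split("."))


def in_supported_range(version, oldest, newest):
    """Return True if oldest <= version <= newest, compared numerically."""
    return parse_version(oldest) <= parse_version(version) <= parse_version(newest)


# Bounds taken from the message above; the crash log reported 5.7.12.
print(in_supported_range("5.7.12", "5.7.16", "5.7.33"))
print(in_supported_range("5.7.33", "5.7.16", "5.7.33"))
```

Comparing tuples rather than raw strings matters here: as strings, "5.7.9" sorts after "5.7.12", which would give the wrong answer for single-digit patch releases.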
b
so we updated the Aurora engine from 2.07.2 to 2.10.0 and all is well
🤷‍♂️ 1
not sure of root cause but we'll take it
👍 1
@zwass do you know what this is?
{
  "component": "service",
  "err": "failed to ingest result: loading orphaned campaign: selecting distributed query campaign: sql: no rows in result set",
  "ip_addr": "127.0.0.1:40492",
  "level": "info",
  "method": "SubmitDistributedQueryResults",
  "took": "4.25624ms",
  "ts": "2021-06-10T20:17:33.405509079Z",
  "x_for_ip_addr": "<redacted>"
}
we're seeing the fleet service log that over and over after the upgrade. it doesn't appear to be causing issues at the UI level, but I'd like to stop the spam if possible
d
We have 3.12 working, but after upgrading to 3.13 the hosts page errors with an HTTP 500.
list hosts: Error 1054: Unknown column 'additional' in 'field list'
I think we'll just stick with 3.12, but I'm wondering if it's worth opening a GitHub issue so that the Fleet team can look at it. Not sure if it's somehow our issue or Fleet's.
z
That last error should not be resulting in any problems, though it is noisy. I believe we cleaned it up in 3.13.
Your 3.13 error looks like it is due to the DB migrations not having completed.
🙏 1
d
Thanks