# fleet
t
Hello, our Fleet service keeps killing itself after starting the service and rebooting the server. We are thinking the issue is memory utilization. We increased the overall RAM on the server. Does Fleet use a JVM or something else where we can configure the min & max memory? We think the min (Xms) might be set to the max RAM, but we don't know where those settings would be. Thank you
k
Hi, @Terra! There aren't any settings on the Fleet side for memory usage; it will just use what's available to it. What version of Fleet are you running? Have you made any recent changes (added a lot more hosts, enabled software inventory, or changed the infrastructure)? Are you seeing any errors in Fleet?
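If you're not sure of the version, you can usually check it right on the server, assuming the fleet binary is on your PATH:
fleet version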
t
Hi @Kathy Satterlee! We made no changes to it other than increasing RAM to 8 GB this morning. We were informed yesterday that someone ran a very large query and the Prod server broke after that. Upon logging in, we saw the fleet service had stopped. After running
systemctl start fleet.service
memory utilization keeps spiking until the service kills itself. I checked the journalctl logs, and there are only authentication entries like:
{"component":"http","err":"authentication error: find host: context canceled","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-10-13T13:50:30.147331381Z"}
The Non-Prod server is just fine.
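For reference, this is what I'm using to watch the service and pull the recent log entries (the unit name matches our install; adjust as needed):
sudo systemctl status fleet.service
sudo journalctl -u fleet.service --since "1 hour ago" --no-pager | tail -n 50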
k
Usually, "context cancelled" errors point to timeouts or lock issues with the database. Can you run the following on your database?
show engine innodb status;
show processlist;
And what version of Fleet are you running?
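If it's easier, those can be run non-interactively with the mysql client, assuming you have credentials for a user with enough privileges:
mysql -u root -p -e "SHOW ENGINE INNODB STATUS\G"
mysql -u root -p -e "SHOW PROCESSLIST;"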
t
Unfortunately I can't run those commands. The person who set up our osquery servers left the company, and I don't know where the admin credentials are. I can log in as the kolide user (the info is in /etc/kolide.yaml), but the kolide user is just a regular user.
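The database section of that file looks roughly like this (values below are placeholders, not our real credentials):
mysql:
  address: 127.0.0.1:3306
  database: kolide
  username: kolide
  password: ********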
We went from v3.11.0 to v4.17.1 in August. We have not updated it since (I know, we need to)
k
Ouch! Sorry, that's a tough one. It's going to be kind of hard to dig in too much without that access. If it's crashing now anyway, it might be a good idea to redeploy in an environment that you have a little more control over while getting things updated.
t
| InnoDB |      |
=====================================
2022-10-13 12:43:05 0x7fac71af3700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 64 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 63 srv_idle
srv_master_thread log flush and writes: 63
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 4
OS WAIT ARRAY INFO: signal count 4
RW-shared spins 0, rounds 6, OS waits 3
RW-excl spins 0, rounds 0, OS waits 0
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 6.00 RW-shared, 0.00 RW-excl, 0.00 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 9477104131
Purge done for trx's n:o < 9477102680 undo n:o < 0 state: running but idle
History list length 2
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421853972091160, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421853972086936, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
 ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
2290 OS file reads, 137 OS file writes, 6 OS fsyncs
35.78 reads/s, 17293 avg bytes/read, 2.14 writes/s, 0.09 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
0.00 hash searches/s, 10.69 non-hash searches/s
---
LOG
---
Log sequence number 3205314557869
Log flushed up to   3205314557869
Pages flushed up to 3205314557869
Last checkpoint at  3205314557860
0 pending log flushes, 0 pending chkp writes
10 log i/o's done, 0.16 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137494528
Dictionary memory allocated 168192
Buffer pool size   8191
Free buffers       5871
Database pages     2320
Old database pages 876
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 2189, created 131, written 132
34.20 reads/s, 2.05 creates/s, 2.06 writes/s
Buffer pool hit rate 539 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 2320, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Process ID=9573, Main thread ID=140378519127808, state: sleeping
Number of rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
I was able to reset the root password to run those commands.
After changing the admin password on MariaDB, the fleet service now gives me an error when starting (and then kills itself after failing to connect):
{
  "mysql": "could not connect to db: dial tcp 127.0.0.1:3306: connect: connection refused, sleeping 0s",
  "ts": "2022-10-13T19:55:01.604501336Z"
}
I do not know where the original admin credentials are stored or where to change this so Fleet can connect to the db again...
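For what it's worth, this is how I'm checking whether MariaDB is actually up and listening on 3306 after the reset (the unit name may differ on other installs):
sudo systemctl status mariadb --no-pager
ss -ltn | grep ':3306'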
k
When Fleet is launched, the config is passed either as command-line flags, via a config file, or as environment variables.
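For example, the MySQL connection settings can be supplied in any of these equivalent ways (values are placeholders):
# environment variables
FLEET_MYSQL_ADDRESS=127.0.0.1:3306 FLEET_MYSQL_USERNAME=kolide FLEET_MYSQL_PASSWORD=******** fleet serve
# command-line flags
fleet serve --mysql_address=127.0.0.1:3306 --mysql_username=kolide --mysql_password=********
# or the mysql: section of a config file passed with --config (e.g. /etc/kolide.yaml)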
One big note there, though: while MariaDB should work as a drop-in replacement for MySQL, we don't officially support it, and there have been instances of odd behavior.
t
Every time I restart fleet, it's a different list of internal IP addresses responding with the TLS error. No idea why though!
k
Ooh... That's making it look like the root cause might actually be something in the TLS setup. Are you using an RSA cert?
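You can check what the cert is using with openssl (adjust the path to wherever the server certificate lives):
openssl x509 -in /path/to/fleet.crt -noout -text | grep 'Public Key Algorithm'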
t
The weird thing is, these TLS errors did not show up before I followed this guide to reset the admin password: https://www.digitalocean.com/community/tutorials/how-to-reset-your-mysql-or-mariadb-root-password-on-ubuntu-20-04
So I feel like I just caused a new issue by doing that... And the memory is still skyrocketing as soon as the service starts up.
htop is showing 85% usage on the Prod server, whereas on the Non-Prod server fleet is only at 2%... We are using DigiCert for both environments.
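Here's how I'm comparing the fleet process memory on the two boxes, assuming the process name is just fleet:
watch -n 5 'ps -o pid,rss,%mem,cmd -C fleet'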
b
Is there any chance you can scale the fleet server horizontally rather than vertically by running more instances of fleet behind a reverse proxy/load balancer like HAProxy? How many hosts are in your deployment?
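A rough sketch of what that could look like in haproxy.cfg (addresses, ports, and cert path here are made up, and Fleet's own TLS settings would need to match):
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend fleet_front
    bind *:443 ssl crt /etc/haproxy/certs/fleet.pem
    default_backend fleet_back

backend fleet_back
    balance roundrobin
    server fleet1 10.0.0.11:8080 check
    server fleet2 10.0.0.12:8080 check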
t
I'm not sure. We have 600+ hosts in both Prod and Non-Prod. There's a Non-Prod host for almost every Prod host.