Hey guys, my Fleet process has started acting weird lately. It starts, takes over all of the CPU and RAM, and then stops after a few minutes.
I have about 15,000 hosts and this has never happened before. Any ideas?
11/10/2020, 2:42 PM
What version of Fleet are you running?
11/10/2020, 2:55 PM
Currently 3.0.0, planning on upgrading to 3.3.0 in the near future
11/10/2020, 3:21 PM
Do your logs by chance include many requests to the EnrollAgent endpoint?
11/10/2020, 3:27 PM
11/10/2020, 3:30 PM
Is that expected for you? Do you have quite a few new agents enrolling?
If not, is it possible you've deployed a number of hosts with the same hardware UUID? Perhaps by copying a VM?
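If you have an inventory export handy, a quick check for shared hardware UUIDs could look roughly like this. It's only a sketch: the `inventory` rows and the `find_shared_uuids` helper are made up for illustration and are not part of Fleet or osquery.

```python
from collections import defaultdict


def find_shared_uuids(inventory):
    """Return {uuid: sorted hostnames} for UUIDs reported by more than one hostname."""
    by_uuid = defaultdict(set)
    for hostname, hardware_uuid in inventory:
        # Normalize case so the same UUID is not counted twice.
        by_uuid[hardware_uuid.lower()].add(hostname)
    return {u: sorted(names) for u, names in by_uuid.items() if len(names) > 1}


# Hypothetical (hostname, hardware UUID) rows, e.g. from a CMDB export.
inventory = [
    ("web-01", "4C4C4544-0000-1010-8000-000000000001"),
    ("web-02", "4C4C4544-0000-1010-8000-000000000001"),  # cloned VM
    ("db-01", "4C4C4544-0000-1010-8000-000000000002"),
]
duplicates = find_shared_uuids(inventory)
```

Any UUID that shows up in `duplicates` is a candidate for the cloned-VM situation.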
11/10/2020, 3:38 PM
That might be the case. Is that the cause?
11/10/2020, 3:49 PM
We saw something similar with another user. The problem is that enrollment is a fairly expensive operation, and if multiple hosts appear to Fleet to be the same host, they will continually overwrite each other's enrollment.
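To make the overwrite loop concrete, here is a toy model of it. This is not Fleet's actual code; `ToyEnrollStore` and `simulate` are hypothetical names, and the only assumption is that the server keeps one enrollment record per host identifier.

```python
class ToyEnrollStore:
    """Toy model: one enrollment record per host identifier (not Fleet's real code)."""

    def __init__(self):
        self.node_key_by_identifier = {}
        self.enroll_count = 0

    def enroll(self, identifier):
        # Enrolling replaces any existing record for the same identifier.
        self.enroll_count += 1
        key = f"key-{self.enroll_count}"
        self.node_key_by_identifier[identifier] = key
        return key

    def is_valid(self, identifier, node_key):
        return self.node_key_by_identifier.get(identifier) == node_key


def simulate(identifiers, rounds):
    """Each host checks in once per round and re-enrolls if its key was replaced."""
    store = ToyEnrollStore()
    keys = [store.enroll(i) for i in identifiers]
    for _ in range(rounds):
        for n, identifier in enumerate(identifiers):
            if not store.is_valid(identifier, keys[n]):
                keys[n] = store.enroll(identifier)
    return store.enroll_count


unique = simulate(["uuid-a", "uuid-b"], rounds=10)  # two distinct hosts
cloned = simulate(["uuid-a", "uuid-a"], rounds=10)  # two hosts sharing a UUID
```

In this model the two distinct hosts enroll once each, while the two hosts sharing a UUID re-enroll on every check-in, which is the kind of churn that gets expensive across many hosts.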
Here are some notes from that conversation:
Status Quo (host_identifier=uuid)
- Works until hosts have the same UUID. Seems to be an issue in your (current) environment.
- Not viable in your (current) environment due to hosts overwriting enrollment.
Switch to host_identifier=instance
- A new, osquery-specific UUID will be generated and stored in the osquery DB for each host
- Works until a VM image is copied with the osquery DB already initialized (though host_identifier=uuid will fail in the same way)
- Changing this now will cause Fleet to see every host as a fresh enrollment, leading to a single duplicate for each host in Fleet. The duplicates will have to be cleaned up later (though this can be automated with the host_expiry setting in Fleet).
Redeploy offending hosts with properly reset UUIDs
- No idea if this is viable for your situation, but if the duplicate issue described above seems worse than doing this, it is worth considering
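For the one-time duplicates mentioned in the notes above, expiry-based cleanup boils down to something like the sketch below. The record layout, the `expired_hosts` helper, and the 14-day window are all assumptions for illustration; Fleet's host_expiry setting does this for you and its internals may differ.

```python
from datetime import datetime, timedelta


def expired_hosts(hosts, now, window_days):
    """Return identifiers of hosts not seen within the expiry window."""
    cutoff = now - timedelta(days=window_days)
    return [h["identifier"] for h in hosts if h["last_seen"] < cutoff]


now = datetime(2020, 11, 30)
hosts = [
    # Old record keyed by hardware UUID; it stopped checking in after the switch.
    {"identifier": "hardware-uuid-1", "last_seen": datetime(2020, 11, 10)},
    # New record for the same machine, keyed by the osquery instance UUID.
    {"identifier": "instance-uuid-1", "last_seen": datetime(2020, 11, 30)},
]
stale = expired_hosts(hosts, now, window_days=14)
```

Only the old, no-longer-reporting record falls outside the window, so the fresh enrollment is kept.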
11/10/2020, 3:54 PM
Right, I’ll look into it. Thanks!
11/10/2020, 4:54 PM
Please let me know how that goes. Of course we also need to fix Fleet to alert the user and not fall over in this situation.
@Alon Starikov we've pushed a cooldown period for host enrollment in Fleet 3.5.0 that is likely to resolve the issue for you. If you have a chance before upgrading, we would really appreciate a debug archive. It's easy to do and may help us prevent similar problems in the future.
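Roughly speaking, an enrollment cooldown works like the sketch below. This is a simplified illustration, not the code shipped in Fleet 3.5.0; the `EnrollCooldown` class and the 60-second window are hypothetical.

```python
class EnrollCooldown:
    """Reject repeat enrollments for one identifier inside a cooldown window."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self.last_enroll = {}  # identifier -> time of last accepted enrollment

    def allow(self, identifier, now):
        last = self.last_enroll.get(identifier)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down, skip the expensive enrollment
        self.last_enroll[identifier] = now
        return True


limiter = EnrollCooldown(cooldown_seconds=60)
results = [
    limiter.allow("uuid-a", now=0),   # first enrollment is accepted
    limiter.allow("uuid-a", now=5),   # same identifier, too soon
    limiter.allow("uuid-b", now=5),   # different identifier is unaffected
    limiter.allow("uuid-a", now=61),  # window has passed
]
```

The idea is that hosts sharing an identifier can no longer trigger an enrollment on every check-in, so the server stays up even though the duplicates still need fixing.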
12/12/2020, 10:33 AM
Apologies, I won’t be able to get around to it this week unfortunately...
I will try to get it done as soon as I can.
host_identifier=instance actually seems to do the trick for me, though. I haven’t encountered any problems since changing this setting.