#fleet

Alon Starikov

11/10/2020, 12:33 PM
Hey guys, my Fleet process has started acting weird lately. It starts, takes over all of the CPU and RAM, and then stops after a few minutes. I have about 15,000 hosts and this has never happened before. Any ideas?

zwass

11/10/2020, 2:42 PM
What version of Fleet are you running?

Alon Starikov

11/10/2020, 2:55 PM
Currently 3.0.0, planning on upgrading to 3.3.0 in the near future

zwass

11/10/2020, 3:21 PM
Do your logs by chance include many requests to the EnrollAgent endpoint?

Alon Starikov

11/10/2020, 3:27 PM
Yes

zwass

11/10/2020, 3:30 PM
Is that expected for you? Do you have quite a few new agents enrolling?
3:31 PM
If not, is it possible you've deployed a number of hosts with the same hardware UUID? Perhaps by copying a VM?

Alon Starikov

11/10/2020, 3:38 PM
That might be the case. Is that the cause?

zwass

11/10/2020, 3:49 PM
We saw something similar with another user. The problem is that enrollment is a fairly expensive operation, and if multiple hosts appear to Fleet to be the same host, they will continually overwrite each other's enrollment.
3:51 PM
Here are some notes from that conversation:
Status Quo (host_identifier=uuid):
- Works until hosts have the same UUID. Seems to be an issue in your (current) environment.
- Not viable in your (current) environment due to hosts overwriting enrollment.
host_identifier=instance:
- A new, osquery-specific UUID will be generated and stored in the osquery DB for each host.
- Works until a VM image is copied with the osquery DB already initialized (though host_identifier=uuid will fail in the same way).
- Changing this now will cause Fleet to see every host as a fresh enrollment, leading to a single duplicate for each host in Fleet. The duplicates will have to be cleaned up later (though this can be automated with the host_expiry setting in Fleet).
Redeploy offending hosts with properly reset UUIDs:
- No idea if this is viable for your situation, but if the duplicate issue described above seems worse than doing this, it is worth considering.
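
To make the overwrite problem concrete, here is a toy model in Python. It is only a sketch and not Fleet's actual enrollment code: the `simulate_enrollments` function, the host records, and the `uuid` / `instance` field names are all hypothetical. It replays a few rounds of check-ins and counts how often one host takes over another host's enrollment record, first keyed by the hardware UUID and then by a per-install identifier.

```python
def simulate_enrollments(hosts, identifier_field):
    """Replay enroll requests and count how often a stored enrollment is overwritten.

    `hosts` is a list of dicts with hypothetical keys "name", "uuid"
    (hardware UUID) and "instance" (per-install osquery UUID).
    Returns (number of distinct enrollment records, number of overwrites).
    """
    enrolled = {}  # identifier -> name of the host that currently owns the record
    overwrites = 0
    # Three rounds of check-ins: a host whose record was taken over by
    # another host has to enroll again, which is the expensive step.
    for _ in range(3):
        for host in hosts:
            key = host[identifier_field]
            owner = enrolled.get(key)
            if owner != host["name"]:
                if owner is not None:
                    overwrites += 1  # a different host held this record
                enrolled[key] = host["name"]
    return len(enrolled), overwrites


hosts = [
    {"name": "vm-a", "uuid": "AAAA", "instance": "i-1"},
    {"name": "vm-b", "uuid": "AAAA", "instance": "i-2"},  # cloned VM, same hardware UUID
    {"name": "vm-c", "uuid": "CCCC", "instance": "i-3"},
]

print(simulate_enrollments(hosts, "uuid"))      # (2, 5): two records, constant re-enrollment
print(simulate_enrollments(hosts, "instance"))  # (3, 0): one record per host, no overwrites
```

In this model the two cloned VMs keep displacing each other for as long as they check in, while the per-install identifier gives each host its own record.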

Alon Starikov

11/10/2020, 3:54 PM
Right, I’ll look into it. Thanks!

zwass

11/10/2020, 4:54 PM
Please let me know how that goes. Of course we also need to fix Fleet to alert the user and not fall over in this situation.
10:22 PM
@Alon Starikov are you still encountering this issue? Would it be possible for you to generate a debug archive so that I can try to understand what is going on (https://github.com/fleetdm/fleet/blob/master/docs/infrastructure/performance.md#generate-debug-archive-fleet-340)? I am going to implement a fix that will rate limit enrollment but I'd also really like to debug the issue that is being triggered before that is fixed.
5:19 PM
@Alon Starikov we've pushed a cooldown period for host enrollment in Fleet 3.5.0 that is likely to resolve the issue for you. If you have a chance before upgrading we would really appreciate a debug archive. It's easy to do and may help us prevent similar problems in the future.
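
For readers wondering what an enrollment cooldown could look like, here is a minimal Python sketch. It is an assumption about the general technique, not the implementation that shipped in Fleet 3.5.0: the `EnrollCooldown` class, its `allow` method, and the 60-second window are all made up for illustration. The idea is to remember when each host identifier last enrolled and refuse repeat enrollments inside the window, so hosts sharing an identifier cannot trigger the expensive path in a tight loop.

```python
import time


class EnrollCooldown:
    """Reject repeat enrollments for the same identifier within a cooldown window."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock  # injectable so the example can use a fake clock
        self.last_enroll = {}  # host identifier -> time of last accepted enrollment

    def allow(self, identifier):
        now = self.clock()
        last = self.last_enroll.get(identifier)
        if last is not None and now - last < self.cooldown:
            return False  # still cooling down; skip the expensive enroll work
        self.last_enroll[identifier] = now
        return True


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
limiter = EnrollCooldown(60, clock=lambda: fake_now[0])

first = limiter.allow("AAAA")    # accepted: first enrollment for this identifier
second = limiter.allow("AAAA")   # rejected: same identifier inside the window
fake_now[0] = 61.0
third = limiter.allow("AAAA")    # accepted: the window has passed
print(first, second, third)      # True False True
```

A limiter like this only caps the load. Hosts that share an identifier would still collide, which is why changing host_identifier or redeploying the cloned hosts remains the underlying fix.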

Alon Starikov

12/12/2020, 10:33 AM
Apologies, I won’t be able to get around to it this week unfortunately... I will try to get it done as soon as I can. host_identifier=instance actually seems to do the trick for me though, I haven’t encountered any problems since changing this setting