Hey guys my fleet process started acting weird lately It sta osquery #fleet

Alon Starikov

11/10/2020, 12:33 PM

Hey guys, my fleet process started acting weird lately. It starts then takes over all of the CPU and RAM and stops after a few minutes. I have about 15000 hosts and this has never happened before, any ideas?

zwass

11/10/2020, 2:42 PM

What version of Fleet are you running?

Alon Starikov

11/10/2020, 2:55 PM

Currently 3.0.0, planning on upgrading to 3.3.0 in the near future

zwass

11/10/2020, 3:21 PM

Do your logs by chance include many requests to the EnrollAgent endpoint?

Alon Starikov

11/10/2020, 3:27 PM

Yes

zwass

11/10/2020, 3:30 PM

Is that expected for you? Do you have quite a few new agents enrolling?

zwass

11/10/2020, 3:31 PM

If not, is it possible you've deployed a number of hosts with the same hardware UUID? Perhaps by copying a VM?

Alon Starikov

11/10/2020, 3:38 PM

That might be the case, is that the cause?

zwass

11/10/2020, 3:49 PM

We saw similar with another user. The problem is that enrollment is a bit of an expensive operation and if there are multiple hosts that appear to be the same host to Fleet they will continually overwrite the enrollment.

zwass

11/10/2020, 3:51 PM

Here are some notes from that conversation:

Status Quo (host_identifier=uuid)
- Works until hosts have the same UUID. Seems to be an issue in your (current) environment.
- Not viable in your (current) environment due to hosts overwriting enrollment.

host_identifier=instance
- A new, osquery-specific UUID will be generated and stored in the osquery DB for each host
- Works until a VM image is copied with the osquery DB already initialized (though host_identifier=uuid will fail in the same way)
- Changing this now will cause Fleet to see every host as a fresh enrollment, leading to a single duplicate for each host in Fleet. The duplicates will have to be cleaned up later (though this can be automated with the host_expiry setting in Fleet).

Redeploy offending hosts with properly reset UUIDs
- No idea if this is viable for your situation, but if the duplicate issue described above seems worse than doing this, it is worth considering
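
To make the overwrite problem in these notes concrete, here is a toy model in Python. It is not Fleet's code. It assumes the server keeps one enrollment record per host identifier and issues a fresh node key on every enroll, which invalidates the previous key for that identifier. `EnrollmentTable` and `simulate` are hypothetical names used only for illustration.

```python
import uuid


class EnrollmentTable:
    """Toy stand-in for a server that keeps one enrollment per host identifier."""

    def __init__(self):
        self.node_keys = {}    # host identifier -> currently valid node key
        self.enroll_count = 0  # total enroll requests served

    def enroll(self, identifier):
        # Assumption: enrolling replaces any existing record for this identifier.
        self.enroll_count += 1
        key = uuid.uuid4().hex
        self.node_keys[identifier] = key
        return key

    def is_valid(self, identifier, key):
        return self.node_keys.get(identifier) == key


def simulate(identifiers, rounds=3):
    """Each host checks in once per round and re-enrolls if its key was replaced.

    Returns (total enroll requests, number of host records on the server).
    """
    table = EnrollmentTable()
    keys = [table.enroll(ident) for ident in identifiers]
    for _ in range(rounds):
        for n, ident in enumerate(identifiers):
            if not table.is_valid(ident, keys[n]):
                keys[n] = table.enroll(ident)
    return table.enroll_count, len(table.node_keys)


# Three cloned VMs reporting the same hardware UUID keep evicting each other.
print(simulate(["same-hw-uuid"] * 3))          # (12, 1)
# Three hosts with distinct identifiers enroll once each and stay enrolled.
print(simulate(["id-a", "id-b", "id-c"]))      # (3, 3)
```

In this model the cloned hosts generate a new enroll request on every check-in and the server only ever holds one record for them, while hosts with distinct identifiers enroll once. With enrollment being expensive, that churn is at least consistent with the CPU and RAM symptoms described earlier in the thread.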

🍻 1

Alon Starikov

11/10/2020, 3:54 PM

Right, I’ll look into it. Thanks!

zwass

11/10/2020, 4:54 PM

Please let me know how that goes. Of course we also need to fix Fleet to alert the user and not fall over in this situation.

zwass

12/07/2020, 10:22 PM

@Alon Starikov are you still encountering this issue? Would it be possible for you to generate a debug archive so that I can try to understand what is going on (https://github.com/fleetdm/fleet/blob/master/docs/infrastructure/performance.md#generate-debug-archive-fleet-340)? I am going to implement a fix that will rate limit enrollment but I'd also really like to debug the issue that is being triggered before that is fixed.

zwass

12/11/2020, 5:19 PM

@Alon Starikov we've pushed a cooldown period for host enrollment in Fleet 3.5.0 that is likely to resolve the issue for you. If you have a chance before upgrading we would really appreciate a debug archive. It's easy to do and may help us prevent similar problems in the future.
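
For readers wondering what an enrollment cooldown might look like, here is a minimal sketch in Python. The thread does not say how Fleet 3.5.0 implements it or how long the period is, so the class name `EnrollCooldown`, the per-identifier bookkeeping, and the 60 second value are all assumptions for illustration.

```python
class EnrollCooldown:
    """Reject repeat enrollments for the same identifier within a cooldown window."""

    def __init__(self, cooldown_seconds):
        self.cooldown = cooldown_seconds
        self.last_enroll = {}  # host identifier -> time of last accepted enrollment

    def allow(self, identifier, now):
        """Return True and record the enrollment if the identifier is not cooling down."""
        last = self.last_enroll.get(identifier)
        if last is not None and now - last < self.cooldown:
            return False  # too soon: skip the expensive enrollment work
        self.last_enroll[identifier] = now
        return True


limiter = EnrollCooldown(cooldown_seconds=60)  # hypothetical duration
print(limiter.allow("same-hw-uuid", now=0))    # True: first enrollment is accepted
print(limiter.allow("same-hw-uuid", now=30))   # False: still inside the window
print(limiter.allow("same-hw-uuid", now=60))   # True: window has elapsed
```

A limit like this caps how often duplicate hosts can trigger the expensive enrollment path, but it does not give them distinct identities, so fixing the identifier is still needed.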

Alon Starikov

12/12/2020, 10:33 AM

Apologies, I won’t be able to get around to it this week unfortunately... I will try to get it done as soon as I can. host_identifier=instance actually seems to do the trick for me though, I haven’t encountered any problems since changing this setting

🍻 1
