# fleet
b
is there a parameter I can pass to configs to address this issue:
```
save enroll failed: host identified by 1234123-1234-1234-1234-C3C04F373533 enrolling too often
```
Also seeing:
```
authentication error: missing node key
```
and
```
enroll failed: no matching secret found
```
and finally
```
failed to mark host seen: marking host seen: Error 1205: Lock wait timeout exceeded; try restarting transaction
```
These errors make up less than 0.8% of total traffic from osquery to our ELK stack.
z
Which version of Fleet are you on?
This usually means you have multiple hosts with the same UUIDs. The issue can be addressed by setting `--host_identifier=instance` in your osquery flagfile, or in Fleet 3.9.0 you can configure it within Fleet itself: https://github.com/fleetdm/fleet/blob/master/docs/2-Deployment/2-Configuration.md#osquery_host_identifier
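A minimal sketch of both options follows; the flagfile path and the exact Fleet config key layout are assumptions based on the linked docs, so verify them against your deployment. On the osquery side:
```
# /etc/osquery/osquery.flags — path is an assumption for a default install
--host_identifier=instance
```
And on the Fleet server side (3.9.0+):
```
# fleet.yml sketch, assuming the osquery.host_identifier key described in
# the linked configuration docs
osquery:
  host_identifier: instance
```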
b
3.9.0 is the version
s
As an aside to this, I noticed that if you already have hosts showing up with duplicate IDs, changing to `host_identifier=instance` doesn’t help, because those hosts already have the duplicated ID stored in their osquery backing store and won’t regenerate a new one. Only new hosts that pick up that config change will have newly generated IDs.
@zwass this
b
Would redeploying to the hosts fix it?
z
It sounds like using the setting in Fleet would probably be your easiest option.
@Scott Lampert are you talking about setting `host_identifier=instance` from the osquery options within Fleet?
s
@zwass Both. Once osquery boots up with any sort of config that stores its UUID in the osquery backing store, it won’t change unless you either remove the backing store and restart with `instance` enabled in the flags, or use `ephemeral` in the flags. The issue on the Fleet side is that if you have a bunch of nodes trying to enroll with the same ID already, you really need to use the cooldown or the database will get thousands of lock contentions and fall over (we have 120,000+ nodes checking into Fleet). If a large portion of those nodes are stuck with a non-unique ID, they never get to enroll, since the rate of nodes trying to enroll will always trigger the cooldown. This means you can’t really count on any osquery config change made in Fleet being picked up where the UUID is concerned. This might not be an issue until a certain scale.
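The cooldown referenced here is the server-side rate limit behind the "enrolling too often" error in the first message. A sketch of enabling it, assuming an `enroll_cooldown` key under Fleet's `osquery` config section (the key name and placement are assumptions; verify against your version's configuration docs):
```
# fleet.yml sketch — enroll_cooldown throttles repeated enrollments
# from the same host identifier; key name is an assumption
osquery:
  enroll_cooldown: 1m
```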
z
@Scott Lampert is it possible that what you are seeing is that an already-enrolled osquery database was copied over to multiple hosts? Otherwise that sounds like a bug in osquery, as `instance_identifier` should be generated separately for any installation, regardless of the existence/value of the UUID.
s
The symptom we saw is that osquery was misconfigured locally to not have any `host_identifier` setting on a few thousand hosts exhibiting the above behavior. We found that even after ssh’ing into a host and re-running with `--host_identifier=instance`, Fleet would still see the original duplicate hardware UUID regardless of that setting. If we set it to `ephemeral` it would work correctly. If we deleted the backing store and restarted with `instance`, it would also show up correctly. Just changing it to `instance` would not generate a new UUID in the osquery info once it already had one.
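For anyone hitting the same thing, a sketch of that remediation on a single host; the database path and service name are assumptions for a default Linux install, so adjust for your environment:
```
# stop osqueryd, wipe the RocksDB backing store, and restart with
# --host_identifier=instance in the flagfile so a fresh instance ID
# is generated on startup (path and service name are assumptions)
sudo systemctl stop osqueryd
sudo rm -rf /var/osquery/osquery.db
sudo systemctl start osqueryd
```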
z
`instance_id` is the column Fleet would use if you configure https://github.com/fleetdm/fleet/blob/master/docs/2-Deployment/2-Configuration.md#osquery_host_identifier. That should be unique per osquery database, and if it's not, that's a bug (please file an issue).
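You can check both identifiers on a host directly, since osquery exposes them in the standard `osquery_info` table:
```
-- run in osqueryi or as a scheduled query; instance_id is generated per
-- osquery database, while uuid comes from the hardware
SELECT instance_id, uuid FROM osquery_info;
```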
s
It is, but only if you initially used `instance`. `instance` stores the ID in the backing store once it’s generated. Otherwise you would want `ephemeral`. This is by design in osquery: "instance uses an instance-unique UUID generated at process start, persisted in the backing store." So once it has an ID in the backing store, changing it to `instance` will not generate a new UUID. You would have to use `instance` before osquery creates its DB for the first time.