anyone knows what is the err meaning when i start ...
# fleet
w
Does anyone know what this error means when I start Fleet?
Sep 16 03:47:19 n107-019-021 fleet[1560407]: {"component":"http","err":"authentication error: invalid node key: JlBVRLpv/doDpN1CvShCIpZpnfCERea0","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-16T03:47:19.825997124Z"}
m
This just means that one of the hosts enrolled with Fleet has an invalid node key. In most cases, osqueryd running on the host should successfully re-enroll as long as it has a valid enroll secret. Do you see this message repeated?
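To check how often the message repeats, and whether it comes from one host or many, the log can be tallied by node key. This is a minimal sketch, assuming each log line is a syslog prefix followed by a JSON object with an `err` field, like the line quoted in the question. The function name `count_invalid_node_keys` is made up for illustration and is not part of Fleet.

```python
import json
import re
from collections import Counter

# Matches the node key at the end of Fleet's "invalid node key" error text.
NODE_KEY_RE = re.compile(r"invalid node key: (\S+)")


def count_invalid_node_keys(lines):
    """Return a Counter mapping each rejected node key to its occurrence count."""
    counts = Counter()
    for line in lines:
        start = line.find("{")  # skip the syslog prefix before the JSON payload
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line (e.g. plain-text Go http errors)
        err = entry.get("err")
        if not isinstance(err, str):
            continue
        match = NODE_KEY_RE.search(err)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A single key with a high count would point at one host stuck in a retry loop, while many distinct keys would suggest a wider enrollment problem.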
w
Yes, I do. May I know how to find information about the host that failed enrollment?
k
You can grab that from the REST API at
api/vu/fleet/hosts/identifier/<key from the error>
https://fleetdm.com/docs/using-fleet/rest-api#get-host-by-identifier Hope that helps!
Though I just realized that this may not return if you're getting an invalid node key. 🤦
Please give it a go and let me know what happens.
w
Well, I just see a lot of the same error requests in our log. Is there any way I can locate that data in the Fleet DB?
k
You could query the database directly, yes. But if nothing came back from the API call, I don't believe you'll get anything back from there either. Can you share what the response was from the REST API? I realize there was a typo in the endpoint the first time I gave it:
<your fleet address>/api/v1/fleet/hosts/identifier/<node key>
Or, using MySQL to query the Fleet db:
SELECT id, hostname FROM hosts WHERE node_key = '<node key>';
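The REST lookup can also be scripted. This is a sketch using only the Python standard library. The base URL and token are placeholders, the `Authorization: Bearer` header is how Fleet's REST API expects the API token, and percent-encoding the `/` inside the key is my assumption about how the identifier should be passed in the path. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_host_lookup_request(base_url, identifier, api_token):
    """Build the GET request for the host-by-identifier endpoint."""
    # Node keys can contain "/", so encode everything (assumption, see lead-in).
    encoded = urllib.parse.quote(identifier, safe="")
    url = f"{base_url.rstrip('/')}/api/v1/fleet/hosts/identifier/{encoded}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_token}"}
    )


def fetch_host(base_url, identifier, api_token):
    """Send the request and return the decoded JSON body.

    urllib raises urllib.error.HTTPError for non-2xx responses, which is
    what to expect if the server does not know the identifier.
    """
    request = build_host_lookup_request(base_url, identifier, api_token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

If the call fails with an HTTP error for the key from the log, that would be consistent with the server no longer knowing that node key.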
w
Let me try another case.
Could you help explain this?
m
We commonly see “context canceled” errors when queries to the database are taking too long and timing out. Can you run the following on your database?
show engine innodb status;
show processlist;
w
Sep 19 20:41:53 n121-008-225 fleet[3648337]: {"component":"http","err":"authentication error: find host: timestamp: 2022-09-19T20:36:14Z: context canceled","level":"info","path":"/api/v1/osquery/config","ts":"2022-09-19T20:36:14.727078865Z"}
This is the error I see in the log.
@Michal Nicpon I'm running into the same issue again, and I got this when I ran show engine innodb status;
Could you help explain what the issue with Fleet is?
m
Hmm, do you notice any particular patterns for when you start seeing these errors? There is an interesting error I saw
Sep 16 19:41:51 n107-019-021 fleet[2438691]: 2022/09/16 19:41:51 http: Accept error: accept tcp [::]:8080: accept4: too many open files; retrying in 5ms
Which suggests that maybe your fleet instance is trying to handle too many requests. Can you give me some information about your architecture?
• How many fleet instances are you running? How much memory and cpu do they have?
• How many hosts are enrolled with fleet?
w
• How many fleet instances are you running? How much memory and cpu do they have?
1, mem: no limit, cpu: need to check it out
• How many hosts are enrolled with fleet?
20k
m
Hmm, do you notice any particular patterns for when you start seeing these errors?
For example, do they happen every hour or do you see these errors consistently?
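One way to answer that is to bucket the errors by hour straight from the log. This is a minimal sketch, assuming the raw log lines carry a JSON payload whose `ts` field is an RFC 3339 timestamp such as `2022-09-21T17:07:30.063073026Z`. The function name `errors_per_hour` and the `needle` parameter are hypothetical.

```python
import json
from collections import Counter


def errors_per_hour(lines, needle):
    """Count log entries whose `err` contains `needle`, grouped by hour.

    Keys look like "2022-09-21T17", i.e. the first 13 characters of `ts`.
    """
    buckets = Counter()
    for line in lines:
        start = line.find("{")  # skip the syslog prefix
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        err, ts = entry.get("err"), entry.get("ts")
        if isinstance(err, str) and isinstance(ts, str) and needle in err:
            buckets[ts[:13]] += 1
    return buckets
```

Calling it with `needle="context canceled"` and sorting the result by key would show whether the errors spike at regular intervals or stay roughly constant.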
w
cpu info
root@n107-019-021:/# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3599.998
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-7
I see a lot of errors with a pattern like
I restarted Fleet, and right now I see a lot of errors like:
Sep 21 17:07:30 n107-019-021 fleet[3065443]: {"component":"http","err":"authentication error: find host: dial tcp 127.0.0.1:3306 socket: too many open files","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T17:07:30.063073026Z"}
Sep 21 17:07:30 n107-019-021 fleet[3065443]: {"component":"http","err":"authentication error: find host: dial tcp 127.0.0.1:3306 socket: too many open files","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T17:07:30.063076727Z"}
Sep 21 17:07:30 n107-019-021 fleet[3065443]: {"component":"http","err":"authentication error: find host: dial tcp 127.0.0.1:3306 socket: too many open files","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T17:07:30.063099665Z"}
@Michal Nicpon is there any update?
m
too many open files
This can happen when the open-file limit (ulimit) for the user running Fleet is set too low. See https://fleetdm.com/docs/deploying/faq#what-do-i-do-about-too-many-open-files-errors
If you are running Fleet as a service using systemd, you would need to increase the limit in the service file, e.g.
LimitNOFILE=8192
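To confirm which limit the running process actually has, Linux exposes it in `/proc/<pid>/limits`. This is a small sketch that reads the "Max open files" row, assuming a Linux host. The PID would be the number in the log prefix, e.g. `fleet[3065443]`, and the function names are hypothetical.

```python
from pathlib import Path


def _to_int(value):
    """Convert a limits column to int, mapping "unlimited" to None."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) open-file limits from /proc/<pid>/limits text.

    Returns None if the "Max open files" row is missing.
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # Remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(prefix):].split()[:2]
            return _to_int(soft), _to_int(hard)
    return None


def open_file_limits_for_pid(pid):
    """Read the limits of a running process (Linux only)."""
    return parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
```

A changed `LimitNOFILE` only takes effect after `systemctl daemon-reload` and a restart of the service, so it is worth checking the value again once Fleet is back up.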
k
@Michal Nicpon Just for some context from a separate thread, we did find an issue with the vulnerabilities setup. The database path has been added now and we're still seeing some context cancelled errors. I've suggested giving it a little time for that initial load to level out then checking back in to see what things look like.
@wennan.he, Let's continue the conversation over there to make sure that all of the data is in one spot: https://osquery.slack.com/archives/C01DXJL16D8/p1663449636947269
w
OK, sure.