
wennan.he

09/16/2022, 3:49 AM
Does anyone know what this error means? I see it when I start Fleet:
Sep 16 03:47:19 n107-019-021 fleet[1560407]: {"component":"http","err":"authentication error: invalid node key: JlBVRLpv/doDpN1CvShCIpZpnfCERea0","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-16T03:47:19.825997124Z"}

Michal Nicpon

09/16/2022, 3:24 PM
This just means that one of the hosts enrolled with Fleet has an invalid node key. In most cases, osqueryd running on the host should re-enroll successfully if it has a valid enroll secret. Do you see this message repeated?

wennan.he

09/16/2022, 5:19 PM
Yes, I do. How can I find information about the host that failed enrollment?

Kathy Satterlee

09/16/2022, 5:37 PM
You can grab that from the REST API at
api/vu/fleet/hosts/identifier/<key from the error>
https://fleetdm.com/docs/using-fleet/rest-api#get-host-by-identifier Hope that helps!
Though I just realized that this may not return if you're getting an invalid node key. 🤦
Please give it a go and let me know what happens.

wennan.he

09/16/2022, 8:19 PM
Well, I just see a lot of requests with the same error in our logs. Is there any way I can find that data in the Fleet database?
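When the same error repeats many times, a first step is to see how many distinct node keys are involved. The sketch below assumes the journald/syslog line format quoted in this thread (a prefix followed by one JSON object per line); the function name `count_invalid_node_keys` is hypothetical and not part of Fleet.

```python
import json
import re
from collections import Counter

# Matches the message Fleet logs for hosts it does not recognise, e.g.
# "authentication error: invalid node key: JlBVRLpv/doDpN1CvShCIpZpnfCERea0"
NODE_KEY_RE = re.compile(r"invalid node key: (\S+)")


def count_invalid_node_keys(lines):
    """Count how often each invalid node key appears in Fleet log lines.

    Assumes each line is a syslog/journald prefix followed by one JSON object,
    as in the log excerpts in this thread.
    """
    counts = Counter()
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        match = NODE_KEY_RE.search(entry.get("err", ""))
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, after exporting the service log to a file (for instance with `journalctl -u fleet > fleet.log`, assuming the unit is named `fleet`), `count_invalid_node_keys(open("fleet.log")).most_common(10)` lists the keys seen most often.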

Kathy Satterlee

09/16/2022, 8:35 PM
You could query the database directly, yes. But if nothing came back from the API call, I don't believe you'll get anything back from there either. Can you share what the response was from the REST API? I realize there was a typo in the endpoint the first time I gave it:
<your fleet address>/api/v1/fleet/hosts/identifier/<node key>
Or, using MySQL to query the Fleet db (the node key is a string, so it needs quotes):
SELECT id, hostname FROM hosts WHERE node_key = '<node key>';
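The node key in the original error contains a "/", which would split the identifier across two path segments if pasted into the URL as-is. Below is a minimal sketch that percent-encodes the key and keeps the SQL parameterised. It assumes Fleet decodes a percent-encoded `%2F` in this path segment (not verified here), and `host_by_identifier_url` and `HOST_BY_NODE_KEY_SQL` are hypothetical names. The REST call will typically also need an API token.

```python
from urllib.parse import quote


def host_by_identifier_url(fleet_address, node_key):
    """Build the "get host by identifier" URL for a node key.

    Node keys can contain "/" (the one in this thread does), so the key is
    percent-encoded with safe="" to keep it a single path segment.
    """
    base = fleet_address.rstrip("/")
    return f"{base}/api/v1/fleet/hosts/identifier/{quote(node_key, safe='')}"


# Parameterised form of the MySQL query above: pass this string and
# (node_key,) to cursor.execute() of a DB-API driver that uses the %s
# placeholder style (e.g. PyMySQL or mysqlclient), instead of pasting
# the key into the SQL text.
HOST_BY_NODE_KEY_SQL = "SELECT id, hostname FROM hosts WHERE node_key = %s"
```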

wennan.he

09/16/2022, 8:39 PM
Let me try another case.
Could you help explain this?

Michal Nicpon

09/19/2022, 5:20 PM
We commonly see “context canceled” errors when queries to the database are taking too long and timing out. Can you run the following on your database?
show engine innodb status;
show processlist;
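When reading `show processlist;` output for timeouts, the rows worth looking at are statements that have been in the `Query` command state for a long time. Below is a sketch that filters rows as returned by a DB-API cursor. It assumes the standard eight SHOW PROCESSLIST columns. The 5-second threshold and the helper names are arbitrary choices for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class ProcessRow(NamedTuple):
    """One row of MySQL's SHOW PROCESSLIST output, in column order."""
    id: int
    user: str
    host: str
    db: Optional[str]
    command: str
    time: int              # seconds the thread has been in its current state
    state: Optional[str]
    info: Optional[str]    # statement text (truncated unless SHOW FULL PROCESSLIST)


def long_running_queries(rows: Iterable[tuple], min_seconds: int = 5) -> List[ProcessRow]:
    """Return active queries running for at least min_seconds, longest first."""
    procs = [ProcessRow(*row) for row in rows]
    # Idle connections show Command "Sleep" and are not slow queries.
    slow = [p for p in procs if p.command == "Query" and p.time >= min_seconds]
    return sorted(slow, key=lambda p: p.time, reverse=True)
```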

wennan.he

09/19/2022, 8:42 PM
Sep 19 20:41:53 n121-008-225 fleet[3648337]: {"component":"http","err":"authentication error: find host: timestamp: 2022-09-19T20:36:14Z: context canceled","level":"info","path":"/api/v1/osquery/config","ts":"2022-09-19T20:36:14.727078865Z"}
This is the error I see in the log.
@Michal Nicpon I'm running into the same issue again, and I got this when I ran show engine innodb status;
Could you help explain what the issue with Fleet is?

Michal Nicpon

09/21/2022, 4:27 PM
Hmm, do you notice any particular patterns for when you start seeing these errors? I saw an interesting error:
Sep 16 19:41:51 n107-019-021 fleet[2438691]: 2022/09/16 19:41:51 http: Accept error: accept tcp [::]:8080: accept4: too many open files; retrying in 5ms
This suggests that your Fleet instance may be trying to handle too many requests. Can you give me some information about your architecture?
• How many Fleet instances are you running? How much memory and CPU do they have?
• How many hosts are enrolled with Fleet?

wennan.he

09/21/2022, 4:33 PM
• How many Fleet instances are you running? How much memory and CPU do they have?
  1 instance. Memory: no limit. CPU: I need to check.
• How many hosts are enrolled with Fleet?
  20k
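A back-of-the-envelope check connects the host count to the "too many open files" error. Assume every host polls `/api/v1/osquery/distributed/read` once per osquery `--distributed_interval` (10 s here is a hypothetical value) and that responses take 0.5 s on average (also hypothetical). Little's law (L = λ·W) then estimates how many requests are in flight at once. Each of those requests holds at least one socket, and a socket is a file descriptor. The helper name `estimate_load` is made up for this sketch.

```python
import math


def estimate_load(hosts, interval_s, avg_latency_s):
    """Rough request rate and concurrency for one polling endpoint (sketch).

    hosts / interval_s is the request rate if every host polls once per
    interval; Little's law (L = lambda * W) converts that rate and the
    average response time into the average number of requests in flight.
    """
    rate = hosts / interval_s          # requests per second
    in_flight = rate * avg_latency_s   # average concurrent requests
    return rate, math.ceil(in_flight)
```

Under those assumptions, `estimate_load(20_000, 10, 0.5)` gives 2,000 requests/s and about 1,000 concurrent requests. That is already close to the 1,024 soft limit on open files that many Linux systems use by default, before counting MySQL connections and other descriptors. Real numbers depend on the actual interval, config refresh, and logging settings.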

Michal Nicpon

09/21/2022, 4:35 PM
Hmm, do you notice any particular patterns for when you start seeing these errors?
For example, do they happen every hour or do you see these errors consistently?

wennan.he

09/21/2022, 4:35 PM
CPU info:
root@n107-019-021:/# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping: 7
CPU MHz: 3599.998
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-7
I see a lot of errors that follow a pattern like:
I restarted Fleet, and right now I see a lot of errors like:
Sep 21 17:07:30 n107-019-021 fleet[3065443]: {"component":"http","err":"authentication error: find host: dial tcp 127.0.0.1:3306: socket: too many open files","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T17:07:30.063073026Z"}
Sep 21 17:07:30 n107-019-021 fleet[3065443]: {"component":"http","err":"authentication error: find host: dial tcp 127.0.0.1:3306: socket: too many open files","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T17:07:30.063076727Z"}
Sep 21 17:07:30 n107-019-021 fleet[3065443]: {"component":"http","err":"authentication error: find host: dial tcp 127.0.0.1:3306: socket: too many open files","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2022-09-21T17:07:30.063099665Z"}
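To see which failure dominates after a restart, the errors can be grouped by message and endpoint. The sketch below assumes the same one-JSON-object-per-line log format. The substring-to-label table is hypothetical and only covers errors quoted in this thread. Plain-text lines such as the `accept4: too many open files` message carry no JSON payload, so they are skipped.

```python
import json
from collections import Counter

# Hypothetical mapping from substrings of errors quoted in this thread to a
# short label; the first match wins. Illustrative, not exhaustive.
ERROR_PATTERNS = [
    ("too many open files", "fd limit reached"),
    ("context canceled", "request/DB timeout"),
    ("invalid node key", "unknown node key"),
]


def categorize_errors(lines):
    """Count Fleet JSON log errors by category and by API path."""
    by_category = Counter()
    by_path = Counter()
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # plain-text line without a JSON payload
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        err = entry.get("err")
        if not err:
            continue  # not an error entry
        label = next((name for needle, name in ERROR_PATTERNS if needle in err), "other")
        by_category[label] += 1
        by_path[entry.get("path", "?")] += 1
    return by_category, by_path
```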
@Michal Nicpon, is there any update?

Michal Nicpon

09/21/2022, 6:38 PM
too many open files
This can be caused by the ulimit for the user running Fleet being set too low. See https://fleetdm.com/docs/deploying/faq#what-do-i-do-about-too-many-open-files-errors
If you are running Fleet as a service using systemd, you need to increase the limit in the service file, e.g.:
[Service]
LimitNOFILE=8192
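After raising `LimitNOFILE`, it is worth confirming that the running process actually picked up the new limit and how close it is to that limit. Below is a Linux-only sketch that reads `/proc/<pid>/limits` and `/proc/<pid>/fd`. The helper names are hypothetical, and reading another user's process typically requires root.

```python
import os


def max_open_files(pid):
    """Return the (soft, hard) "Max open files" limits of a process (Linux only).

    Each value is an int, or None where the kernel reports "unlimited".
    """
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Line looks like: "Max open files   1024   1048576   files"
                soft, hard = line.split()[3:5]
                return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError(f"no 'Max open files' line for pid {pid}")


def open_fd_count(pid):
    """Number of file descriptors the process currently has open."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

For example, with the PID from the log prefix (`fleet[3065443]`), compare `open_fd_count(3065443)` against `max_open_files(3065443)[0]`. A change to the systemd unit only takes effect after `systemctl daemon-reload` and a restart of the service.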

Kathy Satterlee

09/21/2022, 6:42 PM
@Michal Nicpon Just for some context from a separate thread, we did find an issue with the vulnerabilities setup. The database path has been added now and we're still seeing some context cancelled errors. I've suggested giving it a little time for that initial load to level out then checking back in to see what things look like.
@wennan.he, Let's continue the conversation over there to make sure that all of the data is in one spot: https://osquery.slack.com/archives/C01DXJL16D8/p1663449636947269

wennan.he

09/21/2022, 6:46 PM
OK, sure.