# fleet
m
Happy Friday all! So it looks like I have a policy that is failing to run properly (it has run correctly in the past). Other queries are running in a timely manner, but this one gets to about a 2% response rate (with some results provided) and then just sits there. Eventually I get an error of "Fleet's connection to Redis failed (campaign ID <campaign number>)." Redis is on the same system as Fleet, so I wouldn't expect a connection error. When looking at the error log I see this:

Mar 8 18:20:36 fleetdm sh[180502]: {"component":"http","err":"error in query ingestion","ingestion-err":"campaignID=1281 stopped","ip_addr":"<IP>","level":"error","method":"POST","took":"280.78365ms","ts":"2024-03-08T18:20:36.297199505Z","uri":"/api/v1/osquery/distributed/write","x_for_ip_addr":"<IP>, <IP>"}

The policy is this:

SELECT * FROM registry WHERE path LIKE 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CloudDomainJoin\JoinInfo\%\TenantId' AND data = '<TENANTID>'

Version: 4.46.0
d
@Mike S. are other live queries working ok?
m
Hi @Dherder! Yes, so far I don't see issues with other live queries.
d
If you run the query locally with osqueryi, do you get results?
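i.e., paste the policy's SQL into an osqueryi shell on one of the affected Windows hosts (with your real tenant ID in place of the <TENANTID> placeholder):

-- Same statement as the policy above; <TENANTID> is a placeholder
SELECT *
FROM registry
WHERE path LIKE 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CloudDomainJoin\JoinInfo\%\TenantId'
  AND data = '<TENANTID>';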
m
I just had someone test it out and it runs properly.
I did some other troubleshooting; so far I haven't had much luck. CPU and RAM utilization doesn't spike or show any indication of the system being taxed. The fleet process doesn't go beyond 7% CPU utilization. mysqld and redis sit at around 2% and 0.3% utilization, respectively, while the query is running. No errors in the redis-server.log file. No errors in the mysql error.log file. I looked at the fleet logs again and did spot some newer errors - not sure if they are related though:

Mar 10 02:06:47 fleetdm sh[1170]: {"cron":"vulnerabilities","err":"msrc sync: remove /tmp/vulndbs/fleet_msrc_Windows_10-2024_03_09.json: no such file or directory","level":"error","msg":"updating msrc definitions","ts":"2024-03-10T02:06:47.974147513Z"}
Mar 10 04:47:06 fleetdm sh[1170]: {"component":"http","err":"json decoder error","internal":"unexpected EOF","level":"info","path":"/api/v1/osquery/distributed/write","ts":"2024-03-10T04:47:06.942060882Z","uuid":"UUID"}
Mar 11 01:07:01 fleetdm sh[1170]: {"cron":"vulnerabilities","err":"msrc sync: remove /tmp/vulndbs/fleet_msrc_Windows_11-2024_03_10.json: no such file or directory","level":"error","msg":"updating msrc definitions","ts":"2024-03-11T01:07:01.468887387Z"}
Mar 11 05:26:31 fleetdm sh[1170]: {"component":"http","err":"json decoder error","internal":"EOF","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2024-03-11T05:26:31.912572452Z","uuid":"UUID"}
Mar 11 05:59:02 fleetdm sh[1170]: {"component":"http","err":"json decoder error","internal":"unexpected EOF","level":"info","path":"/api/v1/osquery/distributed/write","ts":"2024-03-11T05:59:02.939625879Z","uuid":"UUID"}
Mar 11 13:32:50 fleetdm sh[1170]: {"component":"http","err":"json decoder error","internal":"EOF","level":"info","path":"/api/v1/osquery/distributed/read","ts":"2024-03-11T13:32:50.761276112Z","uuid":"UUID"}
d
Since all other queries seem to be working and the SQL runs locally, what happens if you re-create the query? Do you simply get no results for all hosts?
m
I deleted and recreated/re-ran the query in Fleet, and it looks like the behavior is the same.
Perhaps I spoke too soon! I ran it again and now it seems to be working properly!
Thanks for your help!
So it looks like I might be back where I started after recreating the query. I checked on the policy and it had only queried around 32 hosts (out of 500+ online Windows hosts). When I ran the query manually, it was back to a 2% response rate with no errors.
I made a change to the query - instead of selecting * I changed it to select the path column only. That improved the response rate to 19%.
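For reference, the narrowed-down query looks roughly like this (reconstructed from the change described above; <TENANTID> is still a placeholder):

-- Same filter as before, but returning only the path column to shrink the result payload
SELECT path
FROM registry
WHERE path LIKE 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CloudDomainJoin\JoinInfo\%\TenantId'
  AND data = '<TENANTID>';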
I'm discovering that this is affecting other queries now, but only those related to Windows hosts. It looks like we're still having issues with Cloudflare blocking/rate limiting these queries, despite putting in bypasses for them.
d
The fact that changing the query by reducing the volume (SELECT path instead of SELECT *) gave more results leads me to believe that some of the hosts have denylisted the query. But if the query ran locally, that is kind of mysterious to me. Do you think you are still seeing Cloudflare blocking occurring? If not, perhaps we could look at a couple of the host osquery logs (from a host where you would expect to see data but are not) to see if there are any errors. Another thing to try would be to schedule the query to run, turn on automations, and see if you get results in the osquery_results files, just to rule out anything going weird with Redis/live query.
m
So I didn't see any errors in Fleet related to denylisting, whereas I have for other alerts. I'm looking for some test hosts that are working and not working, and will begin pulling local log data from them to see if there is anything to work with. I'll also give the osquery_results option a shot.
I do still see Cloudflare blocking occurring, but I can't necessarily tie that to this yet. Working on getting CF access so I can do some real-time testing.
I was able to do some testing, and it looks like Windows hosts are still getting blocked by the "Command Injection - Common Attack Commands" rule. Do you know if Cloudflare is able to allowlist based on the certificate the client presents? I'm thinking that since the certificate the orbit client presents is the same, we could use it to allow traffic and bypass this detection in instances where this certificate is present. I'm looking into this internally as well, just wanted to see if you had any experience with this.
l
Hello @Mike S.! I'm working on reproducing this issue: #18110 Any findings on your end? I'm trying to set up a Windows 11 VM connected to Azure AD. Any other information that could help me reproduce?
Do you have logs for the devices that are not responding to this query?
C:\Windows\system32\config\systemprofile\AppData\Local\FleetDM\Orbit\Logs\orbit-osquery.log
By searching old threads, it seemed you had a similar issue a year ago around the Cloudflare WAF: https://osquery.slack.com/archives/C01DXJL16D8/p1683292947556769?thread_ts=1681501891.903579&cid=C01DXJL16D8 Were you able to update the WAF rules?
m
Hi Lucas - just saw these, thanks for looking into this! Let me put some data together for you.
For the Windows setup:
osquery version: 5.11.0
Windows version: we see both Windows 10 and 11 hosts responding.
We have Zscaler agents installed on the workstations.

From a Fleet perspective: no errors are generating that indicate this query is being denylisted.

On hosts that are not responding: I noticed a correlation where vitals are not being retrieved for long periods of time, which could be a sign of a larger issue.

On hosts that are responding: vitals are being fetched on their normal cadence. Zscaler and non-Zscaler IPs are shown responding to the query.

For the WAF issue: yes, we did make some changes from a hardening perspective. We configured the WAF to essentially reflect this - https://fleetdm.com/guides/what-api-endpoints-to-expose-to-the-public-internet - and block other traffic.

Working on getting the log data, should have that soon.