# general
c
Hello! I'm new to managing an osquery instance, and something I've noticed is that some distributed queries I make have extremely long tails. I have around 3000 machines online. For certain queries I can only get ~1000 to respond in a reasonable amount of time, while for others I can usually get 95% of machines to respond quickly. What tools are recommended for debugging these types of problems?
f
Not sure of what tools are out there to debug query response time, but if you share a representative query that has this long tail, some of us may be able to speculate on the potential cause.
c
Sure. The one that finishes is a query looking for all crontab information. The one that hangs with a long tail is a query looking for processes that have listening ports:
SELECT p.name, p.path, lp.port, lp.address, lp.protocol FROM listening_ports lp LEFT JOIN processes p ON lp.pid = p.pid WHERE lp.port != 0 AND p.name != '';
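A side note on the query itself, offered as a sketch and not as a diagnosis of the long tail: because the WHERE clause filters on p.name, any listening_ports row without a matching process gets a NULL name and is dropped, so the LEFT JOIN here should return the same rows as a plain JOIN. The example below checks that with Python's built-in sqlite3 module (osquery's SQL is SQLite-based). The two tables are toy stand-ins with invented rows, not real osquery data.

```python
import sqlite3

# Toy stand-ins for the osquery tables; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processes (pid INTEGER, name TEXT, path TEXT);
    CREATE TABLE listening_ports (pid INTEGER, port INTEGER, address TEXT, protocol INTEGER);
    INSERT INTO processes VALUES (100, 'sshd', '/usr/sbin/sshd');
    INSERT INTO listening_ports VALUES (100, 22, '0.0.0.0', 6);
    -- pid 200 has no matching row in processes
    INSERT INTO listening_ports VALUES (200, 8080, '0.0.0.0', 6);
""")

# Same shape as the query in the thread; only the join type varies.
QUERY = """
    SELECT p.name, p.path, lp.port, lp.address, lp.protocol
    FROM listening_ports lp {join} processes p ON lp.pid = p.pid
    WHERE lp.port != 0 AND p.name != ''
"""

left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(QUERY.format(join="JOIN")).fetchall()

# The unmatched port (pid 200) has p.name = NULL, and NULL != '' is not true,
# so the WHERE clause removes it from the LEFT JOIN result as well.
print(left_rows)
print(left_rows == inner_rows)
```

If rows for ports with no visible process are actually wanted, the p.name condition would need to move into the ON clause or be dropped.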
f
Perhaps a naive question, but is there any consistent pattern among the unresponsive devices (e.g. only Windows devices are slow)?
c
Let me check. I'd expect there's some underlying pattern, but they're all running CentOS, so it wouldn't be something like that.
Nah, there doesn't seem to be any pattern that I can detect based on the machine name.
f
My experience troubleshooting unresponsive CentOS devices is pretty limited, unfortunately 😕
c
ok, thanks for trying!
f
My troubleshooting steps, if I were trying to resolve this, would be:
1. Identify (if possible) a reproducible query/cohort that always long-tails.
2. Examine whether all queries are slow to return (e.g. SELECT * FROM system_info;).
3. If other queries return quickly and the problem is limited to this listening-ports query, break the query apart into its individual tables (e.g. SELECT * FROM processes; and SELECT * FROM listening_ports;).
4. If possible, put physical hands on one of the repeat-offending devices and determine whether the behavior is reproducible when querying the device locally (e.g. in osqueryi).
5. If it is reproducible in osqueryi, turn on timing with the .timer ON meta-command and perhaps prefix the query with EXPLAIN QUERY PLAN to see if you can chase down the source of the response delay.
6. Establish a pattern for why some CentOS devices are susceptible to this effect and others are not.
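Steps 2 and 3 above could be scripted on an affected host roughly as follows. This is a minimal sketch, assuming osqueryi is on the PATH and is invoked as osqueryi --json "<query>"; the helper names run_osqueryi and time_queries are made up for this example. Each measurement includes osqueryi's own startup time, so the baseline query is there to give a rough floor to compare against.

```python
import json
import subprocess
import time

# Queries from the thread: a cheap baseline, each table alone, then the full join.
QUERIES = {
    "baseline": "SELECT * FROM system_info;",
    "processes": "SELECT * FROM processes;",
    "listening_ports": "SELECT * FROM listening_ports;",
    "joined": (
        "SELECT p.name, p.path, lp.port, lp.address, lp.protocol "
        "FROM listening_ports lp LEFT JOIN processes p ON lp.pid = p.pid "
        "WHERE lp.port != 0 AND p.name != '';"
    ),
}


def run_osqueryi(sql):
    """Run one query through a local osqueryi and return the parsed JSON rows.

    Assumes osqueryi is installed and on the PATH.
    """
    result = subprocess.run(
        ["osqueryi", "--json", sql],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def time_queries(queries, runner=run_osqueryi):
    """Return (label, seconds, row_count) tuples, slowest first.

    `runner` is injectable so the timing logic can be exercised without osqueryi.
    """
    timings = []
    for label, sql in queries.items():
        start = time.perf_counter()
        rows = runner(sql)
        timings.append((label, time.perf_counter() - start, len(rows)))
    return sorted(timings, key=lambda t: t[1], reverse=True)


# Example use on a slow host (uncomment to run against a real osqueryi):
# for label, seconds, count in time_queries(QUERIES):
#     print(f"{label:16s} {seconds:8.3f}s  {count} rows")
```

If "processes" or "joined" is far slower than the baseline on the slow hosts but not on the fast ones, that narrows the search to that table on those machines. If everything is fast locally, the delay is more likely in how the query is distributed or how results are returned.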
c
Wow, thanks!!! I've been struggling to figure out how to manage the health of my instance, and this looks like a great starting point.
s
Perhaps a related question… How are you managing these osquery instances? Are you confident that the fleet manager is distributing the query promptly, and that it’s pushing results promptly? I ask because my first instinct is that a long tail on a simple query comes from the fleet manager.