# kolide
r
Basically, my question above relates back to this issue: is receiving results one by one the desired behavior, currently and going forward? https://github.com/kolide/fleet/issues/1895
z
The issue you are referring to was discussing distributed queries. Scheduled query results never go through Redis; they are received in batches as the osqueryd clients send them.
m
How does that batching scale with a large number of hosts if Fleet is handling one host's batch at a time?
z
Fleet doesn't batch the osqueryd hosts. They send their logs in batches, and Fleet processes those logs as they come in.
m
When we test, the results indicate that the osquery host results are received one at a time, not simultaneously.
z
Are you discussing scheduled queries or distributed queries?
m
Well, scheduled queries are our main scenario. We need steady logs coming from infrastructure, not ad hoc queries. I think the distributed query was tested based on your recommendation, but I think the results looked similar. @RPuth?
r
Scheduled queries
z
And what is the problem you are experiencing? Each osqueryd host will send the batched results it has stored over the interval set in `logger_tls_period`.
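For example, a minimal osquery config that sets this period and schedules a query might look like the following (the specific values and the schedule entry name are for illustration only, not recommendations):

```json
{
  "options": {
    "logger_tls_period": 10,
    "config_tls_refresh": 60
  },
  "schedule": {
    "bin_listing": {
      "query": "SELECT * FROM file WHERE directory = '/usr/bin/';",
      "interval": 60,
      "snapshot": true
    }
  }
}
```

Each host buffers the results produced by its scheduled queries and flushes the buffer to Fleet every `logger_tls_period` seconds.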
m
If we add more hosts, the EPS performance per host goes down.
z
How many hosts are we talking about here? What is the EPS (total, or per host)? Are you using `logger_tls_period: 1`? That is likely to unnecessarily increase the load on the Fleet server. Do you have your Fleet server horizontally scaled? What does the CPU usage look like on the Fleet server? What does the CPU usage look like on the MySQL server?
r
- Hosts: at the moment 100; the range is within [100, 500]
- EPS: maximum received at a single point in time is 7786
- `logger_tls_period`: being varied to test different results, within [1, 60]
- Running Fleet and MySQL on a single VM (8 cores, 300 GB disk, >64 GB RAM)
- Fleet using 1–3% CPU
- MySQL using 1–3% CPU
z
What query are you using to generate the logs?
r
For that one in particular, we were querying `file_events` to generate a large volume. However, previously we queried a more stable source:
`SELECT * FROM file WHERE directory="/usr/bin/"`
which returns an exact count of 457 rows per server. Running the query against a single target returned a beautiful result of 457 within a 1-second time frame. However, running against 100 hosts returned the desired total count of 45700, but over 84 seconds (`logger_tls_period` was within the range 1–10 during that test).
In the case of the latter query, I would've expected to see the results returned within a much shorter time frame.
z
During that time what was the CPU usage like?
There are all sorts of factors that could cause slight delays, and I'm only really concerned if Fleet is unable to handle the throughput. 1 minute latency seems reasonable given all of the different intervals, periods, etc.
(seems reasonable as a maximum. I'd expect 50th percentile to be closer to those intervals you have set)
What was the interval you set on the query?
m
45700 over 84 seconds is 544 EPS. I'm not sure that's a reasonable maximum in a large environment
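To make the arithmetic explicit (the figures are the measurements from the test above; the per-host number is a simple average, not a measurement):

```python
# Observed in the test: 100 hosts returned 45700 results over 84 seconds.
total_results = 45_700
duration_s = 84
hosts = 100

total_eps = total_results / duration_s   # aggregate events per second, ~544
per_host_eps = total_eps / hosts         # average per-host throughput, ~5.4

print(f"total EPS:    {total_eps:.0f}")
print(f"per-host EPS: {per_host_eps:.2f}")
```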
z
Certainly not. But I'm not convinced your test was generating any sort of maximum throughput.
m
What would be a good approach to test that? Right now we need to test the scaling before we can think about implementing it into production. As of now the results make it look like the more osquery hosts we add, the lower the EPS per osquery host. With a single osquery host the per host performance seems fine. What's a good way for us to test the impact of 5k hosts all doing 10-100 EPS each? (without actually deploying that many 🙂 )
z
Can you please answer my above question before we continue?
What was the interval you set on the query?
m
Which one?
ah k
@RPuth?
z
Any other information you can provide about the methodology would be helpful
If you are working with 100 hosts and want to simulate 10 EPS on 5k hosts, you'll need to write a query that generates ~500 EPS on those hosts (continuously). From the information you provided above it sounds like you ran a query that generated that many results on each host once.
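The scaling logic behind that target, spelled out (the 5k-host scenario is the one you described, so this is an extrapolation, not a measurement):

```python
# Goal: simulate 5k hosts at 10 EPS each, using only 100 test hosts.
target_hosts = 5_000
target_per_host_eps = 10
test_hosts = 100

# Aggregate load the Fleet server must sustain in the real deployment.
total_eps = target_hosts * target_per_host_eps   # 50,000 EPS

# Each test host must continuously generate this many events per second.
per_test_host_eps = total_eps / test_hosts       # 500 EPS

print(f"aggregate target: {total_eps} EPS")
print(f"per test host:    {per_test_host_eps:.0f} EPS, continuously")
```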
r
The interval was 1 second
z
Testing a single burst of queries does not sound like an accurate representation of what you are hoping to achieve.
If you can provide the entirety of the configuration you used, that would also be helpful.
r
z
Okay, so for example, you have `config_tls_refresh: 60`. This means that it may take up to 60s for a host to receive a new query you schedule.
This is why you can't measure throughput with a single burst.
If you want an accurate measure of throughput, schedule a query that will generate some known EPS and then verify it over a longer time period. The query that you used above scheduled as a snapshot seems reasonable.
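One way to verify the sustained rate is to parse the result log and count rows per second, then average over a long window. A rough sketch, assuming results are written as JSON lines where each event carries an osquery `unixTime` field and snapshot results batch rows under a `snapshot` key (the field names and log layout are assumptions about your logging setup, so adjust to match your output):

```python
import json
from collections import Counter

def eps_by_second(log_path):
    """Count result rows per second from a JSON-lines result log.

    Assumes one JSON event per line with a `unixTime` timestamp.
    Snapshot events that batch rows under a `snapshot` key are
    counted per row; other events count as a single row.
    """
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            ts = event.get("unixTime")
            if ts is None:
                continue  # skip events without a timestamp
            rows = len(event.get("snapshot", [])) or 1
            counts[ts] += rows
    return counts
```

Averaging `sum(counts.values())` over the span of timestamps gives the sustained EPS, which is the number to compare across `logger_tls_period` settings.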
r
Thanks, I'll change the testing to better represent what you are suggesting. Is there a recommended value for `logger_tls_period` that gives a better EPS representation? I've seen 1, 3, 10, 60, etc.
Appreciate the help in this
z
With 5k hosts I'd be considering a `logger_tls_period` in the 60s range. To simulate this on the smaller number of hosts you might try 1–5s.