#kolide
RPuth

11/06/2018, 6:20 PM
Basically, my question above relates back to this issue, and whether receiving results one by one is the intended behavior currently and going forward: https://github.com/kolide/fleet/issues/1895
zwass

11/06/2018, 6:43 PM
The issue you are referring to was discussing distributed queries. Scheduled query results never go through Redis. Scheduled query results are received in batches as they are sent from the osqueryd clients.
Martijn Bakkes

11/06/2018, 6:46 PM
How does that batching scale with a large number of hosts if Fleet is processing one server's batch at a time?
zwass

11/06/2018, 6:47 PM
Fleet doesn't batch the osqueryd hosts. They send their logs in batches, and Fleet processes those logs as they come in.
Martijn Bakkes

11/06/2018, 6:48 PM
When we test, the results indicate that the osquery host results are received one at a time, not simultaneously.
zwass

11/06/2018, 6:51 PM
Are you discussing scheduled queries or distributed queries?
Martijn Bakkes

11/06/2018, 6:55 PM
Well, scheduled queries are our main scenario; we need steady logs coming from the infrastructure, not ad hoc queries. I think the distributed query path was tested based on your recommendation, but I think the results looked similar. @RPuth?
RPuth

11/06/2018, 6:55 PM
Scheduled queries
zwass

11/06/2018, 7:00 PM
And what is the problem you are experiencing? Each osqueryd host will send the batched results it has stored over the interval set in
logger_tls_period
.
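(For reference, `logger_tls_period` is a standard osqueryd flag controlling how often each host ships its buffered result logs. A minimal illustrative flag set, written in the same key: value style used elsewhere in this thread — the values here are examples, not from this conversation:)

```yaml
logger_plugin: tls
logger_tls_endpoint: /api/v1/osquery/log
logger_tls_period: 10   # seconds between log shipments from each host
```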
Martijn Bakkes

11/06/2018, 7:04 PM
If we add more hosts, the EPS performance per host goes down.
zwass

11/06/2018, 7:08 PM
How many hosts are we talking about here? What is the EPS (total, or per host)? Are you using
logger_tls_period: 1
? This is likely to unnecessarily increase the load on the Fleet server. Do you have your Fleet server horizontally scaled? What does the CPU usage look like on the Fleet server? What does the CPU usage look like on the MySQL server?
RPuth

11/06/2018, 7:17 PM
Hosts: 100 at the moment, though the range is within [100, 500]
EPS: the maximum received at a single point in time is 7786
logger_tls_period: being varied to test different results, within [1, 60]
Running Fleet and MySQL on a single VM (8 cores, 300 GB disk, >64 GB RAM)
Fleet using 1-3% CPU; MySQL using 1-3% CPU
zwass

11/06/2018, 7:23 PM
What query are you using to generate the logs?
RPuth

11/06/2018, 7:28 PM
For that one in particular, we were querying file_events to generate a large volume. Previously, however, we queried a more stable area,
SELECT * FROM file WHERE directory = '/usr/bin/';
which returns an exact count of 457 per server. Running that query against a single target returned a beautiful result of 457 rows within a 1-second time frame. However, when running against 100 hosts it returned the desired total count of 45700, but over 84 seconds (logger_tls_period was within the range 1 -> 10 during that test).
7:29 PM
In the case of the latter query, I would've expected to see the results returned within a much shorter time frame
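(For reference, the arithmetic behind those numbers — a throwaway check, not part of the conversation:)

```python
# Reproducing the arithmetic from the test described above.
rows_per_host = 457   # rows returned per host by the /usr/bin/ query
hosts = 100
elapsed_s = 84        # observed wall-clock time until all results arrived

total_rows = rows_per_host * hosts
print(total_rows)                     # 45700
print(round(total_rows / elapsed_s))  # 544 events/sec sustained over the window
```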
zwass

11/06/2018, 7:29 PM
During that time what was the CPU usage like?
7:31 PM
There are all sorts of factors that could cause slight delays, and I'm only really concerned if Fleet is unable to handle the throughput. 1 minute latency seems reasonable given all of the different intervals, periods, etc.
7:31 PM
(seems reasonable as a maximum. I'd expect 50th percentile to be closer to those intervals you have set)
7:32 PM
What was the interval you set on the query?
Martijn Bakkes

11/06/2018, 7:33 PM
45700 over 84 seconds is 544 EPS. I'm not sure that's a reasonable maximum in a large environment
zwass

11/06/2018, 7:34 PM
Certainly not. But I'm not convinced your test was generating any sort of maximum throughput.
Martijn Bakkes

11/06/2018, 7:37 PM
What would be a good approach to test that? Right now we need to test the scaling before we can think about implementing it into production. As of now the results make it look like the more osquery hosts we add, the lower the EPS per osquery host. With a single osquery host the per host performance seems fine. What's a good way for us to test the impact of 5k hosts all doing 10-100 EPS each? (without actually deploying that many 🙂 )
zwass

11/06/2018, 7:37 PM
Can you please answer my above question before we continue?
7:38 PM
What was the interval you set on the query?
Martijn Bakkes

11/06/2018, 7:38 PM
Which one?
7:38 PM
ah k
7:38 PM
@RPuth?
zwass

11/06/2018, 7:38 PM
Any other information you can provide about the methodology would be helpful
7:39 PM
If you are working with 100 hosts and want to simulate 10 EPS on 5k hosts, you'll need to write a query that generates ~500 EPS on those hosts (continuously). From the information you provided above it sounds like you ran a query that generated that many results on each host once.
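(The sizing rule described there can be sketched as follows — a hypothetical helper for illustration, not part of Fleet or osquery:)

```python
def per_host_eps(target_hosts, target_eps_per_host, test_hosts):
    """EPS each test host must generate *continuously* so that a test bed of
    test_hosts machines produces the same aggregate load as target_hosts
    machines emitting target_eps_per_host each."""
    return target_hosts * target_eps_per_host / test_hosts

# Simulating 5,000 hosts at 10 EPS with only 100 test hosts:
print(per_host_eps(5000, 10, 100))  # 500.0 EPS per test host
```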
RPuth

11/06/2018, 7:42 PM
The interval was 1 second
zwass

11/06/2018, 7:42 PM
Testing a single burst of queries does not sound like an accurate representation of what you are hoping to achieve.
7:42 PM
If you can provide the entirety of the configuration you used, that would also be helpful.
RPuth

11/06/2018, 7:50 PM
zwass

11/06/2018, 7:56 PM
Okay, so for example, you have
config_tls_refresh: 60
. This means that it may take up to 60s for a host to receive a new query you schedule.
7:56 PM
This is why you can't measure throughput with a single burst.
7:58 PM
If you want an accurate measure of throughput, schedule a query that will generate some known EPS and then verify it over a longer time period. The query that you used above scheduled as a snapshot seems reasonable.
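(One way to do that verification is to count result-log events over a long capture window. A minimal sketch, assuming results are logged as JSON lines — one event per line — to a file; the log path in the comment is hypothetical:)

```python
import json

def eps_from_log(lines, elapsed_s):
    """Average events per second from an iterable of JSON-lines result
    events, e.g. a slice of the result log captured over elapsed_s seconds."""
    events = 0
    for line in lines:
        if line.strip():
            json.loads(line)  # each non-empty line should be one result event
            events += 1
    return events / elapsed_s

# Example: 600 snapshot results captured over a 5-minute window -> 2.0 EPS
sample = ['{"name": "files_in_usr_bin", "action": "snapshot"}'] * 600
print(eps_from_log(sample, elapsed_s=300))  # 2.0

# In practice (path is hypothetical; use your logger plugin's output):
# with open("/var/log/fleet/osquery_result") as f:
#     print(eps_from_log(f, elapsed_s=300))
```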
RPuth

11/06/2018, 8:10 PM
Thanks, I'll change the testing to better represent what you are suggesting. Is there a preferred value for logger_tls_period that gives a more representative EPS measurement? I've seen 1, 3, 10, 60, etc.
8:10 PM
Appreciate the help in this
zwass

11/06/2018, 8:13 PM
With 5k hosts I'd be considering a
logger_tls_period
in the 60s range. To simulate this on the smaller number of hosts you might try 1-5s.