
Ari Weinberg

07/12/2022, 8:45 PM
Hey all, just updated Fleet to 4.17.0 and I'm having a problem with distributed queries. I'm getting the following error:
{
  "component": "http",
  "err": "error in query ingestion",
  "ingestion-err": "campaign waiting for listener (please retry)",
  "ip_addr": "ENDPOINT-IP:41730",
  "level": "error",
  "method": "POST",
  "took": "1.136788ms",
  "ts": "2022-07-12T20:38:50.060101107Z",
  "uri": "/api/v1/osquery/distributed/write",
  "x_for_ip_addr": ""
}
Also getting (although not sure it's related):
{
  "component": "http",
  "err": "read auth token: reading from websocket: sockjs: session not in open state",
  "msg": "failed to read auth token",
  "ts": "2022-07-12T20:37:57.77330272Z"
}
The problem appears to be the agent talking back to the Fleet server, because I can see the query being run on the agent in debug mode. It just seems to fail when posting back the results. The agent is vanilla osquery 5.1.0. This only started when I updated from Fleet 4.9.1 a few minutes ago.

Kathy Satterlee

07/12/2022, 9:27 PM
Moving the conversation into a thread. Sorry about that! Are you using a proxy?

Ari Weinberg

07/12/2022, 9:27 PM
Nope
Tracert shows 1 direct hop from the agent to the server
server is running in docker though
Did the osquery api endpoint change?

Kathy Satterlee

07/12/2022, 9:52 PM
There were some changes in 4.13.2 related to websockets. It seems like we're starting to see these issues when websocket traffic isn't allowed because SockJS isn't reliable as a workaround. Can you tell me more about your Fleet/MySQL/Redis setup? I can reach out to the team and see what they suggest we poke at.

Ari Weinberg

07/12/2022, 9:53 PM
all on a single server spun up with docker-compose

Kathy Satterlee

07/12/2022, 10:05 PM
Would you mind sharing your compose file?
With any sensitive data removed, of course.

Ari Weinberg

07/12/2022, 10:27 PM
I dm'd it to you
Thanks so much for your help!

Jason Cetina

07/12/2022, 10:33 PM
Erm, i'm not sure if this is related or not, but we updated to 4.1.7 yesterday and all our hosts show as offline in fleet.
just happened an hour ago

Kathy Satterlee

07/12/2022, 11:04 PM
@Jason Cetina Can you check your logs and see if you're seeing similar errors?
Thanks, @Ari Weinberg. I'll bring this to the team and see if we can sort out what's up. I may not get a response until tomorrow, but I'll update this thread as soon as I do.

zwass

07/12/2022, 11:09 PM
@Jason Cetina what version did you upgrade from? We have seen some issues with load balancers using deprecated SSL configurations that have been removed from support in the Go stdlib HTTP libraries we use.

Jason Cetina

07/12/2022, 11:10 PM
we upgraded from 4.1.6

zwass

07/12/2022, 11:10 PM
When you say 4.1.7 and 4.1.6, do you mean 4.17.0 and 4.16.0?

Jason Cetina

07/12/2022, 11:11 PM
Sorry, I'm a dummy. Yes, 4.17
from 4.16.0 -> 4.17.0

zwass

07/12/2022, 11:12 PM
@Ari Weinberg can you open your browser devtools on the live query page and see if you are getting any errors in the network tab or the JS console?

Jason Cetina

07/12/2022, 11:12 PM
@Kathy Satterlee fleet logs or osquery logs? do we need debugging set to true?

zwass

07/12/2022, 11:12 PM
@Jason Cetina can you look at your Fleet server logs for any errors?

Jason Cetina

07/12/2022, 11:12 PM
@zwass do I need debugging on or no?

zwass

07/12/2022, 11:13 PM
should not need it, but let's see what is in your logs

Jason Cetina

07/12/2022, 11:13 PM
just to be clear, the weird part is that this is only for part of our fleet
and to a different endpoint (internal vs external vip), but we've made no serious changes to any config
anyway, i'm looking for errors

zwass

07/12/2022, 11:14 PM
that sounds very likely to be related to the LB configuration

Jason Cetina

07/12/2022, 11:15 PM
i think so, too. I will dig again.
nothing in logs
@zwass looks like our lb team rotated a cert roughly in this timeframe.
our cert I should say
anyway, not your problem anymore. Sorry for the noise.

zwass

07/12/2022, 11:25 PM
Glad to hear it!

Ari Weinberg

07/12/2022, 11:37 PM
@zwass

zwass

07/12/2022, 11:38 PM
You don't have any kind of load balancer or anything? Looks like the LB websockets issue we commonly see.

Ari Weinberg

07/12/2022, 11:38 PM
I don't think so

zwass

07/12/2022, 11:39 PM
So you have Fleet running on a server with docker-compose -- are you connecting directly to that server?

Ari Weinberg

07/12/2022, 11:39 PM
Where would the proxy cause an issue? between the server and agent? or between the client (web UI) and server?

zwass

07/12/2022, 11:39 PM
web UI and server

Ari Weinberg

07/12/2022, 11:39 PM
Ahhh. There is a proxy there

zwass

07/12/2022, 11:40 PM
Typically googling "<name of proxy> websocket configuration" is the best way to address this
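One way to confirm whether a proxy is the culprit before digging into its configuration is to send a bare WebSocket opening handshake through it and check for a 101 Switching Protocols response. Below is a minimal Python sketch. It assumes the host, port, and socket path are whatever the browser devtools network tab shows for the live query page; the values in the usage note are placeholders, not Fleet's documented endpoint.

```python
import base64
import os
import socket
import ssl


def build_upgrade_request(host: str, path: str) -> bytes:
    """Build a minimal WebSocket opening handshake (RFC 6455) request."""
    # The key is 16 random bytes, base64-encoded, as the RFC requires.
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def upgrade_accepted(response: bytes) -> bool:
    """Return True if the status line reports 101 Switching Protocols."""
    status_line = response.split(b"\r\n", 1)[0]
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == b"101"


def _exchange(conn, host: str, path: str) -> bool:
    """Send the handshake on an open connection and inspect the reply."""
    conn.sendall(build_upgrade_request(host, path))
    return upgrade_accepted(conn.recv(4096))


def probe(host: str, port: int, path: str, use_tls: bool = True) -> bool:
    """Send the handshake through whatever sits in front of the server."""
    with socket.create_connection((host, port), timeout=10) as raw:
        if use_tls:
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(raw, server_hostname=host) as conn:
                return _exchange(conn, host, path)
        return _exchange(raw, host, path)
```

Running `probe("fleet.example.com", 443, "/path/from/devtools")` once through the proxy and once directly against the server should show whether the upgrade survives the proxy. If the direct request also fails, the path is probably wrong rather than the proxy.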

Ari Weinberg

07/12/2022, 11:40 PM
Will do. Thanks so much!!!!

Kathy Satterlee

07/12/2022, 11:40 PM
Thanks for the assist, @zwass !

Ari Weinberg

07/12/2022, 11:42 PM
Confirmed that's the problem by bypassing the proxy and going directly to the server. @zwass @Kathy Satterlee Thanks so much for all your help!
:partyparrot: 1

zwass

07/12/2022, 11:43 PM
Glad to hear it!

Kathy Satterlee

07/12/2022, 11:43 PM
Awesome news!

Jason Cetina

07/12/2022, 11:50 PM
@Kathy Satterlee @zwass - just to close the loop here, we didn't have osqueryd configured to use the OS-maintained cert bundle in /etc/ssl/certs/ca-certificates.crt. The root CA changed for this endpoint and so everything turfed when it got rotated. Not sure how/why it was set up that way. Anyway, it's fixed now.
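A quick way to see whether an endpoint's certificate chain still verifies against a particular CA bundle, for example after a rotation like this one, is to attempt a TLS handshake that trusts only that bundle. Below is a minimal Python sketch. The function names are made up for illustration, and it checks the bundle file from Python rather than through osqueryd itself. osquery's `--tls_server_certs` flag is presumably the setting involved here, though the thread does not say.

```python
import socket
import ssl
from typing import Optional


def make_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying TLS context.

    With cafile set, only the CAs in that PEM bundle are trusted;
    with cafile=None, Python falls back to the system default CAs.
    """
    return ssl.create_default_context(cafile=cafile)


def chain_verifies(host: str, port: int, cafile: Optional[str] = None) -> bool:
    """Return True if host:port presents a chain the bundle accepts."""
    ctx = make_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=10) as raw:
            # The handshake (and certificate verification) happens here.
            with ctx.wrap_socket(raw, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

Comparing `chain_verifies(host, 443, "/etc/ssl/certs/ca-certificates.crt")` with the same call against whatever bundle the agent was actually pointed at would show which of the two no longer trusts the new root.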
🚀 1

zwass

07/12/2022, 11:50 PM
Sweet! Thank you 🙂

Jason Cetina

07/12/2022, 11:51 PM
Again, sorry for the noise.

Kathy Satterlee

07/12/2022, 11:51 PM
Thanks for the info! And for reaching out :)
👍 1

Ari Weinberg

07/15/2022, 3:13 PM
Sooooo, hate to re-open this, but I fixed the websocket issue, and I'm still getting no results... I'm getting the same error on the fleet server as before:
{
  "component": "http",
  "err": "error in query ingestion",
  "ingestion-err": "campaign stopped",
  "ip_addr": "AGENT-IP:50524",
  "level": "error",
  "method": "POST",
  "took": "2.406993ms",
  "ts": "2022-07-15T15:11:55.292524773Z",
  "uri": "/api/v1/osquery/distributed/write",
  "x_for_ip_addr": ""
}
Getting the following in the console:
Any tips?
The query is coming through and executing on the agent, but results are not being returned
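Worth noting when comparing the two reports: the `ingestion-err` value here is "campaign stopped", whereas the first one was "campaign waiting for listener (please retry)". To see which messages show up and how often, the server log can be tallied. Below is a minimal Python sketch. It assumes the Fleet server emits one JSON object per line, unlike the pretty-printed excerpts above, and the function name is made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable

# URI taken from the log excerpts in this thread.
DISTRIBUTED_WRITE_URI = "/api/v1/osquery/distributed/write"


def tally_ingestion_errors(lines: Iterable[str]) -> Counter:
    """Count each distinct ingestion-err message on distributed/write."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("uri") == DISTRIBUTED_WRITE_URI and "ingestion-err" in entry:
            counts[entry["ingestion-err"]] += 1
    return counts
```

Feeding it the server log (for instance `tally_ingestion_errors(open("fleet.log"))`, with the filename being a placeholder) and printing `.most_common()` gives a quick view of whether the error changed after the proxy fix.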