# fleet
a
Hey all, just updated Fleet to 4.17.0 and I'm having a problem with distributed queries. I'm getting the following error:
{
  "component": "http",
  "err": "error in query ingestion",
  "ingestion-err": "campaign waiting for listener (please retry)",
  "ip_addr": "ENDPOINT-IP:41730",
  "level": "error",
  "method": "POST",
  "took": "1.136788ms",
  "ts": "2022-07-12T20:38:50.060101107Z",
  "uri": "/api/v1/osquery/distributed/write",
  "x_for_ip_addr": ""
}
Also getting (although not sure it's related):
{
  "component": "http",
  "err": "read auth token: reading from websocket: sockjs: session not in open state",
  "msg": "failed to read auth token",
  "ts": "2022-07-12T20:37:57.77330272Z"
}
The problem appears to be the agent talking back to the Fleet server, because I can see the query being run on the agent in debug mode. It just seems to fail when posting back the results. The agent is vanilla osquery 5.1.0. This only started after I updated a few minutes ago from Fleet 4.9.1.
k
Moving the conversation into a thread. Sorry about that! Are you using a proxy?
a
Nope
Tracert shows 1 direct hop from the agent to the server
server is running in docker though
Did the osquery api endpoint change?
k
There were some changes in 4.13.2 related to websockets. It seems like we're starting to see these issues when websocket traffic isn't allowed because SockJS isn't reliable as a workaround. Can you tell me more about your Fleet/MySQL/Redis setup? I can reach out to the team and see what they suggest we poke at.
a
all on a single server spun up with docker-compose
k
Would you mind sharing your compose file?
With any sensitive data removed, of course.
a
I dm'd it to you
Thanks so much for your help!
j
Erm, i'm not sure if this is related or not, but we updated to 4.1.7 yesterday and all our hosts show as offline in fleet.
just happened an hour ago
k
@Jason Cetina Can you check your logs and see if you're seeing similar errors?
Thanks, @Ari Weinberg. I'll bring this to the team and see if we can sort out what's up. I may not get a response until tomorrow, but I'll update this thread as soon as I do.
z
@Jason Cetina what version did you upgrade from? We have seen some issues with load balancers using deprecated SSL configurations that have been removed from support in the Go stdlib HTTP libraries we use.
j
we upgraded from 4.1.6
z
When you say 4.1.7 and 4.1.6, do you mean 4.17.0 and 4.16.0?
j
Sorry, I'm a dummy, yes 4.17
from 4.16.0 ->4.17.0
z
@Ari Weinberg can you open your browser devtools on the live query page and see if you are getting any errors in the network tab or the JS console?
j
@Kathy Satterlee fleet logs or osquery logs? do we need debugging set to true?
z
@Jason Cetina can you look at your Fleet server logs for any errors?
j
@zwass do I need debugging on or no?
z
should not need it, but let's see what is in your logs
j
just to be clear, the weird part is that this is only for part of our fleet
and to a different endpoint (internal vs external vip), but we've made no serious changes to any config
anyway, i'm looking for errors
z
that sounds very likely to be related to the LB configuration
j
i think so, too. I will dig again.
nothing in logs
@zwass looks like our lb team rotated a cert roughly in this timeframe.
our cert I should say
anyway, not your problem anymore. Sorry for the noise.
z
Glad to hear it!
a
@zwass
z
You don't have any kind of load balancer or anything? Looks like the LB websockets issue we commonly see.
a
I don't think so
z
So you have Fleet running on a server with docker-compose -- are you connecting directly to that server?
a
Where would the proxy cause an issue? between the server and agent? or between the client (web UI) and server?
z
web UI and server
a
Ahhh. There is a proxy there
z
Typically googling "<name of proxy> websocket configuration" is the best way to address this
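One way to confirm that a proxy is dropping the websocket upgrade is to send the opening handshake by hand and look for `101 Switching Protocols`. The following is a minimal sketch using only the Python standard library. The host, port, and path are placeholders, not values from this thread, so substitute the websocket URL shown in the browser devtools network tab. The function names are made up for illustration.

```python
import base64
import os
import socket
import ssl


def build_upgrade_request(host: str, path: str) -> bytes:
    """Build a minimal RFC 6455 opening handshake request."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def upgrade_accepted(response_head: bytes) -> bool:
    """Return True if the status line reports 101 Switching Protocols."""
    status_line = response_head.split(b"\r\n", 1)[0]
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == b"101"


def check_websocket(host: str, port: int, path: str, use_tls: bool = True) -> bool:
    """Send the handshake and report whether the upgrade was accepted."""
    raw = socket.create_connection((host, port), timeout=5)
    if use_tls:
        sock = ssl.create_default_context().wrap_socket(raw, server_hostname=host)
    else:
        sock = raw
    with sock:
        sock.sendall(build_upgrade_request(host, path))
        return upgrade_accepted(sock.recv(4096))
```

Running `check_websocket` once against the proxy address and once directly against the server should show whether the upgrade only fails on the proxied path.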
a
Will do. Thanks so much!!!!
k
Thanks for the assist, @zwass !
a
Confirmed that's the problem by bypassing the proxy and going directly to the server. @zwass @Kathy Satterlee Thanks so much for all your help!
z
Glad to hear it!
k
Awesome news!
j
@Kathy Satterlee @zwass - just to close the loop here, we didn't have osqueryd configured to use the OS maintained cert bundle in /etc/ssl/certs/ca-certificates.crt. The root CA changed for this endpoint and so everything turfed when it got rotated. Not sure how/why it was set up that way. Anyway, it's fixed now.
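A quick way to catch this kind of root CA mismatch before hosts go offline is to test the endpoint against the exact bundle the agent trusts. This is a rough sketch, assuming the bundle is the file osqueryd is pointed at (for example through osquery's `--tls_server_certs` flag, which the thread does not confirm). The function name and the example host are hypothetical.

```python
import socket
import ssl
from typing import Tuple


def verify_with_bundle(host: str, port: int, cafile: str) -> Tuple[bool, str]:
    """Try a TLS handshake trusting only the CAs in `cafile`."""
    try:
        # Passing cafile means only that bundle is trusted, not the system defaults.
        ctx = ssl.create_default_context(cafile=cafile)
        with socket.create_connection((host, port), timeout=5) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version() or "unknown"
    except ssl.SSLError as exc:
        # Includes certificate verification failures such as an unknown root CA.
        return False, f"TLS verification failed: {exc}"
    except OSError as exc:
        # Covers a missing bundle file, refused connections, and timeouts.
        return False, f"connection or file error: {exc}"
```

For example, `verify_with_bundle("fleet.example.com", 443, "/etc/ssl/certs/ca-certificates.crt")` can be compared with the same call using the agent's old pinned bundle to see which one still validates after a rotation.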
z
Sweet! Thank you 🙂
j
Again, sorry for the noise.
k
Thanks for the info! And for reaching out :)
a
Sooooo, hate to re-open this, but I fixed the websocket issue, and I'm still getting no results... I'm getting a similar error on the Fleet server as before, though now the ingestion error is "campaign stopped":
{
  "component": "http",
  "err": "error in query ingestion",
  "ingestion-err": "campaign stopped",
  "ip_addr": "AGENT-IP:50524",
  "level": "error",
  "method": "POST",
  "took": "2.406993ms",
  "ts": "2022-07-15T15:11:55.292524773Z",
  "uri": "/api/v1/osquery/distributed/write",
  "x_for_ip_addr": ""
}
Getting the following in the console:
Any tips?
The query is coming through and executing on the agent, but results are not being returned
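To see whether "campaign stopped" has fully replaced "campaign waiting for listener (please retry)" after the proxy fix, the server log can be tallied by `ingestion-err` value. A minimal sketch, assuming the Fleet server writes one JSON object per log line (the entries above look pretty-printed for the chat). The function name is made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def tally_ingestion_errors(lines: Iterable[str]) -> Counter:
    """Count log entries by their "ingestion-err" value, skipping non-JSON lines."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if isinstance(entry, dict) and "ingestion-err" in entry:
            counts[entry["ingestion-err"]] += 1
    return counts
```

Feeding it an open log file, for example `tally_ingestion_errors(open("fleet.log"))` with a hypothetical file name, gives a per-message count that can be compared before and after a configuration change.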