# golang
s
But, reviewing the locker code (https://github.com/osquery/osquery-go/blob/master/locker.go): does anything time out a lock?
a
In my tests, as far as I remember, the lock timed out right away if there was already a query in flight on the same connection and you tried to execute another query. The default lock timeout is 200ms at the moment.
If we run a query that takes 3 minutes with a 1-minute transport timeout, for example, then we see i/o timeout errors for another 2 minutes, until the original query finishes running (last time I tested). So the connection becomes unusable for other queries.
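To make that failure mode concrete, here is a minimal sketch assuming osquery-go's `NewClient`, `Query`, and `Close`; the socket path, timeout, and queries are placeholders rather than the exact setup from these tests:

```go
package main

import (
	"fmt"
	"time"

	osquery "github.com/osquery/osquery-go"
)

func main() {
	// Placeholder socket path; the real one depends on the osquery setup.
	sock := "/var/osquery/osquery.em"

	// The timeout here is the one supplied when the client is created.
	client, err := osquery.NewClient(sock, 10*time.Second)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Kick off a long-running query on the shared connection.
	go func() {
		_, err := client.Query("SELECT * FROM file WHERE path LIKE '/tmp/%%'")
		fmt.Println("long query finished:", err)
	}()

	time.Sleep(500 * time.Millisecond)

	// While the first query is in flight, this should fail quickly with a
	// lock timeout; once the transport timeout fires, later attempts keep
	// returning i/o timeout errors until the long query actually finishes.
	_, err = client.Query("SELECT version FROM osquery_info")
	fmt.Println("second query:", err)
}
```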
s
That’s what I thought you were saying.
a
the workaround that I tested, and that seems to be working, is to open a new connection for each individual query
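A rough sketch of that workaround, assuming the same osquery-go `NewClient`/`Query`/`Close` calls and the `ExtensionResponse` Status/Response fields; the socket path and query are placeholders:

```go
package main

import (
	"fmt"
	"time"

	osquery "github.com/osquery/osquery-go"
)

// runQuery opens a fresh osquery-go client for one query and closes it right
// after, so a wedged connection never outlives the query that wedged it.
func runQuery(sock, sql string, timeout time.Duration) ([]map[string]string, error) {
	client, err := osquery.NewClient(sock, timeout)
	if err != nil {
		return nil, err
	}
	defer client.Close()

	resp, err := client.Query(sql)
	if err != nil {
		return nil, err
	}
	if resp.Status != nil && resp.Status.Code != 0 {
		return nil, fmt.Errorf("query failed: %s", resp.Status.Message)
	}
	return resp.Response, nil
}

func main() {
	rows, err := runQuery("/var/osquery/osquery.em", "SELECT version FROM osquery_info", 10*time.Second)
	fmt.Println(rows, err)
}
```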
s
Do you know where we set a transport timeout?
Huh, a new connection per query. That sounds brilliant
a
when the osquery-go client is created
well, I tested the new connection per query approach with 500K query runs; it completed fine on macOS, but it crashed the osqueryd process on Windows after 200K+ queries.
s
I don’t think we expose any kind of socket transport timeout: https://github.com/osquery/osquery-go/blob/master/client.go#L68C1-L68C1
In your test, are those concurrent, or sequential? If it’s sequential, I assume there’s some leak somewhere
a
so I'd have to look more into it. But at this point we'd probably have to capture the crash and restart, since it is fairly rare... and if I only open a new connection after an error, it's going to be even rarer
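A sketch of the "new connection after error only" idea: a thin wrapper that drops and reopens the client whenever a query fails. The wrapper type is made up for illustration and isn't safe for concurrent use as written; only `NewClient`, `Query`, and `Close` are assumed from osquery-go:

```go
package main

import (
	"fmt"
	"time"

	osquery "github.com/osquery/osquery-go"
)

// reconnectingClient reopens the osquery connection only after a query error,
// rather than opening a brand-new connection for every query.
type reconnectingClient struct {
	sock    string
	timeout time.Duration
	client  *osquery.ExtensionManagerClient
}

func (r *reconnectingClient) Query(sql string) ([]map[string]string, error) {
	if r.client == nil {
		c, err := osquery.NewClient(r.sock, r.timeout)
		if err != nil {
			return nil, err
		}
		r.client = c
	}

	resp, err := r.client.Query(sql)
	if err != nil {
		// Drop the possibly wedged connection; the next call reconnects.
		r.client.Close()
		r.client = nil
		return nil, err
	}
	return resp.Response, nil
}

func main() {
	c := &reconnectingClient{sock: "/var/osquery/osquery.em", timeout: 10 * time.Second}
	rows, err := c.Query("SELECT version FROM osquery_info")
	fmt.Println(rows, err)
}
```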
s
My gut sense is that this is a deep issue with how this is using thrift today, and that it’s more correct to open a new connection instead of pipelining everything into one. (Or possibly opening a couple and multiplexing, but I wonder how well that would work.)
a
in my tests I was running 5 goroutines in parallel, each executing 100K queries, creating a new connection for each query
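For concreteness, the harness could look roughly like this (a sketch, not the actual test code; the socket path, query, and timeout are placeholders):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	osquery "github.com/osquery/osquery-go"
)

func main() {
	const (
		workers   = 5
		perWorker = 100000
		sock      = "/var/osquery/osquery.em"
	)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				// New connection for every single query.
				client, err := osquery.NewClient(sock, 10*time.Second)
				if err != nil {
					fmt.Printf("worker %d: connect error: %v\n", id, err)
					continue
				}
				_, err = client.Query("SELECT version FROM osquery_info")
				client.Close()
				if err != nil {
					fmt.Printf("worker %d: query error: %v\n", id, err)
				}
			}
		}(w)
	}
	wg.Wait()
}
```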
s
Ha. That’s a lot
a
yeah, just wanted to check if there will be any issues long term
s
I don’t think this indicates an issue with the recent locking changes. It’s another bug. (And arguably fixing it would have fixed the locking issue better)
a
the worst issue is probably that long-running queries keep running in osqueryd after the transport timeout, bloating memory and pegging the CPU
yeah, I don't think there is an issue with locking
it's just that the general RPC implementation doesn't handle these kinds of failures well; you can't reuse the connection after the transport timeout errors
s
I think this is reasonable to fix. But I’m not sure what the best fix is
a
I'm not that familiar with it either; I'd have to dig in and see how the RPC is handled in osquery itself