# golang
s
But, reviewing the locker code (https://github.com/osquery/osquery-go/blob/master/locker.go): does anything time out a lock?
a
In my tests, as far as I remember, the lock timed out right away if there was already a query in flight on the same connection and you tried to execute another query. The default lock timeout is 200ms at the moment.
If we run a query that takes 3 minutes with a 1-minute transport timeout, for example, then we see i/o timeout errors for another 2 minutes, until the original query finishes running (last time I tested). So the connection becomes unusable for other queries.
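To make that failure mode concrete, here is a minimal sketch assuming osquery-go's `NewClient`, `Query`, and `Close`; the socket path, timeout, and queries are placeholders rather than the exact setup from these tests:

```go
package main

import (
	"fmt"
	"time"

	osquery "github.com/osquery/osquery-go"
)

func main() {
	// Placeholder socket path; the real one depends on the osquery setup.
	sock := "/var/osquery/osquery.em"

	// The timeout here is the one supplied when the client is created.
	client, err := osquery.NewClient(sock, 10*time.Second)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Kick off a long-running query on the shared connection.
	go func() {
		_, err := client.Query("SELECT * FROM file WHERE path LIKE '/tmp/%%'")
		fmt.Println("long query finished:", err)
	}()

	time.Sleep(500 * time.Millisecond)

	// While the first query is in flight, this should fail quickly with a
	// lock timeout; once the transport timeout fires, later attempts keep
	// returning i/o timeout errors until the long query actually finishes.
	_, err = client.Query("SELECT version FROM osquery_info")
	fmt.Println("second query:", err)
}
```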
s
That’s what I thought you were saying.
a
the workaround that I tested, and that seems to be working, is to open a new connection for each individual query
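A rough sketch of that workaround, assuming the same osquery-go `NewClient`/`Query`/`Close` calls and the `ExtensionResponse` Status/Response fields; the socket path and query are placeholders:

```go
package main

import (
	"fmt"
	"time"

	osquery "github.com/osquery/osquery-go"
)

// runQuery opens a fresh osquery-go client for one query and closes it right
// after, so a wedged connection never outlives the query that wedged it.
func runQuery(sock, sql string, timeout time.Duration) ([]map[string]string, error) {
	client, err := osquery.NewClient(sock, timeout)
	if err != nil {
		return nil, err
	}
	defer client.Close()

	resp, err := client.Query(sql)
	if err != nil {
		return nil, err
	}
	if resp.Status != nil && resp.Status.Code != 0 {
		return nil, fmt.Errorf("query failed: %s", resp.Status.Message)
	}
	return resp.Response, nil
}

func main() {
	rows, err := runQuery("/var/osquery/osquery.em", "SELECT version FROM osquery_info", 10*time.Second)
	fmt.Println(rows, err)
}
```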
s
Do you know where we set a transport timeout?
Huh, a new connection per query. That sounds brilliant
a
when the osquery-go client is created
well, I tested the new connection per query approach with 500K query runs; it completed fine on macOS, but it crashed the osqueryd process on Windows after 200K+ queries.
s
I don’t think we expose any kind of socket transport timeout: https://github.com/osquery/osquery-go/blob/master/client.go#L68C1-L68C1
In your test, are those concurrent, or sequential? If it’s sequential, I assume there’s some leak somewhere
a
so I'd have to look more into it. But at this point we'd probably have to capture the crash and restart, since it is fairly rare... and if I only open a new connection after an error, it's going to be even rarer
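A sketch of the "new connection after error only" idea: a thin wrapper that drops and reopens the client whenever a query fails. The wrapper type is made up for illustration and isn't safe for concurrent use as written; only `NewClient`, `Query`, and `Close` are assumed from osquery-go:

```go
package main

import (
	"fmt"
	"time"

	osquery "github.com/osquery/osquery-go"
)

// reconnectingClient reopens the osquery connection only after a query error,
// rather than opening a brand-new connection for every query.
type reconnectingClient struct {
	sock    string
	timeout time.Duration
	client  *osquery.ExtensionManagerClient
}

func (r *reconnectingClient) Query(sql string) ([]map[string]string, error) {
	if r.client == nil {
		c, err := osquery.NewClient(r.sock, r.timeout)
		if err != nil {
			return nil, err
		}
		r.client = c
	}

	resp, err := r.client.Query(sql)
	if err != nil {
		// Drop the possibly wedged connection; the next call reconnects.
		r.client.Close()
		r.client = nil
		return nil, err
	}
	return resp.Response, nil
}

func main() {
	c := &reconnectingClient{sock: "/var/osquery/osquery.em", timeout: 10 * time.Second}
	rows, err := c.Query("SELECT version FROM osquery_info")
	fmt.Println(rows, err)
}
```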
s
My gut sense is that this is a deep issue with how this is using thrift today, and that it’s more correct to open a new connection instead of pipelining everything into one. (Or possibly opening a couple and multiplexing, but I wonder how well that would work.)
a
in my tests I was running 5 goroutines in parallel, each executing 100K queries, creating a new connection for each query
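For concreteness, the harness could look roughly like this (a sketch, not the actual test code; the socket path, query, and timeout are placeholders):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	osquery "github.com/osquery/osquery-go"
)

func main() {
	const (
		workers   = 5
		perWorker = 100000
		sock      = "/var/osquery/osquery.em"
	)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				// New connection for every single query.
				client, err := osquery.NewClient(sock, 10*time.Second)
				if err != nil {
					fmt.Printf("worker %d: connect error: %v\n", id, err)
					continue
				}
				_, err = client.Query("SELECT version FROM osquery_info")
				client.Close()
				if err != nil {
					fmt.Printf("worker %d: query error: %v\n", id, err)
				}
			}
		}(w)
	}
	wg.Wait()
}
```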
s
Ha. That’s a lot
a
yeah, just wanted to check if there will be any issues long term
s
I don’t think this indicates an issue with the recent locking changes. It’s another bug. (And arguably fixing it would have fixed the locking issue better)
a
the worst issue is probably that long-running queries keep running in osqueryd after the transport timeout, bloating memory and pegging the CPU
yeah, I don't think there is an issue with locking
it's just that the general RPC implementation doesn't handle these kinds of failures well; you can't reuse the connection after the transport timeout errors
s
I think this is reasonable to fix. But I’m not sure what the best fix is
a
I'm not that familiar with it either; I'd have to dig in and see how the RPC is handled in osquery itself