Hey folks - I raised this issue <https://github.co...
# kolide
j
Hey folks - I raised this issue https://github.com/kolide/fleet/issues/2008 yesterday regarding
fleetctl
not returning any results and updated it today with some more info... I'm afraid I'm new to
fleetctl
so I could well be doing something wrong, but afaik it seems to be set up correctly. Unfortunately I don't get any error messages, just a blank line with no results. Any suggestions or guidance on how to get more debug info (
--debug
doesn't show anything) that could point me in the direction of what might be going wrong would be greatly appreciated! Thanks in advance!
z
Are you able to query via the Fleet UI?
j
yes
z
When you do that, look in the network panel in the devtools and see whether it is using a websocket or XHR requests.
Is your Fleet server behind a load balancer?
j
yes - I believe it is
z
Ah yes I see that it is
Does your load balancer support websockets?
j
...checking
z
I suspect that your load balancer doesn't support websockets... In the browser we have a good library for getting around this with XHR requests. In Go (which fleetctl is written in) we don't.
r
Zack, if we run fleetctl directly on 1 backend kolide server, would it help?
z
Yeah if you connect fleetctl directly to one of the servers it ought to work fine.
r
One more thing: if I enable session persistence on the L7 LB, would that help too?
z
What matters is whether the LB supports websockets. It doesn't matter which Fleet server the client connects to as long as (1) Redis is working and (2) websockets are supported for communication with the client.
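For anyone wanting to test the websocket point independently of fleetctl, here is a minimal sketch of a handshake probe using only the Python standard library. It sends an RFC 6455 upgrade request and checks for a `101` response with a matching `Sec-WebSocket-Accept` header. The function names are made up for illustration, and the host, port and path passed to `probe` are placeholders: the actual Fleet websocket path is not stated in this thread, so a non-101 answer on a guessed path does not by itself prove the LB is at fault.

```python
import base64
import hashlib
import os
import socket
import ssl

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key: str) -> str:
    """Return the Sec-WebSocket-Accept value a server must send for `key`."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_handshake(host: str, path: str, key: str) -> bytes:
    """Build the raw HTTP/1.1 upgrade request for a websocket handshake."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def upgrade_accepted(response_head: str, key: str) -> bool:
    """True if the response head is a 101 with the correct accept header."""
    lines = response_head.split("\r\n")
    parts = lines[0].split(" ", 2)
    if len(parts) < 2 or parts[1] != "101":
        return False
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers.get("sec-websocket-accept") == expected_accept(key)


def probe(host: str, port: int, path: str, use_tls: bool = True) -> bool:
    """Attempt a websocket handshake against host:port (placeholder values)."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    sock = socket.create_connection((host, port), timeout=10)
    try:
        if use_tls:
            ctx = ssl.create_default_context()
            sock = ctx.wrap_socket(sock, server_hostname=host)
        sock.sendall(build_handshake(host, path, key))
        data = b""
        # Read until the end of the response headers or the peer closes.
        while b"\r\n\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    finally:
        sock.close()
    head = data.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    return upgrade_accepted(head, key)
```

Running `probe` once against the LB address and once against a single Fleet server directly, with the same path, would show whether the upgrade survives the LB. If the direct call returns `True` and the LB call returns `False`, the LB is likely stripping or rejecting the `Upgrade` header.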
j
It does appear that websockets are supported in the LB - is there a configuration setting which is required to ensure fleetctl can communicate through them? Also, how might I be able to see more debug information coming back from
fleetctl
- I'm sure I've seen, in some issues, a fair amount of debug info being returned after running a command very similar to mine, whereas mine returns nothing.
z
I did some testing and it does seem that fleetctl should print an error if it is unable to establish the websocket connection. It will also print any errors it encounters from the server, and errors reading from the websocket connection. Can you check a couple more things please? What version of fleetctl are you running (
fleetctl --version
)? Do you see any logs on the server indicating that the client connected? Any errors?
j
fleetctl - version 2.0.2
  branch:       master
  revision:     8ca0358bf28173685815b79d8683a4239d629a14
  build date:   2019-01-18T00:39:59Z
  build user:   zwass
  go version:   go1.11.3
Do you mean logs on the fleet server? If so, not sure where I'd find them (sorry)?
z
Yeah, on the Fleet server. They are printed to stderr of the Fleet process.
Can you please clarify one more thing... In the examples in https://github.com/kolide/fleet/issues/2008 are both of those commands exiting immediately (or near immediately)?
j
I will try and get the stderr output - in the meantime, yes, they exit immediately.
z
Can you show the output of
fleetctl get label 'All Hosts'
? Also, a screenshot of the Fleet dashboard page sidebar with the labels. Like this:
j
$ fleetctl get label 'All Hosts'
apiVersion: v1
kind: label
spec:
  ID: 0
  description: All hosts which have enrolled in Kolide
  label_type: 1
  name: All Hosts
  query: select 1;
$
z
I am still interested in the Fleet server logs. Also, are you able to create a label with a query like
select 1
(which should include all hosts) and query against that label?
j
OK - I can see that although the
fleetctl
call fails, the results are being written to
/var/log/kolide/status.log
on each of the 3 load balanced kolide servers.
The number of hosts online has greatly reduced now (not sure why)
...and now, queries are working as expected for 171 hosts online
6% responded (99% online) | 170/2680 targeted hosts (170/171 online)
I created a label `test label`:
$ fleetctl get label --context my_context 'test label'
apiVersion: v1
kind: label
spec:
  ID: 0
  description: ""
  label_type: 0
  name: test label
  query: select 1
$
@zwass - any more thoughts on this please? Current situation is that when hosting fleet on multiple servers through a load balancer,
fleetctl
will execute and immediately fail:
[----------]$ fleetctl query --timeout 25m --query 'select name,version from os_version' --labels 'All Hosts' --exit
⠋ [----------]$
[----------]$ fleetctl query --timeout 25m --query 'select name,version from os_version' --labels 'All Hosts'
0% responded (0% online) | 0/0 targeted hosts (0/0 online)
[----------]$
On closer inspection of the
/var/log/kolide/status.log
on each of the 3 load balanced kolide servers I can see that data is actually being captured - it's just not being returned to the
fleetctl
call. Is there a timing setting that might be causing
fleetctl
to drop out before any results have been captured?
z
Are you able to query that test label?
j
Yes - I can query against my 'test label' and 'All Hosts'.
fleetctl
seems to work more reliably with just a small number of hosts online (even through our load balancer). It appears to be more of an issue when we have a couple thousand hosts online.
z
When you query the test label you get results? What about if you query individual hosts?
j
I don't think the problem is to do with the label or individual host queries - it appears to be related to how load balanced fleet servers communicate back to
fleetctl
. As a workaround, we have reduced the number of fleet servers down to 1 and early testing appears to indicate that
fleetctl
is working reliably now.
Hi @zwass - I'm still having issues with this environment despite reducing the number of fleet servers to 1 through the load balancer. I've managed to look in the fleet stderr log in docker and I get a few of the following errors occurring around the time that
fleetctl
fails:
{"log":"{\"err\":\"sending: sockjs: session not in open state\",\"msg\":\"error writing to channel\",\"ts\":\"2019-03-24T04:08:13.081053265Z\"}\n","stream":"stderr","time":"2019-03-24T04:08:13.08115086Z"}
{"log":"{\"component\":\"service\",\"err\":null,\"ip_addr\":\"192.168.20.3:60635\",\"method\":\"SubmitDistributedQueryResults\",\"took\":\"8.070283ms\",\"ts\":\"2019-03-24T04:08:13.086910133Z\"}\n","stream":"stderr","time":"2019-03-24T04:08:13.087039196Z"}
{"log":"{\"err\":\"write status\",\"msg\":\"error updating status\",\"ts\":\"2019-03-24T04:08:13.322847484Z\"}\n","stream":"stderr","time":"2019-03-24T04:08:13.32299169Z"}
Any idea what might be causing this?
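The lines above are double-encoded: Docker's JSON log wrapper (`log`, `stream`, `time`) contains Fleet's own JSON log entry as a string. A small sketch like the following, with hypothetical helper names, can unwrap them and keep only the entries with a non-null `err`, which makes it easier to line errors up with the moment fleetctl exits. The sample data is just the three lines pasted above.

```python
import json

# The three stderr lines quoted above, kept verbatim as raw strings.
SAMPLE = [
    r'{"log":"{\"err\":\"sending: sockjs: session not in open state\",\"msg\":\"error writing to channel\",\"ts\":\"2019-03-24T04:08:13.081053265Z\"}\n","stream":"stderr","time":"2019-03-24T04:08:13.08115086Z"}',
    r'{"log":"{\"component\":\"service\",\"err\":null,\"ip_addr\":\"192.168.20.3:60635\",\"method\":\"SubmitDistributedQueryResults\",\"took\":\"8.070283ms\",\"ts\":\"2019-03-24T04:08:13.086910133Z\"}\n","stream":"stderr","time":"2019-03-24T04:08:13.087039196Z"}',
    r'{"log":"{\"err\":\"write status\",\"msg\":\"error updating status\",\"ts\":\"2019-03-24T04:08:13.322847484Z\"}\n","stream":"stderr","time":"2019-03-24T04:08:13.32299169Z"}',
]


def unwrap_docker_line(line: str) -> dict:
    """Decode the Docker wrapper, then the JSON entry stored in its 'log' field."""
    outer = json.loads(line)
    inner = json.loads(outer["log"])
    inner["_stream"] = outer.get("stream")
    return inner


def error_entries(lines):
    """Yield only the inner log entries whose 'err' field is set."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = unwrap_docker_line(line)
        except (ValueError, KeyError, TypeError):
            # Skip lines that are not double-encoded JSON.
            continue
        if entry.get("err"):
            yield entry


if __name__ == "__main__":
    for entry in error_entries(SAMPLE):
        print(entry["ts"], entry.get("msg"), "|", entry["err"])
```

On the sample this keeps the `error writing to channel` and `error updating status` entries and drops the `SubmitDistributedQueryResults` line, whose `err` is null.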
z
Have you tried connecting directly to Fleet rather than through the LB? This seems like an LB issue.