# fleet
m
Fleet osquery has stopped on multiple Windows machines, and starting it from Services fails with the Windows error "Cannot find the file specified". While debugging, I found that orbit.exe is missing from the bin/orbit/windows/stable folder (the folder is empty), and in some cases the bin/orbit/windows folder itself is missing. The orbit.exe symlink just points to a non-existent file. Any idea what could have happened?
Some previous orbit logs
For the record, 'Fleet osquery' is still listed under installed programs in "Programs and Features", but even a manual start using Start-Service did not work. My guess is that orbit.exe went missing during its auto-update routine, but I'm not sure why it would.
Is it possible that cutting off the internet connection during an orbit update would leave it in this missing state?
t
Hi there! It's very unlikely that the issue is caused by the auto-update mechanism. We first download the files, then verify them, and then replace the files in their installed location with an atomic rename.
Let me check with the team to see if any ideas come up as to what might be happening.
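For readers following along, the download, verify, then rename sequence described above can be sketched roughly as follows. This is a generic illustration of the pattern, not Orbit's actual code: the function name `install_update` is hypothetical, and the SHA-256 check is only a stand-in for whatever verification Orbit really performs.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def install_update(new_bytes: bytes, expected_sha256: str, target: Path) -> None:
    """Stage the new file, verify it, then swap it into place in one step.

    Hypothetical helper: names and the checksum scheme are assumptions.
    """
    # Stage next to the target so the final rename stays on the same volume.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_bytes)
        # Verify the staged file before touching the installed one.
        digest = hashlib.sha256(Path(tmp_name).read_bytes()).hexdigest()
        if digest != expected_sha256:
            raise ValueError("checksum mismatch; keeping the installed file")
        # os.replace overwrites the target in a single rename operation.
        os.replace(tmp_name, target)
    finally:
        # Clean up the staged file if it was never moved into place.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
```

With this ordering, an interrupted download or a failed check leaves the installed file untouched, which is why a dropped connection alone would not be expected to remove orbit.exe.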
l
Hi Manish! Could you look for errors in the last log lines of C:\Windows\system32\config\systemprofile\AppData\Local\FleetDM\Orbit\Logs\orbit-osquery.log? (If this is happening on many hosts, please check the logs on each.)
During the update download there is (1) a rename from orbit.exe to orbit.exe.old, then (2) a second rename from the newly downloaded tmpFile to orbit.exe. AFAICS:
• If rename (1) fails and the rename is not atomic (we should check whether that can happen), then orbit.exe could be missing.
• If rename (2) fails, there should be an orbit.exe.old in the bin/orbit/windows/stable directory.
Any particular file system setup on these hosts? Did all hosts fail the same way (missing orbit.exe)?
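The two failure cases above suggest a quick check that can be run on an affected host. The sketch below is a hypothetical helper (the function name and the returned labels are made up for illustration); it only looks at which files are present in the stable directory and maps that to the cases listed above.

```python
from pathlib import Path


def diagnose(stable_dir: Path) -> str:
    """Classify the state of the stable directory (labels are illustrative)."""
    if not stable_dir.is_dir():
        # Neither rename removes the directory, so something else did.
        return "dir_missing"
    # Path.exists() follows symlinks, so a dangling link counts as missing.
    has_exe = (stable_dir / "orbit.exe").exists()
    has_old = (stable_dir / "orbit.exe.old").exists()
    if has_exe:
        return "ok"
    if has_old:
        # Consistent with rename (1) succeeding and rename (2) failing.
        return "rename2_failed"
    # No executable and no backup: a non-atomic rename (1) or an outside deletion.
    return "exe_missing_no_backup"


if __name__ == "__main__":
    # Path taken from the thread; adjust if Orbit is installed elsewhere.
    print(diagnose(Path(r"C:\Program Files\Orbit\bin\orbit\windows\stable")))
```

An empty directory with no orbit.exe.old, as reported earlier in the thread, falls into the last case, which is the one the update sequence alone does not easily explain.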
m
Hi Lucas, I didn't see any errors that seemed related to orbit; most errors seemed related to the queries, as seen in the screenshot above. (I will check once more.) There are about 1000 Windows nodes enrolled and about 400 are offline now. I think they started going offline gradually, in batches of approximately 20 per day, until they reached the 400 mark. I checked two systems myself: in one, the bin/orbit/windows folder itself was missing, and in the other, orbit.exe was missing from the bin/orbit/windows/stable folder, as seen in the picture. AFAIK, most of the machines are offline due to the same issue (unable to start the service, "cannot find the file specified").
Some systems have network-mounted drives, though hopefully not the C drive. Could that be an issue? I do not have direct access to the machines, but I can check things via another person if required. If you could guide me through the most probable causes, I will check them off the list to get to the root cause. I can of course uninstall and reinstall Fleet osquery on all machines via Puppet to fix the problem for the time being, but that might not help if the same problem crops up again.
l
> Hi Lucas, I didn't see any error that seemed related to orbit, most errors seemed related to the queries as seen in the above screenshot. (will check once more).
OK, let us know if you find any ERR logs; it would help us troubleshoot.
> There are about 1000 windows nodes enrolled and about 400 are offline now
Is there any way you can inspect the Windows Event logs? (Maybe the file deletion events show up there.)
> I am not sure but in some systems they have network mounted drives, but I hope not C drive, could that be an issue?
We don't know. Something worth checking though.
> I can of course uninstall Fleet osquery via puppet in all machines and re-install them via puppet to fix the problem for time being but that might not work if same problem crops up again.
I would suggest doing this on a couple of hosts and then monitoring them for the issue. PS: I've searched our issues and didn't find anything related. Please keep us posted.
And, sorry for the delay in the response.
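One possible way to look for file deletion events is to query the Security log with the built-in `wevtutil` tool. The sketch below builds such a query for event IDs 4660 (object deleted) and 4663 (object access attempt). It assumes file system auditing was already enabled on the Orbit folder before the deletion happened; without that, Windows does not record these events. The function name `build_query` is hypothetical.

```python
import subprocess


def build_query(event_ids=(4660, 4663), max_events=50):
    """Build a wevtutil command that lists recent Security events by ID."""
    id_filter = " or ".join(f"EventID={i}" for i in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "Security",
        f"/q:{xpath}",        # XPath filter on the event ID
        f"/c:{max_events}",   # maximum number of events to return
        "/rd:true",           # newest events first
        "/f:text",            # human-readable output
    ]


if __name__ == "__main__":
    # Windows only; reading the Security log needs an elevated prompt.
    result = subprocess.run(build_query(), capture_output=True, text=True)
    print(result.stdout or result.stderr)
```

Searching the output for orbit.exe would then show which process and account touched the file, if the events were recorded at all.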
m
Hi @Lucas Rodriguez, it seems Kaspersky was the culprit: it was flagging orbit.exe as a trojan.
orbit.exe
Objects selected:
Properties
Description:
Action:
Device: redacted_for_slack
Status: Deleted
Object: UDS:Trojan-Downloader.Win32.Agent.xxzyee
Date of placement: 04-04-2022 71245 AM
Size (bytes):
Path: C:\Program Files\Orbit\bin\orbit\windows\stable\orbit.exe
User: NT AUTHORITY\SYSTEM
I guess this will not be solved in Fleet itself, but rather by adding an exception in Kaspersky?
l
Hi Manish, glad you found the root cause.
> adding an exception in the kaspersky?
For now yes. But I'll be creating an issue to investigate why this is happening. I'll link it here so that you can add all the information you have there.
#5049 Please add any extra information (that you can share) that can help us reproduce the issue (Windows version, Kaspersky product and version, etc.).
z
Hi Manish, thank you for reporting this. I've submitted the Orbit executable to Kaspersky as a false positive and we'll see if they can fix it on their end. I agree to go ahead with adding an exception.
The person from Kaspersky said the fix will be out in 2-3 hours. Please let us know if it resolves itself!
m
Great, thanks Zach for the quick turnaround. I will check and let you know, but since orbit.exe has already been deleted, I think I will have to do a fresh uninstall (ensure absent) and then reinstall (ensure installed) of 'Fleet osquery' via Puppet on all the affected machines. Will update here.
@zwass I did an uninstall and reinstall of fleet-osquery (through Puppet), but orbit.exe was deleted by Kaspersky again on around 500 hosts during the past two weeks. Maybe they didn't push the change? I assumed Kaspersky signatures are auto-updated and that I would not have to do anything. I guess I will have to manually exclude orbit.exe using some setting in the Kaspersky dashboard; I will check how to do that.
z
Yeah, it seems like Kaspersky keeps flagging new versions of Orbit as malware (false positives). If you can add an exception, that would be good. I will contact Kaspersky again and ask them to fix it.