Hi team. In many production servers orbit-agent is...
# fleet
l
Hi team. On many production servers, orbit-agent keeps trying to start in a loop but fails. We get errors like:
May 19 15:10:39 HOST systemd[1]: Started Orbit osquery.
May 19 15:10:39 HOST orbit[2667636]: 2023-05-19T15:10:39+03:00 INF running with auto updates disabled
May 19 15:10:39 HOST orbit[2667636]: 2023-05-19T15:10:39+03:00 INF token rotation is enabled
May 19 15:10:39 HOST orbit[2667636]: 2023-05-19T15:10:39+03:00 INF start osqueryd cmd="/opt/orbit/bin/osqueryd/linux/5.7.0/osqueryd --pidfile=/opt/orbit/osquery.pid --database_path=/opt/orbit/osquery.db --extensions_socket=/opt/orbit/orbit-osquery.em --logger_path=/opt/orbit/osquery_log --enroll_secret_env ENROLL_SECRET --host_identifier=uuid --tls_hostname=HOSTNAME --enroll_tls_endpoint=/api/v1/osquery/enroll --config_plugin=tls --config_tls_endpoint=/api/v1/osquery/config --config_refresh=60 --disable_distributed=false --distributed_plugin=tls --distributed_tls_max_attempts=10 --distributed_tls_read_endpoint=/api/v1/osquery/distributed/read --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write --logger_plugin=tls,filesystem --logger_tls_endpoint=/api/v1/osquery/log --disable_carver=false --carver_disable_function=false --carver_start_endpoint=/api/v1/osquery/carve/begin --carver_continue_endpoint=/api/v1/osquery/carve/block --carver_block_size=2000000 --tls_server_certs /opt/orbit/certs.pem --augeas_lenses /opt/orbit/lenses --force --flagfile /opt/orbit/osquery.flags"
May 19 15:10:39 HOST osqueryd[2667654]: osqueryd started [version=5.7.0]
May 19 15:11:09 HOST orbit[2667636]: 2023-05-19T15:11:09+03:00 INF calling flags update
May 19 15:11:49 HOST orbit[2667636]: 2023-05-19T15:11:49+03:00 ERR unexpected exit error="extension socket stat timeout"
In the Fleet server, the agents' status is "offline". What should we do about this error? How can we start the agents? Orbit: 1.5.0, osquery: 5.7.0, Debian: 11.
r
Hey Lili, can you clarify what you were trying to do and what command you were running?
l
Hello @Rachel Perkins. Everything worked fine until the Fleet server went down and was unavailable for about 10 hours. After that, some agents lost contact with the Fleet server. Now orbit-agent keeps trying to start in a loop but fails. Restarting the agent does not help, and kill -9 <pid orbit> doesn't help either.
r
Hey @Lili, it seems like Orbit can't find the extension socket that should be created by osquery. From the log you pasted, the socket file is
/opt/orbit/orbit-osquery.em
Could you: 1. Try deleting the file and, if that doesn't work, inspect its permissions? 2. See if there's anything in the osquery logs?
/opt/orbit/osquery_log/*
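The socket check suggested above could be sketched roughly as follows. `describe_socket` is a hypothetical helper name, not part of Orbit or osquery; only the socket path comes from the pasted log.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path exists and is a Unix socket.
describe_socket() {
    path="$1"
    if [ -S "$path" ]; then
        echo "socket present"
        ls -l "$path"    # show owner and permissions
    elif [ -e "$path" ]; then
        echo "exists but is not a socket"
        ls -l "$path"
    else
        echo "absent"
    fi
}

# Example, run as root on an affected host (path taken from the log above):
# describe_socket /opt/orbit/orbit-osquery.em
```

An "absent" result would suggest osqueryd never got far enough to create the socket, which points back at osqueryd's own startup rather than at permissions.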
l
@roberto hello! The directory
/opt/orbit/osquery_log/
is empty, and the extension file
/opt/orbit/orbit-osquery.em
does not exist:
# ls -lt /opt/orbit/orbit-osquery.em
ls: cannot access '/opt/orbit/orbit-osquery.em': No such file or directory
l
Something we can try: 1. Stop orbit:
sudo systemctl stop orbit
2. Try starting osquery manually:
# As root

/opt/orbit/bin/osqueryd/linux/5.7.0/osqueryd --pidfile=/opt/orbit/osquery.pid --database_path=/opt/orbit/osquery.db --extensions_socket=/opt/orbit/orbit-osquery.em --logger_path=/opt/orbit/osquery_log --enroll_secret_env ENROLL_SECRET --host_identifier=uuid --tls_hostname=HOSTNAME --enroll_tls_endpoint=/api/v1/osquery/enroll --config_plugin=tls --config_tls_endpoint=/api/v1/osquery/config --config_refresh=60 --disable_distributed=false --distributed_plugin=tls --distributed_tls_max_attempts=10 --distributed_tls_read_endpoint=/api/v1/osquery/distributed/read --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write --logger_plugin=tls,filesystem --logger_tls_endpoint=/api/v1/osquery/log --disable_carver=false --carver_disable_function=false --carver_start_endpoint=/api/v1/osquery/carve/begin --carver_continue_endpoint=/api/v1/osquery/carve/block --carver_block_size=2000000 --tls_server_certs /opt/orbit/certs.pem --augeas_lenses /opt/orbit/lenses --force --flagfile /opt/orbit/osquery.flags
to see whether it's hanging or crashing during startup.
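One way to tell a hang from a crash when running a command by hand is to wrap it in a time limit and look at the exit status. This is only a sketch: `classify_start` is a hypothetical helper, and it assumes GNU coreutils `timeout` (which exits with status 124 when the time limit is hit). A healthy daemon would also still be running at the limit, so the output printed during the run still has to be read.

```shell
#!/bin/sh
# Hypothetical helper: run a command for at most N seconds and report how it ended.
# Usage: classify_start SECONDS COMMAND [ARGS...]
classify_start() {
    secs="$1"
    shift
    rc=0
    timeout "$secs" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # timeout(1) returns 124 when it had to stop the command itself
        echo "still running after ${secs}s"
    elif [ "$rc" -eq 0 ]; then
        echo "exited cleanly"
    else
        echo "exited with status $rc"
    fi
}

# Example, as root, with the full osqueryd command line from step 2:
# classify_start 60 /opt/orbit/bin/osqueryd/linux/5.7.0/osqueryd ...flags as above...
```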
r
Thanks Lucas, that sounds like a great idea. We might also get some useful info from stdout.
l
@Lucas Rodriguez hello! I stopped orbit and started osquery manually. After starting osquery, I get:
W0526 09:38:55.125365 2774620 watcher.cpp:397] osqueryd worker (2774621) stopping: Memory limits exceeded: 1099509760
W0526 09:38:59.159560 2774620 watcher.cpp:435] osqueryd worker (2774621) could not be stopped. Sending kill signal.
The watchdog_memory_limit flag is set to 1024. For information: the osquery.db folder size is 21G. After removing this folder and restarting the orbit agent, everything works fine.
I made a mistake when I set buffered_log_max: 0. It seems that after losing the connection to the FleetDM server, the orbit agent (1) wrote all query results to disk, then (2) couldn't load all the buffered logs, and (3) failed on start because there was not enough memory. Thank you for the help! @roberto @Lucas Rodriguez
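For anyone hitting the same symptom, a quick size check on the database directory may help spot the problem before the agent starts failing. This is a minimal sketch: `check_db_size` is a hypothetical helper, and the 1024 MB threshold in the example is an arbitrary assumption for illustration, not a value recommended by osquery or Fleet.

```shell
#!/bin/sh
# Hypothetical helper: warn when a directory grows past a size limit in MB.
# Returns 0 if within the limit, 1 if over it, 2 if the directory is missing.
check_db_size() {
    db_path="$1"
    limit_mb="$2"
    if [ ! -d "$db_path" ]; then
        echo "missing: $db_path"
        return 2
    fi
    # du -sk prints the total size in 1024-byte units, followed by the path
    size_kb=$(du -sk "$db_path" | awk '{print $1}')
    size_mb=$((size_kb / 1024))
    if [ "$size_mb" -gt "$limit_mb" ]; then
        echo "WARN: $db_path is ${size_mb} MB (limit ${limit_mb} MB)"
        return 1
    fi
    echo "OK: $db_path is ${size_mb} MB"
    return 0
}

# Example, as root (path from this thread; threshold is an assumption):
# check_db_size /opt/orbit/osquery.db 1024
```

If the check warns, the recovery that worked in this thread was to stop orbit, remove the osquery.db folder, and start orbit again, and then to set buffered_log_max to a non-zero value so the buffer stays bounded.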