we've been seeing segfaults in osquery since we ro...
# general
r
we've been seeing segfaults in osquery since we rolled out v5.12.2. they affect a pretty small number of linux devices (~50ish), but happen fairly regularly -- on many of these devices, about once an hour. has anyone else seen similar, or could anyone point me in the right direction for troubleshooting? I will 🧵 details
issue originally reported here: https://github.com/kolide/launcher/issues/1773 notable that the devices affected are largely but not entirely rhel/fedora.
I got a core dump from an affected device, but had only made it this far with gdb using the osquery debuginfo rpm:
Copy code
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005652aa357892 in osquery::tables::RpmEnvironmentManager::Callback (rec=<optimized out>, data=0x7ff4acbfb340)
    at /__w/osquery/osquery/workspace/padding-required-by-rpm-packages/src/osquery/tables/system/linux/rpm_packages.cpp:112
112	/__w/osquery/osquery/workspace/padding-required-by-rpm-packages/src/osquery/tables/system/linux/rpm_packages.cpp: No such file or directory.
s
Hello @Rebecca Mahany-Horton, if you could launch the command:
gdb <path to osqueryd> <path to coredump> -ex "thread apply all bt full" -ex quit | tee crash.log
and upload the crash.log to an issue, we can have a look
r
thanks, let me give it a shot
ok, generated with
gdb /usr/lib/debug/opt/osquery/bin/osqueryd-5.12.2-1.linux.x86_64.debug ./osqueryd.coredump-lapis -ex "thread apply all bt full" -ex quit | tee crash.log
. this is not the machine that the coredump occurred on so i am not sure how helpful it is
s
yeah I'm not sure what's wrong with the logging. For instance renaming the
rpmdb.sqlite
causes those lines to be called, but the log is printed correctly and no crash happens. I think this has to be reproduced locally/manually to understand what are the conditions. Although the question is if you have any
RPM notice:
messages in your logs that might help understand what was (part of) the issue maybe.
obviously if it's a single message that causes the crash, you won't see it being logged.
also a very wild guess is to check if those machines have a scheduled query and a distributed query being run somewhat at the same time, both on
rpm_packages
👀 1
r
there were some RPM-related logs, let me grab those
right before all of these segfaults, i see first
glog_logger.cpp:49] RPM notice: Failed to read auxiliary vector, /proc not mounted?
and then 0 to a lot of
glog_logger.cpp:49] Could not get RPM header flag.
logs
this was fairly consistent across all the devices i checked -- maybe ~10 devices
i will look into the theory about a scheduled + distributed query running around the same time on
rpm_packages
and let you know what i find 🤞
s
also a very wild guess is to check if those machines have a scheduled query and a distributed query being run somewhat at the same time, both on rpm_packages
I doubt this. What are you thinking about? If it’s a lock somewhere in rpm, what if it’s a lock outside osquery?
s
I was more thinking about some memory corruption happening, but it was a wild guess for multiple reasons (given also that other places should've crashed). Also maybe the question is, is that always the place where it crashes? Another thing would be looking at the asm of the last instruction, because it's not clear just looking at the source code
r
that's the only coredump i've looked at so far, but i have access to two more from two other machines. i will check to see if those also show the same.
(will take a min)
s
We don’t have perfect information, but: • at least one device seems to be crashing every time the schedule query runs • other ones are less frequent memory corruption seems possible. It would crash every time
r
i think re the first bullet point, it's actually a higher percentage of devices that crash every time the schedule query runs -- at least the 5 noisiest devices are crashing every hour, plus the 3 internal ones that i checked yesterday are crashing every hour. (every hour that the device is online, that is)
s
could you try to do a similar command as before, but substitute the previous
-ex
commands with
-ex disas -ex quit
and log that to another file. That dumps the asm.. it should show better where exactly in the operation happening it's crashing.
r
sure thing, just one minute!
s
sorry was almost forgetting, a
-ex "info registers"
before those two would be helpful too, so that we know the values being used.
r
💯 got it
this one corresponds to the previous crash.log
these are output of those same commands run against a different coredump
same deal, third device coredump from another device for good measure.
s
The disas itself did not help, but it seems that the crash is happening when calling the virtual function vlog, so literally the pointer to that function being followed is messed up (it does have a weird address). I wonder if you have queries to
rpm_packages
that use
pid_with_namespace
. It could also very well be a memory corruption, of that pointer.
r
let me take a look!
hm, no, not for the two devices i just spot-checked. they have queries to
deb_packages
that use
pid_with_namespace
, but no other use of
pid_with_namespace
.
l
Hi folks! We have reproduced this issue in Fedora 40 with the following query:
select 1 from rpm_packages rp, os_version ov where rp.name = "foo-fedora-playbooks" AND ov.name = "Fedora Linux";
. It happens ~3/5 times I execute such query on such system.
Copy code
Jul 23 17:53:20 fedora orbit[6140]: {"node_key":"eYRzBJB8B/uvb8Og4UmSbcWVuQKycyET"}
Jul 23 17:53:20 fedora orbit[6140]: {
Jul 23 17:53:20 fedora orbit[6140]:  "queries": {
Jul 23 17:53:20 fedora orbit[6140]:    "fleet_distributed_query_81": "select 1 from rpm_packages rp, os_version ov where rp.name = \"foo-fedora-playbooks\" AND ov.name = \"Fedora Linux\";"
Jul 23 17:53:20 fedora orbit[6140]:  },
Jul 23 17:53:20 fedora orbit[6140]:  "discovery": {
Jul 23 17:53:20 fedora orbit[6140]:    "fleet_distributed_query_81": "SELECT 1"
Jul 23 17:53:20 fedora orbit[6140]:  }
Jul 23 17:53:20 fedora orbit[6140]: }
Jul 23 17:53:20 fedora orbit[6140]: I0723 17:53:20.171214  6336 distributed.cpp:173] Executing distributed query: fleet_distributed_query_81: select 1 from rpm_packages rp, os_version ov where rp.name = "foo-fedora-playbooks" AND ov.name = "Fedora Linux";
Jul 23 17:53:20 fedora audit[6140]: ANOM_ABEND auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=6140 comm="DistributedRunn" exe="/opt/orbit/bin/osqueryd/linux/stable/osqueryd" sig=11 res=1
Jul 23 17:53:20 fedora kernel: traps: DistributedRunn[6336] general protection fault ip:7fc0d605ce5c sp:7fc0c69f9f68 error:0 in libc.so.6[7fc0d5f09000+169000]
Jul 23 17:53:20 fedora audit: BPF prog-id=220 op=LOAD
Jul 23 17:53:20 fedora audit: BPF prog-id=221 op=LOAD
Jul 23 17:53:20 fedora audit: BPF prog-id=222 op=LOAD
Jul 23 17:53:20 fedora systemd[1]: Started systemd-coredump@4-6364-0.service - Process Core Dump (PID 6364/UID 0).
Jul 23 17:53:20 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-coredump@4-6364-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 23 17:53:21 fedora abrt-dump-journal-oops[1085]: abrt-dump-journal-oops: Found oopses: 1
Jul 23 17:53:21 fedora abrt-dump-journal-oops[1085]: abrt-dump-journal-oops: Creating problem directories
Jul 23 17:53:21 fedora systemd-coredump[6365]: Process 6140 (osqueryd) of user 0 dumped core.#012#012Module libnss_myhostname.so.2
Jul 23 17:53:21 fedora systemd-coredump[6365]: Process 6140 (osqueryd) of user 0 dumped core.#012#012Module libnss_myhostname.so.2 from rpm systemd-255.4-1.fc40.x86_64#012Module libcap.so.2 from rpm libcap-2.69-3.fc40.x86_64#012Module libnss_resolve.so.2 from rpm systemd-255.4-1.fc40.x86_64#012Module libnss_mdns4_minimal.so.2 from rpm nss-mdns-0.15.1-11.fc40.x86_64#012Stack trace of thread 6336:#012#0  0x00007fc0d605ce5c __strlen_evex (libc.so.6 + 0x17be5c)#012#1  0x00007fc0d5f474e8 __printf_buffer (libc.so.6 + 0x664e8)#012#2  0x00007fc0d5f6b82f __vsnprintf (libc.so.6 + 0x8a82f)#012#3  0x00005579b2533034 rpmlog (osqueryd + 0x272b034)#012#4  0x00005579b24fb286 errCb (osqueryd + 0x26f3286)#012#5  0x00005579b40e692c sqlite3_log (osqueryd + 0x42de92c)#012#6  0x00005579b414607f resolveExprStep (osqueryd + 0x433e07f)#012#7  0x00005579b4140c79 walkExpr (osqueryd + 0x4338c79)#012#8  0x00005579b4140c94 walkExpr (osqueryd + 0x4338c94)#012#9  0x00005579b41446d1 sqlite3ResolveExprNames (osqueryd + 0x433c6d1)#012#10 0x00005579b4146337 resolveSelectStep (osqueryd + 0x433e337)#012#11 0x00005579b4140d5f sqlite3WalkSelect (osqueryd + 0x4338d5f)#012#12 0x00005579b41471ff sqlite3ResolveSelectNames (osqueryd + 0x433f1ff)#012#13 0x00005579b4147184 sqlite3SelectPrep (osqueryd + 0x433f184)#012#14 0x00005579b412cabc sqlite3Select (osqueryd + 0x4324abc)#012#15 0x00005579b41258a1 yy_reduce (osqueryd + 0x431d8a1)#012#16 0x00005579b40f1e42 sqlite3RunParser (osqueryd + 0x42e9e42)#012#17 0x00005579b411f338 sqlite3Prepare (osqueryd + 0x4317338)#012#18 0x00005579b40f1120 sqlite3LockAndPrepare (osqueryd + 0x42e9120)#012#19 0x00005579b40e80de sqlite3_prepare_v2 (osqueryd + 0x42e00de)#012#20 0x00005579b1d44026 _ZN7osquery23getQueryColumnsInternalERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEERNS0_6vectorINS0_5tupleIJS6_NS_10ColumnTypeENS_13ColumnOptionsEEEENS4_ISD_EEEERKNS0_10shared_ptrINS_16SQLiteDBInstanceEEE (osqueryd + 0x1f3c026)#012#21 0x00005579b1d43f4c _ZNK7osquery15SQLiteSQLPlugin15getQueryColumnsERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEERNS1_6vectorINS1_5tupleIJS7_NS_10ColumnTypeENS_13ColumnOptionsEEEENS5_ISE_EEEE (osqueryd + 0x1f3bf4c)#012#22 0x00005579b1d363cb _ZN7osquery9SQLPlugin4callERKNSt3__13mapINS1_12basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEES8_NS1_4lessIS8_EENS6_INS1_4pairIKS8_S8_EEEEEERNS1_6vectorISF_NS6_ISF_EEEE (osqueryd + 0x1f2e3cb)#012#23 0x00005579b40add42 _ZN7osquery17RegistryInterface4callERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEERKNS1_3mapIS7_S7_NS1_4lessIS7_EENS5_INS1_4pairIS8_S7_EEEEEERNS1_6vectorISG_NS5_ISG_EEEE (osqueryd + 0x42a5d42)#012#24 0x00005579b40abfd1 _ZN7osquery15RegistryFactory4callERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEES9_RKNS1_3mapIS7_S7_NS1_4lessIS7_EENS5_INS1_4pairIS8_S7_EEEEEERNS1_6vectorISG_NS5_ISG_EEEE (osqueryd + 0x42a3fd1)#012#25 0x00005579b1d354c4 _ZN7osquery15getQueryColumnsERKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEERNS0_6vectorINS0_5tupleIJS6_NS_10ColumnTypeENS_13ColumnOptionsEEEENS4_ISD_EEEE (osqueryd + 0x1f2d4c4)#012#26 0x00005579b1d35301 _ZN7osquery3SQLC2ERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEb (osqueryd + 0x1f2d301)#012#27 0x00005579b1d97f64 _ZN7osquery11Distributed17monitorNonnumericERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEES9_ (osqueryd + 0x1f8ff64)#012#28 0x00005579b1d97813 _ZN7osquery11Distributed10runQueriesEv (osqueryd + 0x1f8f813)#012#29 0x00005579b1ebc8e4 _ZN7osquery17DistributedRunner5startEv (osqueryd + 0x20b48e4)#012#30 0x00005579b40a8309 _ZN7osquery16InternalRunnable3runEv (osqueryd + 0x42a0309)#012#31 0x00005579b40a929d _ZNSt3__18__invokeIRMN7osquery16InternalRunnableEFvvERPS2_JEvEEDTcldsdeclsr3std3__1E7forwardIT0_Efp0_Efp_spclsr3std3__1E7forwardIT1_Efp1_EEEOT_OS8_DpOS9_ (osqueryd + 0x42a129d)#012#32 0x00005579b40a925d _ZNSt3__115__apply_functorIMN7osquery16InternalRunnableEFvvENS_5tupleIJPS2_EEEJLm0EENS5_IJEEEEENS_13__bind_returnIT_T0_T2_Xsr22__is_valid_bind_returnISA_SB_SC_EE5valueEE4typeERSA_RSB_NS_15__tuple_indicesIJXspT1_EEEEOSC_ (osqueryd + 0x42a125d)#012#33 0x00005579b40a91e1 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEENS_6__bindIMN7osquery16InternalRunnableEFvvEJPS9_EEEEEEEEPvSF_ (osqueryd + 0x42a11e1)#012#34 0x00007fc0d5f771b7 start_thread (libc.so.6 + 0x961b7)#012#35 0x00007fc0d5ff939c __clone3 (libc.so.6 + 0x11839c)#012#012Stack trace of thread 6285:#012#0  0x00007fc0d5f73919 __futex_abstimed_wait_common (libc.so.6 + 0x92919)#012#1  0x00007fc0d5f76652 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x95652)#012#2  0x00005579b4bae18c _ZNSt3__118condition_variable15__do_timed_waitERNS_11unique_lockINS_5mutexEEENS_6chrono10time_pointINS5_12system_clockENS5_8durationIxNS_5ratioILl1ELl1000000000EEEEEEE (osqueryd + 0x4da618c)#012#3  0x00005579b40a8273 _ZNSt3__118condition_variable8wait_forIxNS_5ratioILl1ELl1000EEEEENS_9cv_statusERNS_11unique_lockINS_5mutexEEERKNS_6chrono8durationIT_T0_EE (osqueryd + 0x42a0273)#012#4  0x00005579b40a81a6 _ZN7osquery21InterruptibleRunnable5pauseENSt3__16chrono8durationIxNS1_5ratioILl1ELl1000EEEEE (osqueryd + 0x42a01a6)#012#5  0x00005579b1d7eda5 _ZN7osquery19ConfigRefreshRunner5startEv (osqueryd + 0x1f76da5)#012#6  0x00005579b40a8309 _ZN7osquery16InternalRunnable3runEv (osqueryd + 0x42a0309)#012#7  0x00005579b40a929d _ZNSt3__18__invokeIRMN7osquery16InternalRunnableEFvvERPS2_JEvEEDTcldsdeclsr3std3__1E7forwardIT0_Efp0_Efp_spclsr3std3__1E7forwardIT1_Efp1_EEEOT_OS8_DpOS9_ (osqueryd + 0x42a129d)#012#8  0x00005579b40a925d _ZNSt3__115__apply_functorIMN7osquery16InternalRunnableEFvvENS_5tupleIJPS2_EEEJLm0EENS5_IJEEEEENS_13__bind_returnIT_T0_T2_Xsr22__is_valid_bind_returnISA_SB_SC_EE5valueEE4typeERSA_RSB_NS_15__tuple_indicesIJXspT1_EEEEOSC_ (osqueryd + 0x42a125d)#012#9  0x00005579b40a91e1 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEENS_6__bindIMN7osquery16InternalRunnableEFvvEJPS9_EEEEEEEEPvSF_ (osqueryd + 0x42a11e1)#012#10 0x00007fc0d5f771b7 start_thread (libc.so.6 + 0x961b7)#012#11 0x00007fc0d5ff939c __clone3 (libc.so.6 + 0x11839c)#012#012Stack trace of thread 6141:#012#0  0x00007fc0d5f73919 __futex_abstimed_wait_common (libc.so.6 + 0x92919)#012#1  0x00007fc0d5f76652 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x95652)#012#2  0x00005579b4bae18c _ZNSt3__118condition_variable15__do_timed_waitERNS_11unique_lockINS_5mutexEEENS_6chrono10time_pointINS5_12system_clockENS5_8durationIxNS_5ratioILl1ELl1000000000EEEEEEE (osqueryd + 0x4da618c)#012#3  0x00005579b40a8273 _ZNSt3__118condition_variable8wait_forIxNS_5ratioILl1ELl1000EEEEENS_9cv_statusERNS_11unique_lockINS_5mutexEEERKNS_6chrono8durationIT_T0_EE (osqueryd + 0x42a0273)#012#4  0x00005579b40a81a6 _ZN7osquery21InterruptibleRunnable5pauseENSt3__16chrono8durationIxNS1_5ratioILl1ELl1000EEEEE (osqueryd + 0x42a01a6)#012#5  0x00005579b25c8115 _ZN7osquery20WatcherWatcherRunner5startEv (osqueryd + 0x27c0115)#012#6  0x00005579b40a8309 _ZN7osquery16InternalRunnable3runEv (osqueryd + 0x42a0309)#012#7  0x00005579b40a929d _ZNSt3__18__invokeIRMN7osquery16InternalRunnableEFvvERPS2_JEvEEDTcldsdeclsr3std3__1E7forwardIT0_Efp0_Efp_spclsr3std3__1E7forwardIT1_Efp1_EEEOT_OS8_DpOS9_ (osqueryd + 0x42a129d)#012#8  0x00005579b40a925d _ZNSt3__115__apply_functorIMN7osquery16InternalRunnableEFvvENS_5tupleIJPS2_EEEJLm0EENS5_IJEEEEENS_13__bind_returnIT_T0_T2_Xsr22__is_valid_bind_returnISA_SB_SC_EE5valueEE4typeERSA_RSB_NS_15__tuple_indicesIJXspT1_EEEEOSC_ (osqueryd + 0x42a125d)#012#9  0x00005579b40a91e1 _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEENS_6__bindIMN7osquery16InternalRunnableEFvvEJPS9_EEEEEEEEPvSF_ (osqueryd + 0x42a11e1)#012#10 0x00007fc0d5f771b7 start_thread (libc.so.6 + 0x961b7)#012#11 0x00007fc0d5ff939c __clone3 (libc.so.6 + 0x11839c)#012#012Stack trace of thread 6144:#012#0  0x00007fc0d5f73919 __futex_abstimed_wait_common (libc.so.6 + 0x92919)#012#1  0x00007fc0d5f762e9 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x952e9)#012#2  0x00005579b4
Jul 23 17:53:21 fedora audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-coredump@4-6364-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
🙌 2
s
I'm able to trigger the crash too
🙌 1
So it seems like the issue is that librpm internally uses sqlite to open its database and reroutes the sqlite3_log function to its own internal functions, with in turn will call the callback that we set in rpm_packages. I normally see the crash happening after the second query, and I can just make it crash with
select * from rpm_packages where name = "foo"
. It crashes when the sqlite logging is called, which then tries to call the logging functions of librpm, after is deinitialized in the table. librpm should be prevented from changing sqlite logging function, since unfortunately it's a global option and affects all parts of osquery.
Although hum, the bug I found actually happens only on the shell, interesting
In any case, the core dump I recall was inside the logging callback that we setup for the rpm errors inside the rpm table, so it's likely that the sqlite logging coming after librpm has being deinitialized is still the issue.
(and oups it's also shown in the trace you had above)
l
Hey folks, I've opened https://github.com/osquery/osquery/issues/8384 to track this issue.
Great findings Stefano!
Should we try updating to https://rpm.org/wiki/Releases/4.18.2? It mentions the following bugfix:
Don't muck with per-process global sqlite configuration from the db backend
s
oh and good finding ^
l
Copy code
+sqlite3_config() affects all in-process uses of sqlite. librpm being a
+low-level library, it has no business whatsoever making such decisions
+for the applications running on top of it. Besides that, the callback can
+easily end up pointing to an already closed database, causing an
+innocent API user to crash in librpm on an entirely unrelated error on
+some other database. "Oops."
+
+The sqlite API doesn't seem to provide any per-db or non-global context
+for logging errors, thus we can only remove the call and let sqlite output
+errors the way it pleases (print through stderr, presumably).
https://patchwork.yoctoproject.org/project/oe-core/patch/20230703065909.45555-1-anuj.mittal@intel.com/
💯 1
s
yeah thinking about it this should've likely broken a lot of apps, so an upstream fix was warranted.
âž• 1
l
Do you have instructions on how to upgrade librpm?
Maybe I can give it a go (next week after some PTO)
Though feel free to do it yourself it will probably be faster 🙂
s
It's always quite involved and not fully documented because it all depends on what changed. Very high level is, (now) take CentOS 7, install autoconf/automake, take the osquery-toolchain, clone their library repo, and try to make their build system configure and build with it. Then take the configuration artifacts, move them to osquery. While building look at all the flags and which files are getting compiled. Update the CMake file accordingly. Keep some notes in the README file next to the folder. Said that, yeah, I was going to say that I have time to look into it tomorrow.
l
That does look involved 🙂
Said that, yeah, I was going to say that I have time to look into it tomorrow.
Cool, thanks! Happy to re-test when I get back but probably not necessary since you were able to reproduce