We send our osquery data to Firehose, which then is crawled by Glue and then we query it via Athena.
Osquery sends up JSON, but that JSON is a bit different for each osquery task. This makes Athena angry since Glue will crawl things and create tables how it think things look. For example the "columns" key in the JSON differes, so the generated struct won't work for all resluts osquery sends up, so Athena throws an error. We have a workaround at the moment where we just handle "columns" as a string, but then ewe of course can't search in it the same way.
Do you have any better solutions for this? Am I missing something or how do people usually do this?
02/20/2019, 4:52 PM
You can run osquery in snapshot mode, then you won’t get diffs. But, you will get more data.
I don’t know those tools, but you might be able to write something to take a diff and convert it into an add or remove.
02/20/2019, 6:29 PM
It's not those kinds of diffs though. Since there are different queries different data will be returned (eg. the osquery "heartbeat" query vs the
select * from crontab
will have different looking JSONs)
But thanks for the tips! Good to know those things either way
02/20/2019, 6:51 PM
Yes, results are returned in the schema of the query
02/20/2019, 6:55 PM
Indeed! And that makes Athena/Glue sad. I haven't seen a way to direct different schemas to different firehose streams or some other way of being able to have them nicely in Athena. The workaround we have now is to convert the "columns" to a string (in the Glue crawler config), but that limits your ability to filter etc when working in Athena
02/20/2019, 7:33 PM
I don’t really know athena or glue.
If this were sql, I’d suggest sending different queries to different tables
Or using something that supported a wide sparse schema
03/10/2020, 4:24 PM
Hi @Johan Edholm we are looking at POCing a similar solution with Athena and Glue, I wonder if you managed to get any further with this? And any learnings on the way? Thanks