# core
Our builds are running out of disk on the AWS runners. https://github.com/osquery/osquery/actions/workflows/self_hosted_runners.yml @Stefano Bonicatti Did you fix this last time?
No, I did not. We merged some tests to reduce the size consumed, but that was a short-term solution. The long-term solution is a slightly more substantial change: we have to change how we register tables and plugins so that, instead of relying on global-initialization side effects, registration is an explicit function call at runtime. That would let us remove a linker flag (--whole-archive) that currently prevents dropping unused code sections from some of the static libraries linked to make the final executables (the tests specifically).
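In other words, something like this minimal sketch (the helper names here are hypothetical, not osquery's actual registry API):
```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical registry helpers -- osquery's real registry API differs.
using TableGen = std::function<void()>;

static std::map<std::string, TableGen>& registry() {
  static std::map<std::string, TableGen> tables;
  return tables;
}

static void registerTable(const std::string& name, TableGen gen) {
  registry()[name] = std::move(gen);
}

static void genProcesses() { /* generate table rows */ }

// Today: registration is a side effect of a global initializer. Nothing in
// the program references this object, so when the table lives in a static
// library the linker would drop its whole object file -- unless the library
// is linked with --whole-archive, which also keeps every unused section.
namespace {
struct RegisterProcesses {
  RegisterProcesses() {
    registerTable("processes", genProcesses);
  }
} registerProcesses;
}  // namespace

// Proposed: an explicit call that main() actually references. The linker can
// then see exactly what is used, --whole-archive goes away, and unused
// sections become droppable.
static void registerAllTables() {
  registerTable("processes", genProcesses);
  // ... one explicit line per table/plugin ...
}

int main() {
  registerAllTables();
  // ... start the rest of osquery ...
  return 0;
}
```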
I was poking through the AWS account, and it looks like the disk might be 32 GB.
Yeah, and we have 26 GB of it free, so the build is definitely too big...
I noticed there was a worker that had been stuck running for months. I wonder if that contributed, somehow. But I would have expected each worker to get a fresh EBS volume.
No, I don't think so; the 26 GB reading came from a run that failed with disk out of space, so it's a fresh volume each time.
It seems reasonable to drop some of those df commands into the normal builds.
But looking at https://github.com/osquery/osquery/actions/runs/6084513342/job/16506598189
```
/dev/root        84G   65G   19G  78% /
/dev/sdb1        14G  4.1G  9.0G  31% /mnt
```
Both feel weird
Nah, that one is the standard runner; I forgot that I should run it on the AWS runner ^^'
Ah.
Although I didn't connect the dots: how is it that it fills up 26 GB of disk on the AWS runner but not 19 GB on the standard runner?
I do think it would be reasonable to toss some df/du/tree-style things into the builds. May as well always collect it pre- and post-build.
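For reference, if we ever want those numbers from inside our own tooling rather than a shell step, std::filesystem::space exposes the same data as df; a minimal sketch (the mount list is just the two mounts from the runner output above):
```cpp
#include <filesystem>
#include <iostream>

int main() {
  // Print capacity/free for the mounts seen in the runner's df output.
  for (const char* mount : {"/", "/mnt"}) {
    std::error_code ec;
    const auto info = std::filesystem::space(mount, ec);
    if (ec) {
      std::cerr << mount << ": " << ec.message() << '\n';
      continue;
    }
    std::cout << mount << ": capacity=" << (info.capacity >> 30)
              << "GiB free=" << (info.free >> 30) << "GiB\n";
  }
  return 0;
}
```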
I'll double-check locally with my M2 first to see what the expected build size is
So a local x86_64 build on Linux (RelWithDebInfo) is 20 GB
and that's only the build folder
Same for aarch64
Ok, the first strange thing is that on the x86_64 runner, cloning the source code (which should be ~4 GB) takes no space. The other thing is that on x86_64 we do a RelWithDebInfo build and use it to make the packages (since they need the symbols), but we do not build the tests there; we only build and run them on the Release and Debug builds (though with no debug symbols)
We could do that for aarch64 too, for now... so tests and debug symbols don't overlap
It's obviously an additional build
So I'm going to do a couple of tests using strategy/matrix for this; there might be an issue that prevents us from avoiding duplicated code, and I might also break the logic that stops the runners. If you'd like, I can keep an occasional eye on the runners in the future too, to prevent them from running for so long, but I don't have access to that account.
Either that, or we can also add a step which lists instances older than 3 hours via aws-cli and kills them.
I’d be happy to give you AWS access, but I’m slightly embarrassed I can’t easily do that. The AWS accounts are a horrific tangle.
Oh yes, we should totally add some kind of job that kills instances older than a couple hours. It feels like a good action to run a couple times a day.
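A minimal sketch of that cleanup, using the AWS SDK for C++ rather than the CLI (the 3-hour cutoff comes from the suggestion above; scoping this to only the CI runner instances would need an extra tag filter matching however they're actually labeled, which is an assumption here):
```cpp
#include <aws/core/Aws.h>
#include <aws/core/utils/DateTime.h>
#include <aws/ec2/EC2Client.h>
#include <aws/ec2/model/DescribeInstancesRequest.h>
#include <aws/ec2/model/Filter.h>
#include <aws/ec2/model/TerminateInstancesRequest.h>

#include <iostream>

int main() {
  Aws::SDKOptions options;
  Aws::InitAPI(options);
  {
    Aws::EC2::EC2Client ec2;

    // Only consider running instances. A tag filter restricting this to the
    // CI runners should be added here (how they're tagged is not known).
    Aws::EC2::Model::DescribeInstancesRequest describe;
    describe.AddFilters(Aws::EC2::Model::Filter()
                            .WithName("instance-state-name")
                            .WithValues({"running"}));

    const auto outcome = ec2.DescribeInstances(describe);
    if (outcome.IsSuccess()) {
      // Collect anything launched more than 3 hours ago.
      const auto nowMs = Aws::Utils::DateTime::Now().Millis();
      constexpr int64_t kMaxAgeMs = 3LL * 60 * 60 * 1000;

      Aws::EC2::Model::TerminateInstancesRequest terminate;
      bool foundStale = false;
      for (const auto& reservation : outcome.GetResult().GetReservations()) {
        for (const auto& instance : reservation.GetInstances()) {
          if (nowMs - instance.GetLaunchTime().Millis() > kMaxAgeMs) {
            terminate.AddInstanceIds(instance.GetInstanceId());
            foundStale = true;
          }
        }
      }
      if (foundStale) {
        ec2.TerminateInstances(terminate);
      }
    } else {
      std::cerr << outcome.GetError().GetMessage() << "\n";
    }
  }
  Aws::ShutdownAPI(options);
  return 0;
}
```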