I work at runloop and I've spent a considerable amount of time getting various benchmarks to run at very high concurrency (thousands at once). My experience matches yours: it takes a ton of time and effort to set up benchmarks that run at scale with protection against reward hacks.
Keeping a benchmark test harness secure and fast is non-trivial. You need to keep the grading script and the solution off the box, use network controls, deal with external resource usage, etc. It's a lot of work. I don't think it's realistic to expect benchmark authors to bulletproof their benchmark runners. Most benchmarks are written to be run conveniently on a single machine (i.e., in Docker), not in parallel across tens of thousands of secure, isolated machines.
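To illustrate the "grading script and solution off the box" point, here's a minimal sketch of one common pattern: the agent's attempt runs inside a network-isolated container that only sees the task workspace, and grading happens on the host afterward. The image name, paths, and `grade.py` script are all hypothetical placeholders, not any specific harness's API.

```shell
# Run the attempt in an isolated container: no network, and only the
# task workspace mounted. The answer key and grader never enter the box.
docker run --rm --network none \
  -v "$PWD/task:/workspace:rw" \
  benchmark-image:latest \
  bash -c "cd /workspace && ./run_attempt.sh"

# Grade from the host, against outputs the container wrote to the
# workspace. The solution stays in a directory the container can't see.
python grade.py \
  --submission ./task/output.json \
  --answer-key ./private/solution.json
```

Even this simple split already rules out a whole class of reward hacks (reading the grader, phoning home for answers), but it says nothing about resource limits, kernel-level isolation, or running tens of thousands of these at once, which is where most of the real work goes.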