This is great work by Dawn Song 's team. A huge part of botsbench.com for comparing agents & models for investigation has been in protecting against this kind of thing. As AI & agents keep getting more effective & tenacious, some of the things we've had to add protections against:
- Contamination: AI models knowing the answers out of the gate b/c pretraining on the internet and everything big teams can afford to touch. At RSAC for example, we announced Anthropic's 4.6 series is the first frontier model to have serious training set contamination on Splunk BOTS.
- Sandboxing: Agents attacking the harness, as is done here - so run the agent in a sandbox, and keep the test harness's code & answerset outside
- Isolation: Frontier agent harnesses persist memory all over the place, where work done on one question might be used to accelerate the next. To protect against that, we do fresh sandboxing per question. This is a real feature for our work in unlocking long-horizon AI for investigations, so stay tuned for what's happening here :)
"You cannot improve what you cannot measure" - Lord Kelvin