This is a bad paper.
Benchmarking is hard to do properly. It isn't helped when people claim that exploiting the environment is some kind of flaw.
It's not. Any time you see unexpected results running a benchmark, you need to inspect what the model is actually doing.
I recently built a yet-to-be-released benchmark where the "hard" level pushes frontier models extremely hard: Opus scores around 40%, Gemini around 60%, and GPT 5.4 around... 0%.
I inspected the traces, and it turned out GPT was looking at the task, saying "I must be honest - I can't solve this task reliably", and refusing it.
> Navigating Chromium to a file:// URL reads the gold answer directly from the task config — giving ~100% on all 812 WebArena tasks.
I mean... yes? Make sure it doesn't do this?
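For what it's worth, the fix is usually a few lines in the harness. A minimal sketch (names and structure are my own assumptions, not WebArena's actual API): an allowlist check wrapped around whatever navigate/goto action the agent is given, so `file://` URLs can never reach the browser and leak the gold answers from disk.

```python
from urllib.parse import urlparse

# Schemes the harness permits the agent to navigate to. Blocking
# file:// (and everything else local) stops the agent from reading
# the task config / gold answers off the evaluator's filesystem.
ALLOWED_SCHEMES = {"http", "https"}

def is_navigation_allowed(url: str) -> bool:
    """Return True only for URLs whose scheme is explicitly allowed."""
    scheme = urlparse(url).scheme.lower()
    return scheme in ALLOWED_SCHEMES

# Hypothetical usage: call this before executing the agent's
# navigation action, and reject the action (not the whole task)
# when it returns False.
```

The key design choice is an allowlist rather than a denylist: blocking only `file://` still leaves `view-source:`, `data:`, `about:`, and similar schemes as leak paths.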