> melted away in the face of massive context sizes
If only. There is a huge difference between "gives good responses and can easily spot things within N context size" and "technically works but sucks within N context size". Almost all models basically become cave-people once you go beyond 50% of the "supported" context size. So while they may technically work with 1 million tokens of context, those last 500K tokens are gonna be handled massively "dumber" than the first 500K tokens.
There's at least one benchmark that attempts to measure this, but it has been running for over a year and is updated quite infrequently now:
https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...