They stacked the deck. If v2 was still rule inference + spatial reasoning, a bit like juiced-up Raven's Progressive Matrices, then v3 adds a whole new multi-turn explore/exploit agentic dimension on top of it.
Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that won't last.