artursapek • today at 8:41 PM

I run a proofreading benchmark that tests how well models can find and fix errors in English text. Each model gets several passes in a simple agent loop. Sonnet 5 is definitely better than Sonnet 4.6, but it is inferior on both quality and cost to GLM 5.1, GLM 5.2, Gemini 3.1 Flash, and Gemini 3.1 Pro. https://revise.io/errata-bench
