
Arena AI Model ELO History

55 points by mayerwin today at 3:19 AM | 39 comments

Hi HN,

I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.

We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.

Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a lot of iterations to get the chart to look nice on mobile as well. Optional dark mode included.
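
In case it helps to see the selection logic concretely, it boils down to roughly this (a simplified sketch, not the exact code from the repo; the file name, column names, and pandas usage are illustrative):

    import pandas as pd

    # One row per (date, lab, model) with that model's Elo on that date.
    df = pd.read_csv("arena_history.csv", parse_dates=["date"])

    # For each lab and date, keep only its highest-rated model:
    # that single value is the flagship point plotted for that day.
    flagship = (
        df.sort_values("elo", ascending=False)
          .groupby(["lab", "date"], as_index=False)
          .first()
    )

    # One continuous series per lab, ready to plot.
    curves = flagship.pivot(index="date", columns="lab", values="elo").ffill()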

However, I have a specific data blindspot that I'm hoping this community might have insights on.

Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts and safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.

Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?

I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback or pointers to datasets!


Comments

whiplash451 today at 8:41 AM

Neat. Would you add the option to normalize the Elo over time (e.g. update the model used as an anchor for the Elo computation) so the diff between labs is more visible?
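
Roughly what I have in mind (a sketch only; the anchor model and the data layout are hypothetical):

    import pandas as pd

    ANCHOR = "gpt-4-0314"  # hypothetical fixed reference model

    df = pd.read_csv("arena_history.csv", parse_dates=["date"])

    # Rating of the anchor model on each date.
    anchor = df[df["model"] == ANCHOR].set_index("date")["elo"]

    # Express every model's rating as an offset from the anchor on that date,
    # so drift in the overall rating pool is factored out of the curves.
    df["elo_vs_anchor"] = df["elo"] - df["date"].map(anchor)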

underyx today at 5:06 AM

> the slow performance decays

The decays are just newer, more capable models entering the population, making all prior models lose more frequently.

ponyous today at 8:24 AM

Seems like Chinese labs are the only ones that are trustworthy (at least when it comes to this specific issue). This feels so ironic haha

tedsanders today at 5:46 AM

For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.

fph today at 8:32 AM

Very neat! It would be great to extend it to non-flagship models as well.

cherioo today at 6:50 AM

The interesting thing I find is how Anthropic has been more consistently improving over the last few years, which has allowed it to catch up with and surpass OpenAI and Google. The latter two have pretty much plateaued over the last year or so. GPT 5.5 is somehow not moving the needle at all.

I hope to see the other labs bring back competition soon!

kimjune01 today at 7:23 AM

Although Arena is adversarial and resistant to goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthiness.

jdw64 today at 7:19 AM

This is great, but personally, I really wish we had an Elo leaderboard specifically for the quality of coding agents.

Honestly, in my opinion, GPT-5.5 Codex doesn't just crush Claude Code 4.7 Opus - it's writing code at a level so advanced that I sometimes struggle to even fully comprehend it. Even when navigating fairly massive codebases spanning four different languages and regions (US, China, Korea, and Japan), Codex's performance is simply overwhelming.

How would we even go about properly measuring and benchmarking the Elo for autonomous agents like this?

eis today at 5:38 AM

The Elo rating system measures performance relative to the other models. As other models improve, or rather as newer, better models enter the list, the Elo score of a given existing model will tend to decrease even if there are no changes whatsoever to the model or its system prompt.

You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
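
To make that concrete, here is a toy illustration with the standard Elo update (the K-factor and ratings are arbitrary): a model whose behavior never changes still bleeds rating once a stronger entrant joins the pool and starts beating it.

    def expected(r_a: float, r_b: float) -> float:
        # Standard Elo expected score for A against B.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
        return r_a + k * (score_a - expected(r_a, r_b))

    old_model = 1200.0
    newcomer = 1300.0  # a genuinely stronger new entrant

    # The old model hasn't changed at all, but losing repeatedly to the
    # newcomer drags its rating down anyway.
    for _ in range(20):
        old_model = update(old_model, newcomer, score_a=0.0)  # assume a loss
    print(round(old_model))  # noticeably below 1200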

tedsanders today at 5:43 AM

FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.

Thomashuet today at 6:52 AM

It seems to be a US-only thing; Chinese models and Mistral don't show any downward trend.

refulgentis today at 6:04 AM

Is this slop? It has wildly aggressive language that agrees with a subset of pop sentiment, re: models being “nerfed”. It promises to reveal this nerfing. Then, it goes on to…provide an innocuous mapping of LM Arena scores that always go up?