My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:
https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v
The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).
I've been tasking LLMs to write a traditional AI for a full vibe-coded RTS. I remove the human players and let them battle. I don't know why but I enjoy watching AI players battle so much :)
In the repo, I even have a tournament script that calculates Elo ratings. So far, Codex has been unmatched. I'll try with Opus 4.8 too.
https://egeozcan.github.io/unnamed_rts/game/
https://github.com/egeozcan/unnamed_rts/blob/main/src/script...
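For anyone curious what an Elo-calculating tournament script does in principle, here is a minimal sketch of the standard Elo update. It is not taken from the linked repo: the function names, the starting rating of 1500, and the K-factor of 32 are all assumptions chosen for illustration, and the player names in the usage note are placeholders.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return new ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (assumed 32 here) controls how fast ratings move.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


def run_tournament(results: Iterable[Tuple[str, str, float]],
                   initial: float = 1500.0,
                   k: float = 32.0) -> Dict[str, float]:
    """Apply the Elo update over (player_a, player_b, score_a) records."""
    ratings: Dict[str, float] = {}
    for a, b, score_a in results:
        ra = ratings.get(a, initial)
        rb = ratings.get(b, initial)
        ratings[a], ratings[b] = update_elo(ra, rb, score_a, k)
    return ratings
```

Usage would look something like `run_tournament([("ai_one", "ai_two", 1.0), ("ai_two", "ai_one", 0.5)])`, where each tuple is one finished match between two AI players. Ratings depend on match order in this simple form, so a real tournament script may shuffle or iterate the results.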
I wonder if your previous prompts were part of the new RL fine-tuning, and that's why it is now better at this specific question.
It almost appears as if the code was minified. The variable names are short and the formatting looks like it's written to minimize whitespace. Did it write it in this compact format all on its own?
I like that benchmark. You should throw the results up on GitHub Pages so people can try out the games.
How many times did you try? The same model run multiple times can produce both very good and very bad results. In my benchmark, even 10 runs are often not enough to tell for sure whether one model is better than another.
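One rough way to check whether a gap between two models is more than run-to-run noise is a permutation test on the per-run scores. The sketch below assumes each run has already been reduced to a single numeric score (how you score an RTS game is up to you); the function name, the resample count, and the seed are all made up for illustration.

```python
import random
from typing import Sequence


def permutation_p_value(scores_a: Sequence[float],
                        scores_b: Sequence[float],
                        n_resamples: int = 10_000,
                        seed: int = 0) -> float:
    """Two-sided permutation test on the difference of mean scores.

    Returns the fraction of random relabelings whose mean gap is at
    least as large as the observed one. A large value means the
    observed gap is easily explained by run-to-run variance.
    """
    def mean(xs: Sequence[float]) -> float:
        return sum(xs) / len(xs)

    rng = random.Random(seed)  # fixed seed so repeated calls agree
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)

    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

You would call it with two lists of scores, one per model, for example ten runs each. With only ten noisy runs per model the returned value often stays high even when the means differ, which is the point made above.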
Nice, I recently found something like this was possible too. GPT-5.5 one-shotted the basic game, but then I added some AI-generated graphics/sounds/music and asked it to wire them up.
It's a vocab building game, playable here (desktop only): https://rupertlinacre.com/vocab_annihilation/
It kind of blows my mind that I can go from 'I want a fun way to help him learn vocabulary, and I loved Total Annihilation as a kid' to 'here's a game that he finds genuinely fun and that helps him learn something' in a few prompts.
Thanks for also sharing the prompt. I've been testing Claude by asking it to make similar things, so it's useful to see what other people are doing.
I do find it interesting that the visual style is pretty similar to things it's produced for me.
Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.
Is there some sort of a leaderboard for this test? Like if you'd give each of Opus 4.8 and GPT 5.5 a score out of 100, what would the scores be?
Wow, that's impressive. Had fun playing it for 10 minutes locally. Found myself wanting to discover an enemy base :)
Kinda buggy, but impressive nonetheless. How long did it take?
For some reason that website is showing up as high risk and I cannot view it; I had to open it on my mobile phone.
It looks quite impressive. I don't use Claude currently, but I'm hearing good things about it... from Codex users, ironically.
Wow, that looks really impressive. Both the UI and the content look good. The game is a bit buggy but still nice!
Okay now have it implement an authoritative server with reliable netcode and reconnection/disconnection logic, lobbies, and finding games, in-game chat, synchronized state around starting and ending games, resignations and such
How much did it cost?
Played it to the end. Pretty neat!
wow
I am absolutely gobsmacked by how good the game is! I didn't complete the level fully, but I completed all but one of the tasks. It is both smooth and fun, and I'm surprised that a modern LLM can do something this well, let alone in a single file. It makes me realize how much the goalposts have moved. A few years ago, the models (ChatGPT 2? 2.5?) weren't even able to implement a small Python script that I would expect a junior engineer to be capable of producing. Now we're getting tools that can do something like this. You should think about how to "rate" the outputs, or at least provide your own rankings.