senko • yesterday at 6:34 PM • 19 replies

My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:

https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v

The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

Replies

seidleroni • today at 2:28 PM

I am absolutely gobsmacked at how good the game is! I didn't complete the level fully, but I completed all but one of the tasks. This is both smooth and fun, and I'm surprised that a modern LLM can do something this well, let alone in a single file. It makes me realize how much the goalposts have been moved. A few years ago, ChatGPT (2? 2.5?) wasn't even able to implement a small Python script that I would expect a junior engineer to be capable of producing. Now we're getting the tools to do something like this. You should think about how to "rate" the outputs, or at least provide your own rankings.

egeozcan • today at 4:52 AM

I've been tasking LLMs to write a traditional AI for a full vibe-coded RTS. I remove the human players and let them battle. I don't know why but I enjoy watching AI players battle so much :)

In the repo, I even have a tournament script that calculates Elo ratings. So far, Codex was unmatched. I'll try with Opus 4.8 too.

https://egeozcan.github.io/unnamed_rts/game/

https://github.com/egeozcan/unnamed_rts/blob/main/src/script...
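
For readers unfamiliar with Elo, here is a minimal sketch of the standard rating update that a tournament script like the one mentioned above might apply after each match. This is not the code from the linked repo; the function names, the starting rating of 1500, the K-factor of 32, and the AI names are all assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score of A against B (standard Elo formula)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k=32 is an assumed K-factor, not a value taken from the repo.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Hypothetical two-player tournament: (player_a, player_b, score_a).
ratings = {"ai_alpha": 1500.0, "ai_beta": 1500.0}
results = [("ai_alpha", "ai_beta", 1.0), ("ai_alpha", "ai_beta", 0.5)]

for player_a, player_b, score_a in results:
    ratings[player_a], ratings[player_b] = update_elo(
        ratings[player_a], ratings[player_b], score_a
    )
```

Because each update is zero-sum, the ratings only become meaningful after many matches between many pairs of AI players.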

calebgcc • today at 6:46 AM

I wonder if your previous prompts were part of the new RL fine-tuning, and that’s why it is now better at this specific question.

jclay • yesterday at 7:00 PM

It almost appears as if the code was minified. The variable names are short and the formatting looks like it's written to minimize whitespace. Did it write it in this compact format all on its own?

apitman • yesterday at 7:19 PM

I like that benchmark. You should throw the results up on GitHub pages so people can try out the games.

skolos • today at 5:03 AM

How many times did you try? The same model run multiple times can produce both very good and very bad results. In my benchmark, even 10 runs are often not enough to tell for sure whether one model is better than another.
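
One way to make the point about run-to-run variance concrete is a bootstrap confidence interval on the difference in mean score between two models. The sketch below is only an illustration under assumptions: the function name, the scoring scale, and the idea of giving each run a numeric score are hypothetical and not something described in this thread.

```python
import random
from statistics import mean


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (observed_diff, ci_low, ci_high) for mean(scores_a) - mean(scores_b).

    Each list holds one numeric score per independent run of a model.
    Runs are resampled with replacement to estimate how much the
    difference in means could vary by chance alone.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    ci_low = diffs[int((alpha / 2) * n_resamples)]
    ci_high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores_a) - mean(scores_b), ci_low, ci_high
```

If the interval returned for two sets of runs includes zero, the runs collected so far do not clearly separate the two models, which matches the observation that 10 runs can be too few.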

RobinL • today at 5:47 AM

Nice, I recently found something like this was possible too. GPT-5.5 one-shotted the basic game, but then I added some AI-generated graphics/sounds/music and asked it to wire them up.

It's a vocab building game, playable here (desktop only): https://rupertlinacre.com/vocab_annihilation/

It kind of blows my mind that I can go from 'I want a fun way to help him learn vocabulary, and I loved Total Annihilation as a kid' to 'here's a game that he finds genuinely fun and that helps him learn something' in a few prompts.

H3X_K1TT3N • yesterday at 10:23 PM

Thanks for also sharing the prompt. I've been testing Claude by asking it to make similar things, so it's useful to see what other people are doing.

I do find it interesting that the visual style is pretty similar to things it's produced for me.

elAhmo • yesterday at 7:47 PM

What is ultracode mode?

digdugdirk • yesterday at 8:00 PM

Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.

ammar_x • yesterday at 10:47 PM

Is there some sort of a leaderboard for this test? Like if you'd give each of Opus 4.8 and GPT 5.5 a score out of 100, what would the scores be?

jmtame • today at 12:34 AM

Wow, that's impressive. Had fun playing it for 10 minutes locally. Found myself wanting to discover an enemy base :)

jryan49 • yesterday at 8:06 PM

Kinda buggy, but impressive nonetheless. How long did it take?

zuzululu • today at 7:25 AM

For some reason that website is showing up as high risk and I cannot view it; I had to open it on my mobile phone.

It looks quite impressive. I don't use Claude currently, but I'm hearing good things about it... from Codex users, ironically.

fireant • today at 3:25 AM

Wow, that looks really impressive. Both the UI and the content look good. The game is a bit buggy but still nice!

Madmallard • today at 1:46 AM

Okay, now have it implement an authoritative server with reliable netcode and reconnection/disconnection logic, lobbies and game finding, in-game chat, synchronized state around starting and ending games, resignations, and such.

shlewis • today at 12:29 AM

How much did it cost?

l3x4ur1n • yesterday at 7:40 PM

Played it to the end. Pretty neat!

veqq • today at 4:23 AM

wow
