logoalt Hacker News

raincoleyesterday at 7:17 PM11 repliesview on HN

Even before this, Gemini 3 has always felt unbelievably 'general' for me. It can beat Balatro (ante 8) with text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:

1. It's an LLM, not something trained to play Balatro specifically

2. Most (probably >99.9%) players can't do that at the first attempt

3. I don't think there are many people who posted their Balatro playthroughs in text form online

I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.

[0]: https://balatrobench.com/


Replies

nerdsnipertoday at 12:34 AM

My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seems stronger at math and science for me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything that I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or Chat-GPT.

Comparisons generally seem to change much faster than I can keep my mental model updated. But the performance lead of Gemini on more ‘academic’ explorations of science, math, engineering, etc has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.

I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.

They all still suck at writing an actually good essay/article/literary or research review, or other long-form things which require a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor - there’s just so much nuance and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform these tasks to a high degree of performance either. I myself am only successful some percentage of the time.

tlyesterday at 10:27 PM

Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty on the deck aimed at new players. Round 24 is ante 8's final round. Per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time.

show 1 reply
ankit219yesterday at 10:28 PM

Agreed. Gemini 3 Pro for me has always felt like it has had a pretraining alpha if you will. And many data points continue to support that. Even as flash, which was post trained with different techniques than pro is good or equivalent at tasks which require post training, occasionally even beating pro. (eg: in apex bench from mercor, which is basically a tool calling test - simplifying - flash beats pro). The score on arc agi2 is another datapoint in the same direction. Deepthink is sort of parallel test time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding) same as gpt-5.2-pro and can extract more because of pretraining datasets.

(i am sort of basing this on papers like limits of rlvr, and pass@k and pass@1 differences in rl posttraining of models, and this score just shows how "skilled" the base model was or how strong the priors were. i apologize if this is not super clear, happy to expand on what i am thinking)

ebiesteryesterday at 8:58 PM

It's trained on YouTube data. It's going to get roffle and drspectred at the very least.

silver_sunyesterday at 8:51 PM

Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.

Nonetheless I still think it's impressive that we have LLMs that can just do this now.

show 2 replies
winstonpyesterday at 7:44 PM

DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years

show 1 reply
dudisubektiyesterday at 8:04 PM

But... there's Deepseek v3.2 in your link (rank 7)

show 1 reply
littlestymaaryesterday at 8:03 PM

> . I don't think there are many people who posted their Balatro playthroughs in text form online

There are *tons* of balatro content on YouTube though, and it makes absolutely zero doubt that Google is using YouTube content to train their model.

show 1 reply
tehsauceyesterday at 10:17 PM

How does it do on gold stake?

acid__yesterday at 8:27 PM

> Most (probably >99.9%) players can't do that at the first attempt

Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.

Falsintioyesterday at 8:09 PM

[dead]