mlmonkey • today at 12:30 AM • 4 replies

Here are the numbers from their bar chart:

    1. SWE-bench Pro
    Model              Score (%)
    GLM-5.2            62.1
    GLM-5.1            58.4
    Claude Opus 4.8    69.2
    GPT-5.5            58.6
    Gemini 3.1 Pro     54.2

    2. Terminal-Bench 2.1
    Model              Score (%)
    GLM-5.2            81.0
    GLM-5.1            63.5
    Claude Opus 4.8    85.0
    GPT-5.5            84.0
    Gemini 3.1 Pro     74.0

    3. NL2Repo
    Model              Score (%)
    GLM-5.2            48.9
    GLM-5.1            42.7
    Claude Opus 4.8    69.7
    GPT-5.5            50.7
    Gemini 3.1 Pro     33.4

    4. DeepSWE
    Model              Score (%)
    GLM-5.2            46.2
    GLM-5.1            18.0
    Claude Opus 4.8    58.0
    GPT-5.5            70.0
    Gemini 3.1 Pro     10.0

    5. ProgramBench
    Model              Score (%)
    GLM-5.2            63.7
    GLM-5.1            50.9
    Claude Opus 4.8    71.9
    GPT-5.5            70.8
    Gemini 3.1 Pro     39.5

    6. MCP-Atlas
    Model              Score (%)
    GLM-5.2            77.0
    GLM-5.1            71.8
    Claude Opus 4.8    77.8
    GPT-5.5            75.3
    Gemini 3.1 Pro     69.2

    7. Tool-Decathlon
    Model              Score (%)
    GLM-5.2            48.2
    GLM-5.1            40.7
    Claude Opus 4.8    59.9
    GPT-5.5            55.6
    Gemini 3.1 Pro     48.8

    8. Humanity's Last Exam
    Model              Base Score (%)    Score w/ Tools (%)
    GLM-5.2            40.5              54.7
    GLM-5.1            31.0              52.3
    Claude Opus 4.8    49.8              57.9
    GPT-5.5            41.4              52.2
    Gemini 3.1 Pro     45.0              51.4

GLM-5.2 seems to be handily beating Gemini 3.1 Pro. What _is_ Google DeepMind doing (other than bleeding talent to A\)?
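For anyone who wants to check the "handily beating" claim against the numbers above, here is a minimal sketch that computes the per-benchmark margin. It assumes the comparison is GLM-5.2 versus Gemini 3.1 Pro, and the names `scores`, `margins`, and `wins` are made up for illustration. The figures are copied straight from the tables, with the two Humanity's Last Exam columns treated as separate entries.

```python
# Scores copied from the tables above as (GLM-5.2, Gemini 3.1 Pro) pairs.
# The dictionary layout and variable names are illustrative assumptions.
scores = {
    "SWE-bench Pro": (62.1, 54.2),
    "Terminal-Bench 2.1": (81.0, 74.0),
    "NL2Repo": (48.9, 33.4),
    "DeepSWE": (46.2, 10.0),
    "ProgramBench": (63.7, 39.5),
    "MCP-Atlas": (77.0, 69.2),
    "Tool-Decathlon": (48.2, 48.8),
    "HLE (base)": (40.5, 45.0),
    "HLE (w/ tools)": (54.7, 51.4),
}

# Margin in percentage points; positive means GLM-5.2 is ahead.
margins = {
    name: round(glm - gemini, 1) for name, (glm, gemini) in scores.items()
}

# Columns where GLM-5.2 scores strictly higher than Gemini 3.1 Pro.
wins = [name for name, margin in margins.items() if margin > 0]

for name, margin in margins.items():
    print(f"{name:<20} {margin:+.1f}")
print(f"GLM-5.2 ahead on {len(wins)} of {len(margins)} columns")
```

By this tally GLM-5.2 leads on 7 of the 9 columns, trailing Gemini 3.1 Pro only on Tool-Decathlon (by 0.6 points) and the base Humanity's Last Exam score (by 4.5 points).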

Replies

vineyardmike • today at 3:07 AM

> What _is_ Google DeepMind doing

I feel like what’s happening has been pretty visible, between their press, products, and financial statements. It’s just not what people are accustomed to expecting.

First, Google has become a major compute provider for competitors, thanks to TPUs. They’ve talked about allocating TPUs to GCP instead of their first-party products. I can only assume it’s because they’re collecting a higher margin, and it covers the cost of the data center buildout, which they’ve been pursuing aggressively. I wouldn’t be surprised if they made the financial decision to delay or slow training for Gemini 3.5 when they provided last-minute compute to Anthropic this spring.

Second, Gemini has quite plainly not been focused on agentic coding, though 3.5 Flash may be the change. They’ve built models they can deploy to watch YouTube videos and Nest cameras, scale to AI in search, understand fitness info in Fitbit, etc. They’ve put a ton of effort into multimodal data in and out, and they’re the only major lab still working on video generation. There was a leak/rumor that their cofounder (Brin) was getting involved in model training to renew the focus on agents, so maybe this will change, and again, 3.5 already feels different.

linzhangrun • today at 5:58 AM

Just waiting for the 3.5 Pro they said would come out this month. Gemini is pretty much useless for any serious work right now.

verdverm • today at 2:16 AM

copying the graphs and tables to HN is noisy and harder to read
