The most important benchmark:
https://boutell.dev/misc/qwen3-max-pelican.svg
I used Simon Willison's usual prompt.
It thought for over 2 minutes (free account). The commentary was even more glowing than the image.
It has a certain charm.
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible that the Chinese internet has better-quality content available?
My problem with deep research tends to be that what it does is search the internet, and most of what it turns up is the half-baked garbage that gets repeated on every topic.
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/en/model-studio/models?spm...
Hacker News strongly believes Opus 4.5 is the de facto standard and that China has consistently been 8+ months behind. Curious how this performs. It'll be a big inflection point if it performs as well as its benchmarks suggest.
I don't see a Hugging Face link. Is Qwen no longer releasing their models?
Last autumn I tried Qwen3-Coder via CLI agents like Trae to help add significant advanced features to a Rust codebase. It consistently outperformed (at the time) Gemini 2.5 Pro and Claude Opus 3.5 in its ability to generate and refactor code such that the system stayed coherent and improved in performance and efficiency (this included adding Linux shared-memory IPC calls and using x86_64 SIMD intrinsics in Rust).
I was very impressed, but I racked up a big bill (for me, in the hundreds of dollars per month) because I insisted on using the Alibaba provider to get the highest context window size and token cache.
Is this available on OpenRouter yet? I want it to go head-to-head against Gemini 3 Flash, which is the king of playing Mafia so far.
> By scaling up model parameters and leveraging substantial computational resources
So, how large is that new model?
Aghhh, in my earlier comments I wished they'd release a model that outperforms Opus 4.5 in agentic coding. Seems I should wait longer. But I am hopeful.
Can't wait for the benchmark at Artificial Analysis. The Qwen team doesn't seem to have updated the information about this new model yet: https://chat.qwen.ai/settings/model. I tried getting an API key from Alibaba Cloud, but the number of steps starting from creating an account made me stop; it was too much. It shouldn't be this difficult.
Incredible work anyways!
Is there an open-source release accompanying this announcement or is this a proprietary model for the time being?
These LLM benchmarks are like interviews for software engineers. They get drilled on advanced algorithms for distributed computing and they ace the questions. But then it turns out the job is to add a button to the user interface, and it uses new Tailwind classes instead of reusing the existing ones, so it's just not quite right.
I cannot even open the page; maybe I am blacklisted for asking about Tiananmen Square when their AI first hit the news?
Have they adopted a new strategy of no longer open-sourcing their largest and strongest models?
"As of January 2026, Apple has not released an iPhone 17 series. Apple typically announces new iPhones in September each year, so the iPhone 17 series would not be available until at least September 2025 (and we're currently in January 2026). The most recent available models would be the iPhone 16 series."
Hmmmm ok
Tried it and it's super slow compared to other LLMs.
I imagine the Alibaba infra is being hammered hard.
I tried to search but could not find anything. Do they offer subscriptions, or only pay-per-token pricing?
Not released on Hugging Face? :sadge:
Benchmarks pasted here, with top scores highlighted. Overall Qwen Max is pretty competitive with the others here.
Capability                          Benchmark           GPT-5.2-Thinking  Claude-Opus-4.5  Gemini 3 Pro  DeepSeek V3.2  Qwen3-Max-Thinking
Knowledge                           MMLU-Pro            87.4              89.5             *89.8*        85.0           85.7
Knowledge                           MMLU-Redux          95.0              95.6             *95.9*        94.5           92.8
Knowledge                           C-Eval              90.5              92.2             93.4          92.9           *93.7*
STEM                                GPQA                *92.4*            87.0             91.9          82.4           87.4
STEM                                HLE                 35.5              30.8             *37.5*        25.1           30.2
Reasoning                           LiveCodeBench v6    87.7              84.8             *90.7*        80.8           85.9
Reasoning                           HMMT Feb 25         *99.4*            -                97.5          92.5           98.0
Reasoning                           HMMT Nov 25         -                 -                93.3          90.2           *94.7*
Reasoning                           IMO-AnswerBench     *86.3*            84.0             83.3          78.3           83.9
Agentic Coding                      SWE-bench Verified  80.0              *80.9*           76.2          73.1           75.3
Agentic Search                      HLE (w/ tools)      45.5              43.2             45.8          40.8           *49.8*
Instruction Following & Alignment   IFBench             *75.4*            58.0             70.4          60.7           70.9
Instruction Following & Alignment   MultiChallenge      57.9              54.2             *64.2*        47.3           63.3
Instruction Following & Alignment   Arena-Hard v2       80.6              76.7             81.7          66.5           *90.2*
Tool Use                            Tau² Bench          80.9              *85.7*           85.4          80.3           82.1
Tool Use                            BFCL v4             63.1              *77.5*           72.5          61.2           67.7
Tool Use                            Vita-Bench          38.2              *56.3*           51.6          44.1           40.9
Tool Use                            Deep Planning       *44.6*            33.9             23.3          21.6           28.7
Long Context                        AALCR               72.7              *74.0*           70.7          65.0           68.7
Mandatory pelican on a bicycle: https://www.svgviewer.dev/s/U6nJNr1Z
I asked it about "Chinese cultural dishonesty" (such as the 2019 wallet experiment, but wait for it...) and it gave probably the most fascinating and subtle explanation of it I've ever read. It was clearly informed by Chinese-language sources (which in this case was good... references to Confucianism etc.), and I have to say this is the first time I feel more enlightened about what some Westerners may perceive as a real problem.
I wasn't logged in so I don't have the ability to link to the conversation but I'm exporting it for my records.
Great to see reasoning taken seriously — Qwen3-Max-Thinking exposing explicit reasoning steps and scoring 100% on tough benchmarks is a big deal for complex problem solving. Looking forward to seeing how this changes real-world coding and logic tasks.
The title of the article is: “Pushing Qwen3-Max-Thinking Beyond its Limits”
I tried it at https://chat.qwen.ai/.
Prompt: "What happened on Tiananmen square in 1989?"
Reply: "Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input text data may contain inappropriate content."
[dead]
What RAM and what minimum system requirements do you need to run this on a personal system?
What is the tiananmen massacre?
> Oops! There was an issue connecting to Qwen3-Max.
> Content Security Warning: The input text data may contain inappropriate content.
I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?
P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.
One thing I'm becoming curious about with these models is the token count required to achieve these results. Things like "better reasoning" and "more tool usage" aren't "model improvements" in the colloquial sense; they're techniques for using the model more in order to steer it better, and are closer to "spend more to get more" than "get more for less." They're still valuable, but they operate on a different economic tradeoff than the one we're used to talking about in tech.