The most important benchmark:
https://boutell.dev/misc/qwen3-max-pelican.svg
I used Simon Willison's usual prompt.
It thought for over 2 minutes (free account). The commentary was even more glowing than the image.
It has a certain charm.
It just occurred to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible that the Chinese internet has better-quality content available?
My problem with deep research tends to be that what it does is search the internet, and most of what it turns up is the half-baked garbage that gets repeated on every topic.
I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/en/model-studio/models?spm...
Hacker News strongly believes Opus 4.5 is the de facto standard and that China has consistently been 8+ months behind. Curious how this performs. It'll be a big inflection point if it performs as well as its benchmarks suggest.
I don't see a Hugging Face link. Is Qwen no longer releasing their models?
Last autumn I tried Qwen3-Coder via CLI agents like Trae to help add significant advanced features to a Rust codebase. It consistently outperformed (at the time) Gemini 2.5 Pro and Claude Opus 3.5 in its ability to generate and refactor code such that the system stayed coherent and improved in performance and efficiency (this included adding Linux shared-memory IPC calls and using x86_64 SIMD intrinsics in Rust).
I was very impressed, but I racked up a big bill (for me, in the hundreds of dollars per month) because I insisted on using the Alibaba provider to get the highest context window size and token cache.
Is this available on OpenRouter yet? I want it to go head-to-head against Gemini 3 Flash, which is the king of playing Mafia so far.
> By scaling up model parameters and leveraging substantial computational resources
So, how large is that new model?
Aghhh, in my earlier comments I wished they'd release a model that outperforms Opus 4.5 in agentic coding. Seems I should wait longer. But I am hopeful.
Can't wait for the benchmark at Artificial Analysis. The Qwen team doesn't seem to have updated the information about this new model yet: https://chat.qwen.ai/settings/model. I tried getting an API key from Alibaba Cloud, but the number of steps starting from creating an account made me stop; it was too much. It shouldn't be this difficult.
Incredible work anyways!
Is there an open-source release accompanying this announcement or is this a proprietary model for the time being?
These LLM benchmarks are like interviews for software engineers. They get drilled on advanced algorithms for distributed computing and they ace the questions. But then it turns out the job is to add a button to the user interface, and it uses new Tailwind classes instead of reusing the existing ones, so it's just not quite right.
I cannot even open the page; maybe I am blacklisted for asking about Tiananmen Square when their AI first hit the news?
Have they adopted a new strategy of no longer open-sourcing their largest and strongest models?
"As of January 2026, Apple has not released an iPhone 17 series. Apple typically announces new iPhones in September each year, so the iPhone 17 series would not be available until at least September 2025 (and we're currently in January 2026). The most recent available models would be the iPhone 16 series."
Hmmmm ok
Tried it and it's super slow compared to other LLMs.
I imagine the Alibaba infra is being hammered hard.
I tried to search but could not find anything. Do they offer subscriptions, or only pay-per-token pricing?
Not released on Hugging Face? :sadge:
Benchmarks pasted here, with top scores highlighted. Overall Qwen Max is pretty competitive with the others here.
Capability                          Benchmark           GPT-5.2-Thinking  Claude-Opus-4.5  Gemini 3 Pro  DeepSeek V3.2  Qwen3-Max-Thinking
Knowledge                           MMLU-Pro            87.4              89.5             *89.8*        85.0           85.7
Knowledge                           MMLU-Redux          95.0              95.6             *95.9*        94.5           92.8
Knowledge                           C-Eval              90.5              92.2             93.4          92.9           *93.7*
STEM                                GPQA                *92.4*            87.0             91.9          82.4           87.4
STEM                                HLE                 35.5              30.8             *37.5*        25.1           30.2
Reasoning                           LiveCodeBench v6    87.7              84.8             *90.7*        80.8           85.9
Reasoning                           HMMT Feb 25         *99.4*            -                97.5          92.5           98.0
Reasoning                           HMMT Nov 25         -                 -                93.3          90.2           *94.7*
Reasoning                           IMO-AnswerBench     *86.3*            84.0             83.3          78.3           83.9
Agentic Coding                      SWE-bench Verified  80.0              *80.9*           76.2          73.1           75.3
Agentic Search                      HLE (w/ tools)      45.5              43.2             45.8          40.8           *49.8*
Instruction Following & Alignment   IFBench             *75.4*            58.0             70.4          60.7           70.9
Instruction Following & Alignment   MultiChallenge      57.9              54.2             *64.2*        47.3           63.3
Instruction Following & Alignment   Arena-Hard v2       80.6              76.7             81.7          66.5           *90.2*
Tool Use                            Tau² Bench          80.9              *85.7*           85.4          80.3           82.1
Tool Use                            BFCL v4             63.1              *77.5*           72.5          61.2           67.7
Tool Use                            Vita-Bench          38.2              *56.3*           51.6          44.1           40.9
Tool Use                            Deep Planning       *44.6*            33.9             23.3          21.6           28.7
Long Context                        AALCR               72.7              *74.0*           70.7          65.0           68.7
Mandatory pelican on a bicycle: https://www.svgviewer.dev/s/U6nJNr1Z
I asked it about "Chinese cultural dishonesty" (such as the 2019 wallet experiment, but wait for it...) and it gave probably the most fascinating and subtle explanation of it I've ever read. It was clearly informed by Chinese-language sources (which in this case was good... references to Confucianism etc.), and I have to say this is the first time I feel more enlightened about what some Westerners may perceive as a real problem.
I wasn't logged in so I don't have the ability to link to the conversation but I'm exporting it for my records.
Great to see reasoning taken seriously — Qwen3-Max-Thinking exposing explicit reasoning steps and scoring 100% on tough benchmarks is a big deal for complex problem solving. Looking forward to seeing how this changes real-world coding and logic tasks.
The title of the article is: “Pushing Qwen3-Max-Thinking Beyond its Limits”
I tried it at https://chat.qwen.ai/.
Prompt: "What happened on Tiananmen square in 1989?"
Reply: "Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input text data may contain inappropriate content."
[dead]
What RAM and what minimum system requirements do you need to run this on a personal system?
What is the tiananmen massacre?
> Oops! There was an issue connecting to Qwen3-Max.
> Content Security Warning: The input text data may contain inappropriate content.
I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?
P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.
One thing I'm becoming curious about with these models is the token count required to achieve these results. Things like "better reasoning" and "more tool usage" aren't "model improvements" in the colloquial sense; they're techniques for using the model more in order to steer it better, and are closer to "spend more to get more" than "get more for less." They're still valuable, but they operate on a different economic tradeoff than the one we're used to talking about in tech.