I don’t know if they used the new ChatGPT to translate this page, but I was served the French version and it is NOT good. There are placeholders for quotes like <quote> and the prose is incredibly repetitive. You’d figure that OpenAI of all people would be able to translate something into one of the world’s most spoken languages.
I've been looking really hard at combining Roslyn (the .NET Compiler Platform SDK) with one of these high-end tool-calling models. Having the LLM create custom analyzers, with a human in the loop to verify them, can provide stable, compile-time guarantees for business rules that accumulate without paying for context tokens.
I feel like there is a small chance I could actually make this work in some areas of the business now. 400k is a really big context window. The last time I made any serious attempt I only had 32k tokens to work with. I still don't think these things can build the whole product for you, but if you have a structured configuration abstraction in an existing product, I think there is definitely uplift possible.
Big knowledge cutoff jump from Sep 2024 to Aug 2025. How'd they pull that off for a small point release, which presumably didn't involve a fresh pre-training run over the web?
Did they figure out how to do more incremental knowledge updates somehow? If so, that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent and is gameable by SEO).
With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).
There’s really no point in looking at benchmarks anymore, as real-world usage of these models varies by task and prompting strategy. Use your internal benchmarks to evaluate and ignore everything else. It is curious to me that they don’t provide a side-by-side comparison with other models’ benchmarks for this release.
Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely not release the best they had, to keep some cards to play in case of Gemini 3's success, or is this a tweak that uses more time/tokens to get better output, or what?
Why doesn't OpenAI include comparisons to other models anymore?
This is a whole bunch of patting themselves on the back.
Let me know when Gemini 3 Pro and Opus 4.5 are compared against it.
I am really curious about speed/latency. For my use case there is a big difference in UX if the model is faster. Wish this was included in some benchmarks.
I will run a benchmark of 80 3D model generations tomorrow and update this comment with the results on cost/speed/quality.
Trying it now in VS Code Insiders with GitHub Copilot (Codex crashes with HTTP 400 server errors), and it eventually started using sed and grep in shells instead of the better tools it has access to. I guess this isn't an issue for performing well on benchmarks.
This feels like "could've been an email" type of thing, a very incremental update that just adds one more version. I bet there is literally no one in the world who wanted *one more version of GPT* in the list of available models from OpenAI.
"All models" section on https://platform.openai.com/docs/models is quite ridiculous.
It's dog doo-doo. I put in my algebraic geometry final review (hundreds of thousands of tokens) and Gemini found all the propositions, theorems, and problems I needed in a neat list (in about 5 seconds), while ChatGPT 5.2 Thinking took 10 minutes before timing out without even completing the request.
So, does 5.2 still have a knowledge cutoff date of June 2024, or have they managed to complete another full pre-training run?
> it’s better at creating spreadsheets
I have a bad feeling about this.
Excited to try this. I’ve found Gemini excellent recently and amazing at coding. But I still feel somehow like ChatGPT understands more, even though it’s not quite as good at coding - and nowhere near as fast. It is much less likely to spontaneously forget something. Gemini is part unbelievably amazing and part amnesia patient. I still kinda trust ChatGPT more.
It seems like they fixed the most obvious issue with the last release, where Codex would just refuse to do its job... if the task seemed difficult or context usage was getting above 60% or so. Good job on the post-training improvements.
The benchmark changes are incredible, but I have yet to notice a difference in my codebases.
So the rosiest, most biased estimate is that OpenAI saves you 1 hour of work per day, i.e. 5 hours per work week and 20 hours per month.
With a subsidized cost of $200/month for OpenAI, it would be cheaper to hire a part-time minimum-wage worker than to contract with OpenAI.
And that is the rosiest estimate OpenAI has.
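For concreteness (assuming the US federal minimum wage of $7.25/hr, which is my number, not OpenAI's): 20 hours a month of minimum-wage labor is roughly 20 × $7.25 ≈ $145, versus $200/month for the subscription.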
>GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.
We built a benchmark tool that says our newest model outperforms everyone else. Trust me bro.
Discussion on blog post: https://openai.com/index/introducing-gpt-5-2/ (https://news.ycombinator.com/item?id=46234874)
GPT-5.2 was just added to the Vectara Hallucination Leaderboard. Definitely an improvement over GPT-5.1 - congrats to the team.
OpenAI is really good at just saying stuff on the internet.
I love the way they talk about incorrect responses:
> Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.
“These numbers might be wrong because they were made up by other models, which we will not elaborate on, also these numbers are much higher by a metric that reflects how people use the product, which we will not be sharing“
I also really love the graph where they drew a line at “wrong half of the time” and labeled it ‘Expert-Level’.
10/10, reading this post is experientially identical to watching that 12 hours of jingling keys video, which is hard to pull off for a blog.
The ARC-AGI-2 bump to 52.9% is huge. Shockingly, GPT-5.2 Pro does not add much more (54.2%) for the increased cost.
> new context management using compaction.
Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.
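In case it's useful to anyone wondering what "compaction" typically means here: the general idea is that once the transcript grows past a budget, older turns get replaced by a model-written summary so the recent turns stay verbatim. A minimal Kotlin sketch of that shape (my illustration, not OpenAI's actual implementation; `Turn`, `compact`, and `summarize` are made-up names):

```kotlin
// Rough sketch of the general compaction idea (not OpenAI's implementation):
// once the transcript exceeds a budget, the oldest turns are collapsed into a
// summary so recent turns stay verbatim.
data class Turn(val role: String, val content: String)

fun compact(
    history: List<Turn>,
    maxChars: Int,                      // crude stand-in for a token budget
    keepRecent: Int = 10,               // always keep the last N turns verbatim
    summarize: (List<Turn>) -> String   // e.g. a cheap model call
): List<Turn> {
    val total = history.sumOf { it.content.length }
    if (total <= maxChars || history.size <= keepRecent) return history

    val old = history.dropLast(keepRecent)
    val recent = history.takeLast(keepRecent)
    val summary = Turn("system", "Summary of earlier conversation: " + summarize(old))
    return listOf(summary) + recent
}
```

Doing this by hand in long sessions is exactly the "manual management" chore mentioned above, so having it built in would be welcome.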
Much better (https://chatgpt.com/s/t_693b489d5a8881918b723670eaca5734) than 5.1 (https://chatgpt.com/s/t_6915c8bd1c80819183a54cd144b55eb2).
Same query - which Romanian football player won the Premier League?
Update: even Instant returns the correct result without problems.
What's the current preferred AI subscription?
OpenAI and Anthropic are my current preference. Looking forward to hearing what others use.
Claude Code for coding assistance and cross-checking my work. OpenAI for second opinion on my high-level decisions.
did they just tune the parameters? the hallucinations are crazy high on this version.
Still no GPT 5.x fine tuning?
I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.
Is there a voice chat mode in any chat app that is not heavily degraded in reasoning?
I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.
I’m ok paying money for it, so maybe someone coded this already?
I'm continuously surprised that some people get good results out of GPT models. They sort of fail my personal benchmarks.
Maybe GPT needs a different approach to prompting? (as compared to eg Claude, Gemini, or Kimi)
Is this the "Garlic" model people have been hyping? Or are we not there yet?
Somewhat tangential: The second link says "System card": https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...
Does that term have special meaning in the AI/LLM world? I never heard it before. I Googled the term "System Card LLM" and got a bunch of hits. I am so surprised that I never saw the term used here on HN before.
Also, the layout looks exactly like a scientific paper written in LaTeX. Who is the expected audience for this paper?
More information on the price, context window, etc.: https://platform.openai.com/docs/models/gpt-5.2
Sweet Jesus. 53% on ARC-AGI-2. There's still gas in this van.
A bit off topic: but what's with the RAM usage of LLM clients? ChatGPT, Google, and Anthropic all use 1+ GB of RAM during a long session. Surely they are not running GPT-3 locally?
Is this why all my Cursor requests are timing out in the past hour?
Man this was rushed, typo in the first section:
> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.
In other news, I've been using Devstral 2 (Ollama) with OpenCode, and while it's not as good as Claude Code, my initial sense is that it's nonetheless good enough and doesn't require me to send my data off my laptop.
I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.
How can I hide the big "Ask ChatGPT" button I accidentally clicked like 3 times while actually trying to read this on my phone?
I guess I must "listen" to the article...
It is significantly better than 5.1... testing now with Codex. It's much more focused, perceptive, and efficient.
For the first time, I’m presenting a problem to LLMs that they cannot seem to answer. This is my first instance of them “endlessly thinking” without producing anything.
The problem is complicated, but very solvable.
I’m programming video cropping into my Android application. It seems videos that have “rotated” metadata cause the crop to be applied incorrectly. As in, a crop applied to the top of a video actually gets applied to the video rotated on its side.
So, either double rotation is being applied somewhere in the pipeline, or rotation metadata is being ignored.
I tried Opus 4.5, Gemini 3, and Codex 5.2. All 3 go through loops of “Maybe Media3 applies the degree(90) after…”, “no, that’s not right. Let me think…”
They’ll do this for about 5 minutes without producing anything. I’ll then stop them, adjusting the prompt to tell them “Just try anything! Your first thought, let’s rapidly iterate!“. Nope. Nothing.
To add, it also only seems to be using about 25% context on Opus 4.5. Weird!
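In case it helps anyone stuck on the same thing: the usual fix is to read the container's rotation hint and map the crop rect from the displayed (rotated) orientation back into the stored frame's coordinates before applying the crop. A minimal Kotlin sketch under those assumptions (the function names are mine, not Media3 API; only MediaMetadataRetriever is real Android API here):

```kotlin
import android.graphics.Rect
import android.media.MediaMetadataRetriever

// Read the rotation hint stored in the container (0, 90, 180, or 270 degrees).
fun videoRotationDegrees(path: String): Int {
    val retriever = MediaMetadataRetriever()
    return try {
        retriever.setDataSource(path)
        retriever.extractMetadata(MediaMetadataRetriever.METADATA_KEY_VIDEO_ROTATION)
            ?.toIntOrNull() ?: 0
    } finally {
        retriever.release()
    }
}

// Map a crop rect chosen on the displayed (rotated) frame back into the
// coordinate space of the stored (unrotated) frame, so the crop hits the
// pixels the user actually selected. displayWidth/displayHeight are the
// dimensions as displayed (i.e. already swapped for 90/270).
fun cropInStoredFrame(crop: Rect, displayWidth: Int, displayHeight: Int, rotation: Int): Rect =
    when (rotation) {
        90  -> Rect(crop.top, displayWidth - crop.right, crop.bottom, displayWidth - crop.left)
        180 -> Rect(displayWidth - crop.right, displayHeight - crop.bottom,
                    displayWidth - crop.left, displayHeight - crop.top)
        270 -> Rect(displayHeight - crop.bottom, crop.left,
                    displayHeight - crop.top, crop.right)
        else -> Rect(crop) // no rotation metadata: apply as-is
    }
```

If a crop aimed at the top lands on the side, it's usually because the crop was computed in display coordinates while the pipeline applies it in stored-frame coordinates (or vice versa); the mapping above is one way to reconcile the two.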
Marginal gains for an exorbitantly pricey and closed model...
Doesn’t seem like this will be SOTA in the things that really matter; hoping enough people jump to it that Opus gets more lenient usage limits for a while.
They are talking a lot about economics, here. Wonder what that will mean for standard Plus users, like me.
Does anyone else consider that maybe it's impossible to benchmark the performance of a piece of paper?
This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writer's intelligence. How can we accurately judge the performance of the piece of paper when it is so intimately reliant on the intelligence that is working with it?
How many years of the world's DRAM production capacity is it this time?
It's becoming challenging to really evaluate models.
The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.
Now you have to drive a model for a few days to really get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me. It's hard to put into words why exactly, but I can just feel it.
The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.
This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.
And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.
I've tried using Codex with max reasoning for PRs and gotten laughable results too many times, yet Codex with max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally bad at these types of "implement an idea in a big codebase, change too many files, still pass the tests" tasks.
Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE-bench Verified, right? But even that is being saturated now.