Hacker News

GPT-5.5

1405 points by rd yesterday at 6:01 PM | 930 comments

Comments

tedsanders yesterday at 6:13 PM

Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.

(I work at OpenAI.)

show 14 replies
simonw yesterday at 7:24 PM

This doesn't have API access yet, but OpenAI seem to approve of the Codex API backdoor used by OpenClaw these days... https://twitter.com/steipete/status/2046775849769148838 and https://twitter.com/romainhuet/status/2038699202834841962

And that backdoor API has GPT-5.5.

So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...

I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex

UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...

show 19 replies
jfkimmes yesterday at 6:55 PM

Everyone talked about the marketing stunt that was Anthropic's gated Mythos model with an 83% result on CyberGym. OpenAI just dropped GPT 5.5, which scores 82% and is open for anybody to use.

I recommend that anybody in offensive/defensive cybersecurity experiment with this. This is the real data point we needed - without the hype!

Never thought I'd say this but OpenAI is the 'open' option again.

show 10 replies
Someone1234 yesterday at 6:30 PM

I'd like to draw people's attention to this section of this page:

https://developers.openai.com/codex/pricing?codex-usage-limi...

Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficiency should make it break even with 5.4, but the point stands: tighter limits/higher prices.

show 3 replies
minimaxir yesterday at 6:08 PM

The part of the announcement more interesting than "it's better at benchmarks":

> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.

The ability of agentic LLMs to improve computational efficiency/speed is a highly impactful domain that I wish were exercised by more than benchmarks. From my experience Opus is still much better than GPT/Codex in this respect, but given that OpenAI is getting material gains out of this type of performance-maxxing, and has an increasing incentive to keep doing so given cost/capacity issues, I wonder if they will continue optimizing for it.
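
The announcement doesn't say what heuristics Codex actually wrote, but a classic baseline for this kind of work partitioning is greedy longest-processing-time (LPT) assignment. A minimal illustrative sketch, with the task costs made up:

```python
import heapq

def partition_lpt(task_costs, n_gpus):
    """Greedy LPT: assign each task (largest first) to the least-loaded GPU."""
    heap = [(0.0, i, []) for i in range(n_gpus)]  # (load, gpu_id, assigned tasks)
    heapq.heapify(heap)
    for cost in sorted(task_costs, reverse=True):
        load, gpu, tasks = heapq.heappop(heap)  # least-loaded GPU
        tasks.append(cost)
        heapq.heappush(heap, (load + cost, gpu, tasks))
    return sorted(heap)  # (load, gpu_id, tasks), ordered by gpu load

# Hypothetical per-request costs inferred from traffic logs
balanced = partition_lpt([9, 7, 6, 5, 4, 3, 2], 3)
loads = [load for load, _, _ in balanced]
# Total work is 36, so the ideal per-GPU load is 12; LPT gets within one unit.
```

Real production balancing would also weigh memory, batching, and latency constraints; this just shows the shape of the problem Codex was reportedly pointed at.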

show 3 replies
astlouis44 yesterday at 6:10 PM

A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools.

The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably Meshy, Tripo.ai, or similar) and not generated by 5.5 itself.

It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact that it's not even a game engine, just a web rendering library.

show 11 replies
6thbit yesterday at 7:19 PM

                          Mythos     5.5
    SWE-bench Pro          77.8%*   58.6%
    Terminal-bench-2.0     82.0%    82.7%*
    GPQA Diamond           94.6%*   93.6%
    H. Last Exam           56.8%*   41.4%
    H. Last Exam (tools)   64.7%*   52.2%    
    BrowseComp             86.9%    84.4%  (90.1% Pro)*
    OSWorld-Verified       79.6%*   78.7%

Still far from Mythos on SWE-bench, but quite comparable otherwise. Source for Mythos values: https://www.anthropic.com/glasswing

show 4 replies
silvertaza yesterday at 7:57 PM

Still a huge hallucination rate, unfortunately: 86%. For comparison, Opus sits at 36%.

Source: https://artificialanalysis.ai/models?omniscience=omniscience...

show 3 replies
mudkipdev yesterday at 7:11 PM

This is 3x the price of GPT-5.1, released just 6 months ago. Is no one else alarmed by the trend? What happens when the cheaper models are deprecated/removed over time?

show 10 replies
applfanboysbgon yesterday at 6:07 PM

If there's a bingo card for model releases, "our [superlative] and [superlative] model yet" is surely the free space.

show 4 replies
vthallam yesterday at 7:03 PM

This model is great at long-horizon tasks, and Codex now has heartbeats, so it can keep checking on things. Give it your hardest problem with verifiable constraints, one that would take hours, and you'll see how good this is :)

*I work at OAI.

show 5 replies
aliljet yesterday at 7:12 PM

I've found myself so deeply embedded in the Claude Max subscription that I'm worried about potentially making a switch. How are people making sure they stay nimble enough not to get trapped in one company's ecosystem? For what it's worth, Opus 4.7 has not been a step up, and it has come with enormously higher usage of the subscription Anthropic offers, making the entire offering doubly worse.

show 16 replies
_alternator_ yesterday at 8:23 PM

> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated.”

This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.

This matches my own experience and unease with these tools. I don't really have the patience to write code anymore because I can one-shot it with frontier models 10x faster. My role has shifted, and while it's awesome to get so much working so quickly, the fact is, when the tokens run out, I'm basically done working.

If Claude goes down, it's literally higher leverage for me to go for a walk than to write code: if I come back refreshed and Claude is working an hour later, I'll make more progress than if I'd mentally worn myself out reading a bunch of LLM-generated code and trying to figure out how to solve the problem manually.

Anyway, it continues to make me uneasy, is all I'm saying.

show 42 replies
h14h yesterday at 6:47 PM

This seems huge for subscription customers. Looking at the Artificial Analysis numbers, 5.5 at medium effort yields roughly the same intelligence as 5.4 (xhigh) while using less than a fifth of the tokens.

As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.

show 1 reply
BrokenCogs yesterday at 6:19 PM

I'm here for the pelicans and I'm not leaving until I see one!

show 5 replies
CompleteSkeptic yesterday at 7:23 PM

Is this the first time OpenAI has published comparisons to other labs?

Seems so to me - see the GPT-5.4[1] and 5.2[2] announcements.

Might be a tacit admission of being behind.

[1] https://openai.com/index/introducing-gpt-5-4/ [2] https://openai.com/index/introducing-gpt-5-2/

show 1 reply
khutorni today at 5:34 AM

> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated.”

That's a wild statement to put into your announcement. Are LLM providers now openly bragging about our collective dependency on their models?

gallerdude yesterday at 6:41 PM

If GPT-5.5 Pro really was Spud, and two years of pretraining culminated in one release, WOW, you cannot feel it at all from this announcement. If OpenAI wants to know why they feel like they've fallen behind Anthropic on vibes, they need look no further than their marketing department. This makes everything feel like a completely linear upgrade in every way.

show 2 replies
jryio yesterday at 6:12 PM

Their 'Preparedness Framework'[1] is 20 pages and looks ChatGPT-generated; I don't feel prepared after reading it.

[1] https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...

louiereederson yesterday at 6:16 PM

For a 56.7 score on the Artificial Analysis Intelligence Index, GPT 5.5 used 22m output tokens. For a score of 57, Opus 4.7 used 111m output tokens.

The efficiency gap is enormous. Maybe it's the difference between a GB200 NVL72 and an Amazon Trainium chip?
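
Spelling out the ratio behind that claim, using the scores and token counts quoted above:

```python
# Figures as quoted in the comment above
gpt_tokens, gpt_score = 22e6, 56.7    # GPT 5.5
opus_tokens, opus_score = 111e6, 57.0  # Opus 4.7

gpt_tpp = gpt_tokens / gpt_score    # output tokens per index point
opus_tpp = opus_tokens / opus_score

ratio = opus_tpp / gpt_tpp  # ~5.02: Opus spent ~5x the tokens per point
```

Of course, tokens per benchmark point says nothing by itself about whether the cause is model efficiency or serving hardware, as the comment speculates.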

show 4 replies
ativzzz yesterday at 6:08 PM

I like that they waited for Opus 4.7 to come out first so they had a few days to find the benchmarks that GPT 5.5 is better at.

show 2 replies
sosodev yesterday at 6:30 PM

I hope the industry starts competing more on highest scores with lowest tokens, like this. It's a win for everybody: it means the model is more intelligent, is cheaper to run inference on, and costs less for the end user.

So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.

show 1 reply
vanillameow today at 9:39 AM

Because Opus has been kind of degrading lately, I said "fuck it," made a new OAI account, and used the month-long free trial. I put one query into ChatGPT using 5.5 thinking; the frustrating thing was that it put more effort into getting correct answers than Opus, which just guesses. Specifically, I asked about the coding harness pi, and despite my explicitly referring to it as a harness, Opus 4.7, 4.6, and Sonnet 4.6 all fell back to telling me about Aider or OpenCode and ignored my query completely, while ChatGPT said "I'll assume pi is a harness" and then did in fact find the harness.

However, the language of ChatGPT is still the same slop as years ago: so many headings, so many emojis, so many "the important thing nobody mentions". Ten paragraphs of text for what should be a two-paragraph response. Even with custom instructions (keep answers short and succinct) and their settings (less list, less emoji, less fluff), it's still NOTICEABLY worse than Claude on base settings.

I've yet to test Codex, will get to that this weekend, but in terms of research or general Q&A I have no idea how anyone could prefer this to Claude. Unfortunately Claude has seemingly stopped giving a fuck about researching entirely.

blixt yesterday at 9:22 PM

Releases keep shifting from API-first to product-first, with the API now lagging behind proprietary product surfaces and special partnerships.

I wouldn't be surprised if this is the year some models simply stop being available as a plain API, while foundation-model companies succeed at capturing more use cases in their own software.

show 1 reply
losvedir yesterday at 6:33 PM

> It excels at ... researching online

How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?

I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc., but I'm sort of stumped about how I'm supposed to do the web search piece of it.

Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.
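
For the harness question: in the usual OpenAI-style tool-calling pattern, web search is just another tool the harness declares and executes itself; the model only emits a request to call it. A minimal sketch of the harness side (the tool name `web_search` and the `run_search` stub are hypothetical; the message shapes are modeled on the Chat Completions function-calling format):

```python
import json

# Tool declaration the harness advertises to the model with each request.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def run_search(query: str) -> str:
    """Harness-side implementation: call whatever search API you like here."""
    return json.dumps([{"title": "stub result", "url": "https://example.com"}])

def agent_step(model_message: dict) -> list:
    """Execute any tool calls in one model turn and build the tool replies."""
    replies = []
    for call in model_message.get("tool_calls", []):
        if call["function"]["name"] == "web_search":
            args = json.loads(call["function"]["arguments"])
            replies.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": run_search(args["query"]),
            })
    return replies  # appended to the conversation, then the model is called again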

show 3 replies
2001zhaozhao yesterday at 6:23 PM

Pricing: $5/1M input, $30/1M output

(same input price as Opus 4.7, and 20% higher output price)
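
Sanity-checking that parenthetical (the Opus price is implied, not quoted, and the request sizes are made up for illustration):

```python
gpt_in, gpt_out = 5.00, 30.00  # $/1M tokens, from the comment above
opus_in = gpt_in               # "same input price"
opus_out = gpt_out / 1.20      # "20% more output price" implies $25/1M for Opus

# Example: a 10k-input / 2k-output request on each model
cost_gpt = 10_000 / 1e6 * gpt_in + 2_000 / 1e6 * gpt_out
cost_opus = 10_000 / 1e6 * opus_in + 2_000 / 1e6 * opus_out
print(round(cost_gpt, 4), round(cost_opus, 4))  # → 0.11 0.1
```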

show 3 replies
baalimago yesterday at 6:15 PM

Worth the 100% price increase over GPT-5.4?

show 2 replies
vessenes yesterday at 6:24 PM

Yay. 5.4 was a frustrating model: moments of extreme intelligence (I liked it very much for code review), but also a sort of idiocy/literalism that made it very unsuited to vague prompting. I also found its OpenClaw engagement wooden and frustrating. Which didn't matter until Anthropic started charging $150 a day for Opus in OpenClaw.

Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.

thinkindie yesterday at 8:38 PM

This reminds me of when Chrome and Firefox were racing to release a new “major version” (at least from the semver POV) without adding significant new functionality, at a time when browsers were already becoming a commodity. Just as we no longer care about a new Chrome or Firefox version, so it will be with new model releases.

show 1 reply
NitpickLawyer yesterday at 6:53 PM

> Across all three evals, GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.

Yeah, this was the next step. Have RLVR make the model good; next iteration, start penalising long-but-correct and rewarding short-and-correct.

> CyberGym 81.8%

Mythos was self-reported at 83.1%... so not far off. Also, it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.

show 3 replies
kburman yesterday at 8:15 PM

What a time. I am back here genuinely wishing for OpenAI to release a great model, because without stiff competition, it feels like Anthropic has completely lost its mind.

show 1 reply
amiune today at 11:12 AM

Will there ever be ChatGPT 6.0 or Claude 5.0?

nickvec yesterday at 7:41 PM

I'm conflicted about whether I should keep my Claude Max 5x subscription at this point or switch back to GPT/Codex... anyone else in a similar position? I'd rather not be paying for two AI providers and context-switching between the two, though I'm having a hard time gauging whether Claude Code is still the "cream of the crop" for SWE work. I haven't played around with Codex much.

show 5 replies
svara today at 6:54 AM

Do we know if this is another post-training fine-tune or based on a much larger new pretraining run (which I believe they were calling 'Spud' internally)?

The large price bump might indicate the latter.

xingyi_dev today at 8:17 AM

Its coding chops are absolutely insane. Opus 4.7 was already a tough sell, but GPT 5.5 just made it completely irrelevant.

show 1 reply
kaant today at 8:35 AM

The '.5' models are always the actual production-ready versions. GPT-5 was for the mainstream hype, 5.5 is for the developers. I don't need it to be magically smarter; just give me lower latency, cheaper API tokens, and reliable tool-calling without hallucinations.

ZeroCool2u yesterday at 6:07 PM

Benchmarks are favorable enough that they're comparing to non-OpenAI models again. Interesting that tokens/second is similar to 5.4. Maybe there's some genuine innovation beyond "bigger model better" this time?

show 1 reply
Flow today at 9:57 AM

These new models consume so many tokens. I’m very satisfied with GPT-5.2 on High. I hope they keep that one for many years

M4R5H4LL yesterday at 8:17 PM

I am a heavy Claude Code user. I just tried using Codex with 5.4 (as a Plus user I don't have access to 5.5 yet), and it was quite underwhelming. The app stopped regularly, much earlier than I wanted. It also claimed to have fixed issues when it had not; this is not unique to GPT, and Opus has similar issues, but Claude will not make the same mistake three times in a row. It is unusable at the moment, while Claude lets me get real work done on a daily basis. Until then...

show 1 reply
jdw64 yesterday at 6:10 PM

GPT is really great, but I wish the GPT desktop app supported MCP as well.

You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.

show 1 reply
niklasd today at 6:59 AM

Just burned through my 5 hour window in Codex (Business plan) in 10 minutes with GPT-5.5. Was excited to use it, but I guess I have to wait 5 hours now (it's not yet available in the API, so I can't switch there).

neuroelectron today at 11:20 AM

Are they using RTX 5090s now?

Rapzid yesterday at 7:52 PM

In Copilot, where it's easy to switch models, Opus 4.6 was still providing, IMHO, better stock results than GPT-5.4.

Particularly in areas outside straight coding tasks: analysis, planning, etc. Better and more thorough output, and better use of formatting options (tables, diagrams, etc.).

I'm hoping to see improvements in this area with 5.5.

thimabi yesterday at 6:26 PM

Will we also see a GPT-5.5-Codex version of this model? Or will the same version of it be served both in the web app and in Codex?

show 1 reply
jumploops yesterday at 6:17 PM

> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.

This might be great if it translates to agentic engineering and not just benchmarks.

It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not fewer.

Maybe more interesting is that they’ve used Codex to improve model inference latency. IIRC this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.

show 2 replies
cscheid yesterday at 7:44 PM

I know this is irrelevant in the grand scheme of things, but that WebGL animation is really quite wrong. That is extra funny given the "ensure it has realistic orbital mechanics" phrase in the prompt.

I prescribe 20 hours of KSP to everyone involved, that'll set them right.

RayVR today at 11:13 AM

My first experience with 5.5 via ChatGPT was immensely disappointing. It was a massive reduction in quality compared to 5.4, which already had issues.

gcanyon yesterday at 10:46 PM

Once upon a time humans had to memorize log tables.

Once upon a time humans had to manually advance the spark ignition as their car's engine revved faster.

Once upon a time humans had to know the architecture of a CPU to code for it.

History is full of instances of humans meeting technology where it was, accommodating its limitations. We are approaching a point where machines accommodate our limitations -- it's not a point, really, but a spectrum that we've been on.

It's going to be a bumpy ride.

show 1 reply
maxdo yesterday at 11:09 PM

With such huge progress from OpenAI and Anthropic, how can Chinese open-source providers even think of making comparable money? I have a few friends in China, and they all use Claude. Training a model costs the same, but I'd imagine the revenue from an open-source model is 1000x less, and the money flowing to them from outside China is abysmal.
