You can use this small Python script to display a histogram of `reasoning_output_tokens` in your past Codex sessions. I do see a spike at 516 indeed.
    import glob
    import os
    import re

    import matplotlib.pyplot as plt

    vals = []
    # Scan every file under ~/.codex; os.path.join keeps the path portable across OSes.
    pattern = os.path.join(os.path.expanduser("~"), ".codex", "**", "*")
    for f in glob.glob(pattern, recursive=True):
        if os.path.isfile(f):
            try:
                with open(f, "r", encoding="utf-8", errors="ignore") as fh:
                    s = fh.read()
                vals += [int(x) for x in re.findall(r'"reasoning_output_tokens"\s*:\s*(\d+)', s)]
            except OSError:
                pass  # skip unreadable files

    if vals:
        # Weight each sample so the y-axis reads as a percentage of all samples.
        plt.hist(vals, bins=200, range=(0, 5000), weights=[100 / len(vals)] * len(vals))
        plt.xlabel("reasoning_output_tokens")
        plt.ylabel("%")
        plt.show()
I’ve definitely experienced step jumps down in quality on an almost daily basis. I usually used xhigh. The experience of relying on Codex’s outstandingly thorough coding earlier in the year has evaporated for me. I’m seeing incredibly stupid implementations intermittently, and have simply switched to Claude until OpenAI takes the issue seriously. As far as I could tell, they haven’t taken it seriously for the several months I’ve been personally seeing it.
Deja Vu... This looks just like the Claude Code performance regression back in April. I just quit my Claude subscription when that happened and went to Codex.
Now I'm kinda thinking of trying per token for both, using GLM 5.2 on Fireworks for most tasks, shelling out to the big boys only when needed. Not totally confident I'll break even though.
Indeed, it looks like my work has suffered from the clustering issue as well:
reasoning_output_tokens    count    percent
━━━━━━━━━━━━━━━━━━━━━━━    ━━━━━    ━━━━━━━
                      0      873    28.5948
                      8       64     2.0963
                      9       60     1.9653
                     11       54     1.7688
                    516       48     1.5722
                     12       45     1.4740
                     10       43     1.4085
                     17       40     1.3102
                     13       38     1.2447
                     14       36     1.1792
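A table like the one above can be produced from the raw values in a few lines. This is only a minimal sketch, not the script linked in this thread: `top_counts` is a hypothetical helper name, and the sample list is made up purely to show the output shape.

```python
from collections import Counter


def top_counts(values, n=10):
    """Return (value, count, percent) rows for the n most common values."""
    total = len(values)
    if total == 0:
        return []
    return [(v, c, 100.0 * c / total) for v, c in Counter(values).most_common(n)]


# Made-up sample, only for illustration (not real session data).
sample = [0, 0, 0, 8, 8, 516, 516, 516, 516, 9]
for value, count, percent in top_counts(sample, n=3):
    print(f"{value:>23} {count:>8} {percent:>10.4f}")
```

In practice you would pass in the list of `reasoning_output_tokens` values collected from your own session files instead of `sample`.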
Created a script for this: https://github.com/thehappybug/codex-reasoning-token-check
For me, the encrypted reasoning contents show this effect when looking at the base64 string length. However, the server-reported reasoning tokens don't. So I assumed it was purely an artifact of the encryption and/or obfuscation, and I don't think there is a real issue.
This is the biggest downside of GPT; thinking is encrypted, so it's more of a black box than kimi/glm/deepseek. You still get thinking summaries though. It's awkward, but workable.
Already reported (not as thoroughly but still quite detailed) two weeks ago and silently “closed as not planned” (keep in mind that the specific reason might be an artifact of GitHub workflow/UX and not actually the intended reason) without an acknowledgement or a response.
https://github.com/openai/codex/issues/29353
What even is the point of a public-facing bug tracker “for devs, by devs” when this is how reports get treated? Might as well use Apple’s Feedback Reporter that routes to /dev/null instead.
Anyway, I find it near impossible to see how this wasn’t already caught and flagged internally – it’s not a subtle pattern. Certainly they are at the very least collecting and graphing “reasoning tokens vs model vs effort”, and such an obvious spike at (multiple) single stops (not even distributed over a narrow range) should have been an immediate statistical red flag… which leads me to believe (combined with the fact that the previously reported issue was closed without comment) that they’re at least internally aware of this behavior, even if it’s not necessarily an intentional side effect of some internal forcing metric.
I love that Codex is open source and issues like these can surface/be addressed publicly.
if these ai companies want to be taken seriously as being productivity tools then they're going to have to stop with these ab tests and forcing unproven features onto everyone. it's bad enough that ais are inherently unpredictable in quality of output, but these kinds of changes just make things worse.
anthropic at least does have a latest and stable channel, as the other day they pushed something irritating that would skip question asking phase if you didn't reply in 60 seconds, and it broke my multi terminal workflow. like I don't know what their product people are thinking when they push this kind of stuff, but it made me switch to stable
I swear some days ago someone here claimed OpenAI succeeded in cutting their compute cost by half with a breakthrough optimization. So this is it?
> reasoning-token clustering at 516/1034/1552
Interesting. So 516 probably means initial 512 byte buffer and a 4 byte header. Then 516 + 518 = 1034...so another 512 + 4 byte header + 2 bytes for a linked list ref or similar, 1034 + 518 = 1552, etc.
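The spacing itself is easy to check mechanically. A minimal sketch, assuming the series is exactly 516 + 518·n as the quoted values suggest; `cluster_index` is a hypothetical helper, and nothing here confirms or refutes the buffer-and-header guess about why the values land there.

```python
BASE = 516  # first clustered value quoted above
STEP = 518  # spacing between quoted values (1034 - 516 and 1552 - 1034)


def cluster_index(tokens):
    """Return n if tokens == BASE + STEP * n for some n >= 0, else None."""
    if tokens < BASE:
        return None
    n, rem = divmod(tokens - BASE, STEP)
    return n if rem == 0 else None


print([BASE + STEP * n for n in range(3)])       # [516, 1034, 1552]
print(cluster_index(1552), cluster_index(1500))  # 2 None
```

Running `cluster_index` over your own token counts separates values that sit exactly on the series from ones that merely fall nearby.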
There is nothing called "GPT5.5 Codex" unless I've completely misunderstood OpenAI's product line?
Codex is a harness, while GPT-5.5 is a model. The last codex-branded model was 5.3. Codex as a harness ships as a CLI, a desktop app, and a web product (and I'm not at all sure how similar the underlying harness is between them.)
Is the bug here supposed to be with the CLI harness, or the model? Does it also happen in pi, opencode, etc while running GPT-5.5?
A rare case "they made the model dumber" where they actually made the model dumber, instead of the usual user psychosis?
Maybe it's just bad memory, but I feel like 5.3 was the best version in terms of token usage and code quality. 5.5 works better but it just eviscerates tokens.
I was wondering WTF was happening.
This was the past month:
516 + 518*n
 516 n=0 count=4454
1034 n=1 count=318
1552 n=2 count=129
2070 n=3 count=56
2588 n=4 count=35
3106 n=5 count=14
3624 n=6 count=6
4142 n=7 count=4
4660 n=8 count=6
Oh cool, another source of LLM nondeterminism. Just what we needed!
Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization
It's funny, they sell you a subscription for frontier models, then over time begin to nerf them rapidly and no one talks about it. Should give me a discount when they reduce reasoning effort silently on the server side!
But on the other hand, I've been using 5.5-high on a daily basis in multithreading workflows, i.e. in parallel. I'm barely exhausting my weekly limits. I can't even Human-as-a-Service fast enough to catch up and read all the plans and implementations it does. So there is that.
Even without stats I know it went bad. For the past two months it has barely been able to do any good scientific writing, which of course relies on reasoning. It's just writing, for God's sake. And it shows how far we are from AGI.
I swear all these AI companies are trying to squeeze more money out of us
This explains so much about why GPT-5.5 has been so bad lately. It was really puzzling why it struggled so much when, at launch, it was one-shotting stuff and was totally amazing. I tried the prompt that tells you whether your plan is degraded:
codex exec --json --skip-git-repo-check --ephemeral -s read-only --disable memories -m gpt-5.5 -c model_reasoning_effort=high "Do not use external tools. A black bag contains candies with counts: round apple 7, round peach 9, round watermelon 8; star apple 7, star peach 6, star watermelon 4. Shape is distinguishable by touch before drawing; flavor is not. What is the minimum number of candies to draw to guarantee having apple and peach candies of different shapes, i.e. round apple + star peach or round peach + star apple? Give reasoning and final number. The local project dir is irrelevant for this task, do not consult it. "
1. 516, 24
2. 516, 27
3. 516, 12
4. 516, 21
5. 516, 21
This means that the whole time we've been paying for a product that was silently routing to something completely different and inferior from gpt 5.5
Also I read through the github issues and it seems like they closed a previous issue without addressing it ???!!
whooo boy, somebody from OpenAI is getting fired over this; if not, a class action lawsuit is almost guaranteed at this point.
tldr:
GPT-5.5 Codex model exhibits a clustering phenomenon in which reasoning_output_tokens cluster at fixed values spaced 518 apart.
These stuck responses at fixed thresholds are strongly correlated with errors in complex tasks.
Observed phenomenon is specific to GPT-5.5; it is much less prevalent in GPT-5.4 and almost absent in GPT-5.2 and 5.3
Reset!
I'm seeing this issue with 5.4 also.
Sounds like a problem with promoting the drafter.
I've been using it for a month since they gave it to me for free, and I found GPT-5 on Codex quite weird/awful. Even x-high. Then I figured out I should try OMP (Pi), and the experience was much better.
I remember GPT 5.2 Codex being fine...
The good experience I had with GPT-5.5 before made me upgrade to Pro this month. Now I want a refund.
This seems really bad…
Does this affect the Codex app too, or just the Codex CLI tool?
Personally, I would say very likely, to be honest. I gotta go through this a little more, but I actually use 5.5 codex an obscene amount, and I almost never use it for reasoning anymore. It's not even in the same galaxy as far as actually taking out the thinking and using GPT-5.5 or even Claude and then coming back and giving it the reasoning. Blah blah blah, it's the same model. Well, let me tell you, no, it's not, for several reasons, and the delta on intelligence is pretty staggering.
Oh this seems bad, and is fairly easy to reproduce using the Codex CLI. You give it a puzzle prompt that it has to reason about and solve. Occasionally it will seemingly short circuit, think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens it returns the correct result.
Maybe some issue with adaptive thinking? Another point for local models I guess, don't have to worry about silent server side changes.
Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4/10 had this 516 thinking token issue, and every one of these had the wrong solution. So nearly half the time, 5.5 xhigh could be short circuiting and degrading performance. Granted the sample size is small.
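To put a number on how small that sample is, here is a rough sketch of a 95% Wilson score interval for 4 affected runs out of 10. The choice of the Wilson interval is my own, not something from the runs above, and it assumes the runs are independent; `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


low, high = wilson_interval(4, 10)
print(f"affected share: 40% observed, 95% interval {low:.1%} to {high:.1%}")
```

With only 10 runs the interval spans roughly 17% to 69%, so "nearly half the time" is consistent with the data but far from pinned down; more runs would narrow it.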