Oh this seems bad, and is fairly easy to reproduce using codex cli. You give it a puzzle prompt that it has to reason about and solve, occasionally it will seemingly short circuit and think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens it returns the correct result.
Maybe some issue with adaptive thinking? Another point for local models I guess, don't have to worry about silent server side changes.
Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4/10 had this 516 thinking token issue, and every one of these had the wrong solution. So nearly half the time, 5.5 xhigh could be short circuiting and degrading performance. Granted the sample size is small.
This is preliminary, but it seems like it might somehow be related to the `## Intermediary updates` system prompt that's provided to the model. Seems like it forces the model to stop thinking and return early to provide updates. Removing that entirely makes all runs succeed [1].
I wonder if it's somehow getting confused between what's supposed to be an intermediate update vs the final result.
[1] https://github.com/openai/codex/issues/30364#issuecomment-48...
You still have to worry about misconfigured local models. Even the professionals get it wrong, which is why local model performance is uneven across providers.
I've made a passive workaround: a pair of Codex CLI hooks that detect the truncation from the local session transcript and warn — in the TUI at turn end, and via a message injected into the model's context on your next message.
I wonder if testing during different time/days show patterns? For example, whether the short circuiting happens more often during workday peak hours.
And people pay for those wasted tokens? If that's the case, it is probably good idea to ask for refunds.
I have a philosophical problem with adaptive thinking. It’s a dumb guess for how much thinking budget to allocate ahead of thinking. At least in the context of LLMs there is probably no way of knowing how much thinking (token generation) is needed. The problem space is infinity vast, similarly of two prompts is not going to help any LLM decide how much thinning is needed. Models already stop thinking before hitting the thinking budget.
Why there is so much effort in making adaptive thinking happen and don’t we train models to produce the end of thinning token better?
Feels like a bandaid. We need models to be trained to do a reasonable amount of reasoning (no pub intended):