> Codex uses apply_patch: it takes a string as input, essentially an OpenAI-flavored diff, and instead of relying on a structured schema, the harness just expects this blob to follow a strict set of rules. Since the OpenAI folks are without a doubt smart, I'm sure the token selection process is biased to fit this structure at the LLM gateway for the Codex variants of GPT, similar to how other constraints like JSON schemas or required tool calls work.
Codex does in fact use a schema for constrained sampling; it's here: https://github.com/openai/codex/blob/main/codex-rs/core/src/...
The model still has to produce an exact match against the file contents, or at least I didn't read the code closely enough to see whether any fuzzy matching is used.
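For anyone who hasn't seen it, the blob the quoted comment describes looks roughly like this (an illustrative sketch of the apply_patch format from the Codex repo; the file path and lines are made up, and the exact rules live in the linked source):

```
*** Begin Patch
*** Update File: src/example.py
@@ def greet():
-    print("helo")
+    print("hello")
*** End Patch
```

The `-`/`+` and surrounding context lines are what have to line up exactly with the file on disk for the patch to apply.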
Note that the two Codex models were the only ones that did worse with the author's proposed format. The author found them doing better with replace than with apply_patch, but since the author appears to be unaware that these models use a schema for constrained sampling, I think a more realistic benchmark would enable constrained sampling for the apply_patch test.