The harness point is probably the most important part here. With terminal agents, it often feels like the benchmark is measuring the whole loop: model, prompt, tool interface, retry policy, timeout handling, and recovery from bad shell commands.
Do you have a sense of which part contributed most to the jump?