That is why we have SWE bench pro, they test architecture design too, turns out 1000 dollars of tokens outperform 10k dollars of labor in meta design.
That's just not accurate. I haven't studied SWE Bench Pro in detail, so I can't tell you exactly what the flaw is, but SOTA models routinely make bad architectural choices I have to intervene to fix.
1000 dollars of subsidized tokens.
That's just not accurate. I haven't studied SWE Bench Pro in detail, so I can't tell you exactly what the flaw is, but SOTA models routinely make bad architectural choices I have to intervene to fix.