See also https://marginlab.ai/trackers/claude-code-historical-perform... for a more conventional approach to tracking regressions.
This project takes a somewhat unconventional approach, which may reveal issues that typical benchmark datasets mask.