It’s also just a useful exercise in general, especially for getting feedback for models and harnesses.
I’ve been thinking about setting up a non trivial project to use as a benchmark for any plugins and/or harness changes I make.
Having a prebuilt verification suite is great. You can use it to asses things like token usage, time, across different harnesses, models, plugins.