This seems to be an LLM-written article, and the tooling around the model is undeniably influenced by knowledge of the tests.
In any case, GPT-3.5 isn't a good benchmark for most serious uses and was widely considered pretty weak, though I understand that isn't the point of the article.
really appreciate you reading the article. the benchmark data, grading, and error classes were all done by hand, though. the ~8.0 score is the raw model with zero tooling, and the guardrail projections are documented separately. and yeah, gpt-3.5 isn't the gold standard anymore, we're on the same page there. we just wanted to show that the quality people are still paying for can be free, private, and customized to whatever you need. thanks again for taking the time to check it out.