Hacker News

quantumleaper, today at 3:27 AM

Do you have benchmarks comparing against Pi? The blog post doesn't include any hard numbers.

For example, so far I haven't seen any evidence that LSP integration improves performance for small models vs using grep via a bash tool.


Replies

yogthos, today at 12:45 PM

I haven't really seen anybody come up with a good test that gives hard numbers for comparing agentic harnesses. It's a bit tricky to set up a definitive test given the non-deterministic nature of LLMs. What I've been focusing on is watching the loop and seeing where the model does things that it shouldn't have to. For example, I notice models writing Python scripts to match parens for Clojure all the time when using editors like Pi. So having a mechanical way to repair parens and, when that fails, giving the model a clear error about where the syntax is broken removes that whole cycle.
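To make the paren-repair idea concrete, here's a rough Python sketch. It's an illustration under my own assumptions, not the code any particular harness uses: `check_delimiters` is a made-up name, and the scanner only knows about strings, `;` comments, and backslash character literals, so it's far from a full Clojure reader. If the only problem is missing closers at the end, it appends them; otherwise it returns an error pointing at the line and column.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def check_delimiters(source: str):
    """Return (repaired_source, error). Exactly one of the two is None.

    Hypothetical helper: a naive scan, not a real Clojure reader.
    """
    stack = []          # open delimiters as (char, line, col)
    in_string = False
    for line_no, line in enumerate(source.splitlines(), start=1):
        escaped = False
        for col, ch in enumerate(line, start=1):
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
                continue
            if escaped:             # character literal such as \( or \)
                escaped = False
                continue
            if ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = True
            elif ch == ";":         # rest of the line is a comment
                break
            elif ch in PAIRS:
                stack.append((ch, line_no, col))
            elif ch in CLOSERS:
                if not stack:
                    return None, f"line {line_no}, col {col}: unexpected '{ch}', nothing is open"
                opener, o_line, o_col = stack.pop()
                if PAIRS[opener] != ch:
                    return None, (
                        f"line {line_no}, col {col}: found '{ch}' but '{opener}' "
                        f"opened at line {o_line}, col {o_col} expects '{PAIRS[opener]}'"
                    )
    if in_string:
        return None, "unterminated string literal"
    # Only missing closers remain, so the repair is mechanical:
    # close everything still open, innermost first.
    missing = "".join(PAIRS[opener] for opener, _, _ in reversed(stack))
    return source + missing, None
```

Something like `check_delimiters("(defn f [x] (+ x 1")` gets repaired silently, while `check_delimiters("(let [x 1) x)")` comes back with an error naming the exact position, which the harness can hand straight to the model instead of letting it write its own paren-matching script.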

As it stands, it's kind of subjective: you just have to try the harness and see if the model seems to behave better than with the other ones you've been using.
