sally_glance • today at 12:54 AM

This is the hard part - especially with larger initiatives, it takes quite a bit of work to evaluate what the current combination of harness + LLM is good at. Running experiments yourself is cumbersome and expensive, and public benchmarks are flawed. I wish providers would release at least a set of blessed example trajectories alongside new models.

As it is, we're stuck with "yeah it seems this works well for bootstrapping a Next.js UI"...
