jameswhitford • today at 7:46 AM • 3 replies

Hi, I am the author, and I completely agree! I set out to run a vibe test on this one, not a benchmark; the real benchmarks are listed. My test shows what both models can do when tasked with a long-running, technically difficult, one-shot task.

I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test, and I will definitely try it out.

Replies

wongarsu • today at 8:05 AM

Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard.

ramraj07 • today at 5:29 PM

The important point is that your benchmark is pretty much irrelevant to actual usage. Thus, whatever conclusion you draw is not just irrelevant but misleading.

meander_water • today at 8:01 AM

Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt.

Appreciate you sharing the results of your tests though!
