nl • today at 8:55 AM • 2 replies

I ran my custom agentic SQL debugging benchmark against it and I'm impressed.

Results: 8 passed, 0 failed, 17 errored out of 25

That puts it right between Qwen3.5-4B (7/25) and Nanbeige4.1-3B (9/25), for example, but it took only 200 seconds for the whole test. Qwen3.5 took 976 seconds and Nanbeige over 2000 (although both of those ran on my 1070, so not quite the same hardware).

Granite 7B 4bit does the test in 199 seconds but only gets 4/25 correct.
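
For a rough side-by-side, the figures quoted above can be turned into pass rates and seconds per question. This is only a sketch: the label `this model` is a placeholder (the model is not named in this comment), the helper `summarize` is made up for illustration, and Nanbeige's time is left unknown because only "over 2000" seconds is given.

```python
# Figures copied from the comment above; labels and structure are illustrative.
TOTAL_QUESTIONS = 25

# name -> (questions passed, wall-clock seconds or None if no exact figure given)
runs = {
    "this model": (8, 200),        # placeholder label, model not named here
    "Qwen3.5-4B": (7, 976),        # ran on a 1070, so not the same hardware
    "Nanbeige4.1-3B": (9, None),   # "over 2000" seconds, exact value not given
    "Granite 7B 4bit": (4, 199),
}


def summarize(passed, seconds, total=TOTAL_QUESTIONS):
    """Return (pass rate, seconds per question or None when time is unknown)."""
    rate = passed / total
    per_question = None if seconds is None else seconds / total
    return rate, per_question


for name, (passed, seconds) in runs.items():
    rate, per_q = summarize(passed, seconds)
    timing = "n/a" if per_q is None else f"{per_q:.1f} s/question"
    print(f"{name:16s} {passed}/{TOTAL_QUESTIONS} ({rate:.0%})  {timing}")
```

Keep in mind the hardware caveat: the timings for the two models run on the 1070 are not directly comparable with the runpod numbers.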

See https://sql-benchmark.nicklothian.com/#all-data (click on the cells for the trace of each question)

Errors are bad tool calls (as opposed to failures, which are incorrect SQL).
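
To make the error/failure split concrete, here is a minimal sketch of how a harness might bucket each question. The benchmark's real code is not shown in this thread, so `Outcome`, `classify`, `tally`, and the per-question flags are all hypothetical; only the 8/0/17 split comes from the results above.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    FAILED = "failed"    # tool calls were valid, but the SQL was incorrect
    ERRORED = "errored"  # the model made a bad tool call


def classify(tool_calls_valid: bool, sql_correct: bool) -> Outcome:
    """Bucket one question: a bad tool call counts as an error, not a failure."""
    if not tool_calls_valid:
        return Outcome.ERRORED
    return Outcome.PASSED if sql_correct else Outcome.FAILED


def tally(results):
    """Count outcomes over an iterable of (tool_calls_valid, sql_correct) pairs."""
    counts = Counter(classify(valid, correct) for valid, correct in results)
    return {outcome.value: counts.get(outcome, 0) for outcome in Outcome}


# Hypothetical per-question flags that reproduce the reported split.
results = [(True, True)] * 8 + [(False, False)] * 17
print(tally(results))  # {'passed': 8, 'failed': 0, 'errored': 17}
```

Read this way, 0 failures means that whenever the tool calls were valid, the SQL it produced was correct.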

I used @freakynit's runpod (thanks!) [1]

[1] https://news.ycombinator.com/item?id=47597268

Replies

Imustaskforhelp • today at 9:30 AM

I have been using @freakynit's runpod as well. My own custom test is making working pomodoro apps, and although it's not good at that (none of the prototypes work), I feel like it can be good within a specific context like SQL, as you mention.

I imagine this being used as sub-agents with some SOTA models directing them, but I wasn't really able to replicate that personally (I asked Claude to create a detailed plan for a pomodoro app and then passed it to Bonsai).

I also tried its writing skills and they are actually kind of decent. I also found that this model uses comparatively few em-dashes. Its fine-tunes are going to be some really amazing things to come out. I hope someone makes a fine-tune for website/Tampermonkey extensions ;)

I remember using chatgpt-3 with svelte/sveltekit to make a green button turn into a blue button and have the text inside the buttons change, and that was my personal wow moment from gpt-3. (This model wasn't really able to replicate it accurately, even in plain JS.) So maybe the current model isn't good at writing HTML, but the possibilities of custom-training these models, and the idea of a 1-bit model, feel really great to me.

Especially combined with the idea of Ngram-embedding [0] (Meituanlongcat/LongCatFlashLite). I imagine a 1-bit model plus Ngram-embedding and I feel it could open up endless possibilities.

[0]: https://news.ycombinator.com/item?id=46803687 (I had submitted this but it seems to have had no attention during that time)

Maybe a 1-bit model like this and diffusion models for coding might also go hand in hand; there are many experiments that can be done with this! (Also yes, many thanks to @freakynit for running the runpod. I learnt a lot about this model in particular because of his runpod.)

TLDR: I feel like this model is good at writing, or at least better at it than usual, and it can be good for general-purpose questions by default, but I feel like it's not good at making HTML, which is fair. Good to see that it is good at SQL, but I'm not sure how it would do on normal coding tasks. Either way, it's an extremely fun model to play with!

(Edit: After some more tries, I was able to get one prototype working after Gemini held its hand by giving it the code/errors. It's not the best at this, but it works, just barely: https://gist.github.com/SerJaimeLannister/e90e8a134e4163f205...)
