onion2k • yesterday at 5:34 PM • 8 replies

None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.

The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.

Replies

janalsncm • yesterday at 6:20 PM

> Being able to nail a zero-shot greenfield project is relatively easy even for a small model

Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.

Aurornis • yesterday at 7:34 PM

> and it can fall back to similar examples in the training data easily.

This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.

My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.

Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.

If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.

The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read about it online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
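
To make the "lossless" claim concrete, here is a minimal sketch of a quantize-then-dequantize round trip in plain Python. It is a simplified stand-in, not the actual q8_0 or 4-bit formats used by any real inference runtime: the block size of 32, the absmax scaling, the Gaussian test data, and the function names are all assumptions chosen for illustration. It only shows that the round trip loses information at both widths and loses more at 4 bits; it does not measure long-context quality, which is the empirical point made above.

```python
import random


def quantize_roundtrip(values: list[float], bits: int, block_size: int = 32) -> list[float]:
    """Quantize each block to signed `bits`-bit integers (absmax scaling), then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    restored: list[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # One scale per block; fall back to 1.0 for an all-zero block.
        scale = absmax / qmax if absmax > 0 else 1.0
        restored += [round(v / scale) * scale for v in block]
    return restored


def rms_error(original: list[float], restored: list[float]) -> float:
    """Root-mean-square difference between the original and round-tripped values."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return (total / len(original)) ** 0.5


# Hypothetical stand-in for model weights or KV cache entries.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

err8 = rms_error(weights, quantize_roundtrip(weights, bits=8))
err4 = rms_error(weights, quantize_roundtrip(weights, bits=4))
print(f"8-bit RMS error: {err8:.5f}")
print(f"4-bit RMS error: {err4:.5f}")
```

Neither error is zero, so "lossless" can only mean "small enough not to matter for the task", and that is exactly what has to be checked on long-context work rather than assumed.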

Zambyte • yesterday at 7:22 PM

I have been using pi (and previously the codex cli) with Qwen 3.6 27b with 100k context for my development at work, and I have been very blown away by how well it works. It's not perfect, but it's enough to accelerate my normal development flow. I mostly use it for writing Go and C#.

sosodev • yesterday at 6:02 PM

In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There are just too many decisions to be made and they're not good at that.

Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
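
A minimal sketch of that kind of scoped prompt, assuming you paste the relevant files directly into the request yourself. The function name `build_scoped_prompt`, the separator format, and the exact wording are hypothetical, not any particular tool's API.

```python
from pathlib import Path


def build_scoped_prompt(goal: str, guidelines: list[str], files: list[Path]) -> str:
    """Assemble a prompt that states the goal, the rules, and the exact files to edit."""
    parts = [f"The goal is to {goal}.", "", "Follow these guidelines:"]
    parts += [f"- {rule}" for rule in guidelines]
    parts += ["", "Only modify the files below; do not explore the rest of the codebase."]
    for path in files:
        # Inline each file so the model does not have to go looking for context.
        parts += ["", f"--- {path} ---", path.read_text(encoding="utf-8")]
    return "\n".join(parts)
```

Calling it with something like `build_scoped_prompt("add X feature to this code", ["follow Y guidelines"], [Path("src/orders.py")])` (the path is a placeholder) gives a single string to send to the model. The point of the design is that the human picks the files and the rules, so the model only has to follow instructions.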

mark_l_watson • yesterday at 9:33 PM

There are several general types of tasks that a Gemma 4 12B class model works well for in my experience, including: 1) designing a large project composed of small libraries that can be coded and tested in isolation; 2) cleaning up old coding projects: adding README files, commenting code, showing it an example of using a new API and having it update API use, etc.

All small-scale stuff. For large integrated projects I am finding the DeepSeek v4 Pro commercial API to be very inexpensive, and it helps me produce good results.

internet101010 • today at 4:36 AM

Exactly. If the repo has all of the knowledge living inside of it, the context window fills up fast, even when using something like codegraph.

esafak • yesterday at 6:59 PM

I don't use local models but have you tried augmenting the model with code intelligence MCPs like https://github.com/DeusData/codebase-memory-mcp ?

h4ny • yesterday at 5:50 PM

> In my limited experiments Qwen 3.5 (maybe 3.6 is loads better)

1. Maybe you should tell us what those limited experiments are.

2. Maybe you should actually try 3.6 because it's a huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.

3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comment on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
