Hacker News

Ornith-1.0: self-improving open-source models for agentic coding

254 points by danboarder yesterday at 5:16 PM | 48 comments

Comments

CharlesW yesterday at 6:24 PM

Previously: https://news.ycombinator.com/item?id=48709744

https://swelljoe.com/post/will-it-mythos/: "Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size. […] It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive."

ricardobayes yesterday at 8:05 PM

This is the first Qwen fine-tune that has not been immediately rejected by the local LLM community, and in some cases it is even being recommended. Based on my limited usage, it is good and gives creative solutions to coding problems. I don't expect 9-35B models to one-click create full apps. Most people who were complaining did so .

lhl today at 12:59 PM

I've been testing Ornith-1.0 35B (my own FP8-block quant) and I like it. It runs at >200 tok/s w/ vLLM on an RTX PRO 6000 (sm120), and I've run >140M cached tokens of agentic coding work on it over the past few days. It seems to land somewhere between Qwen 3.6 35B-A3B and 27B, but the good thing is that it overthinks/doom-loops a lot less than Qwen 3.6. Looking at the thinking traces, I like its breakdown approach template.

It does a good job on basic analysis, tasks, and some front-end/backend changes on a medium-sized Go codebase, but it reached its limits when it totally botched a longer (simple) kernel implementation job (about 100 iterations in the Pi Agent harness). This is the type of thing that stronger open models (Kimi K2.6, GLM 5.2) are able to do.

Narew today at 5:12 AM

From what I have personally tested, Ornith-1.0 35B is slightly better than Qwen 3.6 35B. My tests are tasks that consist of adding or modifying features in a big C++ codebase. The part that I find interesting is that the model is way faster than Qwen 3.6 35B. It seems Ornith produces a shorter chain of thought. In my tests it can be 3 times faster to produce the answer.

I use it via llama.cpp and codex-cli.

kennywinker yesterday at 6:23 PM

Can anyone explain what’s the story here? Is this just a re-skinned qwen? Who is deepreinforce-ai and why isn’t this model listed on their website?

How does it self-improve? Does the model change on disk, or does it just get better during a single context run?

GenseeAI today at 4:00 PM

Self-improving systems are exciting, but they also make provenance and governance much harder. Once agents can modify their own behavior over time, understanding why an agent behaved a certain way becomes increasingly important.

S0y yesterday at 7:53 PM

These are simply benchmaxxed versions of either Qwen or Gemma 4.

fareesh today at 11:21 AM

I've used a lot of local models and all of them felt like toys. This one actually felt useful. I hear Qwen 36-A3B is also good, but I have yet to try that one.

giancarlostoro yesterday at 11:16 PM

> the dense 9B fits on a single 80GB GPU

Us mere mortals cannot use this.

smcleod today at 12:47 PM

Weird they talk about their 31B dense model but haven't actually released it anywhere.

v3ss0n yesterday at 9:13 PM

Self-improving bullshit. It is just a benchmaxxed Qwen 3.5 finetune. Nothing spectacular. It even fails at benchmarks. Long-session tool calls suck, and it hallucinates a lot there too. Just use Qwen 3.6 and 3.5 122b.

RandyOrion today at 4:45 AM

Glad to see more open models. However, where are the 31b models?

anana_ yesterday at 8:12 PM

They keep mentioning a 31B dense model, but there are no benchmarks or weights for it anywhere?

agenticup today at 6:46 AM

Could Ornith's self-scaffolding learn to scaffold the RLM loop?

seanxx today at 11:08 AM

[flagged]

modgate today at 6:09 AM

[flagged]

fratefritto yesterday at 8:54 PM

[flagged]

1105714 today at 3:50 PM

[flagged]

jkwang today at 8:05 AM

[flagged]