logoalt Hacker News

88j88yesterday at 10:53 PM1 replyview on HN

Something very similar I was experimenting with on, but had different results that you may be interested in, some of my findings were interesting

This was part of testing out how well a tool of mine worked (github.com/jsuppe/loom), which aims to be used to extracts requirements, specs, creates tests. At first I had no intention of using it for code generation but then tried it out with some early success. I tried splitting the work by using the tool with different frontier models, and then providing work to a local ollama instance running one of several models. Not all local models had the same outcome, not all coding languages had the same outcome. I also found in this experiment, when nailing down the coding tasks I wanted to set up positive and negative scenarios- which is where I found setting guardrails can sometimes backfire with inversion- this essentially elaborates on previous work by Khan 2025 (https://arxiv.org/abs/2510.22251); the most interesting finding to me was that if you give guardrails with a rationale, it reduces compliance and may cause the inversion

For coding tasks I found that the improvement was not only ability to use a lower cost model for these broken down tasks, but wall clock time was improved over using frontier model alone, with equivalent outcomes.


Replies

zambelliyesterday at 11:18 PM

I've had a few reversions as well along the way, including in upcoming v0.7.0 patch. Some models benefitted, others regressed - overall better on harder scenarios or I wouldn't be releasing, but yeah - not intuitive.

The biggest challenge has been balancing the desire to hyper optimize for my favorite models, versus average behavior, versus consumer needs.