Yeah of course, I've been thinking about this a lot and I'm updating my beliefs all the ti...

ashertrockman • yesterday at 9:03 PM • 1 reply • view on HN

Yeah of course, I've been thinking about this a lot and I'm updating my beliefs all the time, so it's good to hear some more perspectives

A) I see what you mean. But I'm more so thinking: companies consider their models an asset because they took so much compute and internal R&D effort to train. Consequently, they'll take measures to protect that investment -- and then what do the downstream consequences look like for users and the AI ecosystem more broadly? That is, it's less about what's right and wrong by conventional wisdom, and more about what consequences are downstream of various incentives.

B) I don't really care about AI safety in the traditional sense either, i.e., can you get an LLM to tell you to do some thing that has been ordained to be dangerous. There's lots of attacks and it's basically an insoluble problem until you veer into outright censorship. But now that people are actually using LLMs as agents to _do things_, and interact with the open web, and interact with their personal data and sensitive information, the safety and security concerns make a lot more sense to me. I don't want my agent to read an HN post with a social-engineering-themed prompt injection attack and mail my passwords to someone. (If this sounds absurd, my Clawbot defaulted to storing passwords in a markdown file... which could possibly be on me, but was also the default behavior.)

C) This is a completely fair point, there's amazing work coming out of these smaller labs, and the incentives definitely work out for them to do a distillation step to ship faster and more cheaply. I think the small labs can iterate fast and make big changes in a way that the monolithic companies cannot, and it'd be nice to see that effort routed into creating new data-efficient RL algorithms or something that pick up all the slack that distillation is currently carrying. Which is not to say they're doing none of that, GRPO for example is a fantastic idea.

One way you could have a change in perspective is not just in the architecture/data mix, but in the way you spend test-time compute. The current paradigm is chain-of-thought, and to my knowledge, this is what distillation attacks typically target. So at least, all models end up "reasoning" with the same sort of template, possibly just to interlock with the idea of distilling a frontier API.

D) Interesting to hear. In my research, I find these models to be quite a bit harder to work with, with significantly higher failure rates on simple instruction following. But my work also tends to be on the R&D side, so my usage patterns are likely in the long-tail of queries.

Replies

logicprog • yesterday at 9:50 PM

Thanks for the response!

> it'd be nice to see that effort routed into creating new data-efficient RL algorithms or something that pick up all the slack that distillation is currently carrying

It seems to me like they're already doing that. Some of the most fun I've had actually is reading their papers on the different R.L. environments, especially Egentic ones they set up and the various new algorithms they use to do RL and training in general. Combine that with how much they are innovating with attention mechanisms and I feel like distillation doesn't seem to be really replacing research into these means as just supplementing it — and maybe even making it possible in the first place, because otherwise it would be simply too expensive to get a reasonably intelligent model to experiment with!

> But now that people are actually using LLMs as agents to _do things_, and interact with the open web, and interact with their personal data and sensitive information, the safety and security concerns make a lot more sense to me.

Ah, I see what you mean. Can you point me to any benchmarks or research on how good various models are out of waiting social engineering and prompt injection attacks? That would be extremely interesting to me. Fundamentally, though, I don't think that's really a soluble problem either, and the right approach is to surround an agent with a sufficiently good harness to prevent that. Perhaps with an approach like this:

https://simonwillison.net/2023/Apr/25/dual-llm-pattern/

Or this, which builds on it with more verifiable machinery, if you're less bitter-lesson pilled (like me):

https://simonwillison.net/2025/Apr/11/camel/

> That is, it's less about what's right and wrong by conventional wisdom, and more about what consequences are downstream of various incentives.

Ahhh, I see. Yeah, that could be negative. That's worth thinking about.

alt Hacker News

Replies