Hacker News

Show HN: Steerling-8B, a language model that can explain any token it generates

307 points by adebayoj | today at 12:38 AM | 87 comments

Comments

crimsonnoodle58 | today at 12:12 PM

So maybe one day we'll see coding agents like Claude Code create and update an ATTRIBUTION.md, citing all the open source projects and their licenses used to generate code in your project?

msteffen | today at 2:53 PM

In the recent HN thread announcing the new Gemini coding agent (https://news.ycombinator.com/item?id=47074735), a lot of people complained about Gemini’s tendency to do unwanted refactors, not perform requested actions, etc.

It made me cautiously optimistic that all of Anthropic's work on alignment, which they did for AI safety, is actually the cause of Claude Code's comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I'm suddenly really interested in experiments like this.

ottah | today at 5:17 AM

It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.

gorme | today at 5:09 AM

Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.

killerstorm | today at 5:18 PM

This seems too coarse-grained to be useful: all science-y content will be "analytical" and associated with sources like arXiv.

But there might be bad, malicious articles on arXiv, so it doesn't really say anything about veracity.

Perhaps this might help detect some problems like prompt injection, but then it would be more interesting to see those examples.

kamranjon | today at 3:02 PM

I'm really interested in using this, but I wonder if the unique architecture means it won't be able to be converted to a GGUF and used by Ollama or llama.cpp. I certainly understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local AI server (basically just Ollama + Tailscale) and see how it works as a regular model.

pu_pe | today at 12:24 PM

Looks neat and original, congrats!

I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to arXiv.

Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
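One plausible reading of the first interpretation (purely an illustration with made-up numbers, not Steerling's actual mechanism): per-training-example influence scores get aggregated by source corpus and normalized into percentages, something like:

```python
from collections import defaultdict

def source_attribution(example_scores):
    # Aggregate hypothetical per-training-example influence scores
    # by source corpus, then normalize to percentages.
    totals = defaultdict(float)
    for source, score in example_scores:
        totals[source] += score
    grand = sum(totals.values())
    return {s: round(100 * v / grand, 1) for s, v in totals.items()}

# Made-up influence scores for one generated sentence
scores = [("wikipedia", 0.10), ("wikipedia", 0.14),
          ("arxiv", 0.23), ("common_crawl", 0.53)]
print(source_attribution(scores))
```

Under that reading, "24% Wikipedia" would mean Wikipedia examples account for 24% of the total measured influence on the sentence, not that the sentence was interpolated from specific passages.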

deepdarkforest | today at 11:58 AM

Just wanted to say I think most interpretability research is just a smoke show nowadays, but this is actually the first one that I think has very serious potential. I love that the SAE is actually constrained and not just slapped on unsupervised post hoc.

How granular can you get the source data attribution? Down to, say, individual Wikipedia topics? Probably not URLs?

Would be interested to see this scale to 30B/70B.

brendanashworth | today at 3:25 AM

Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.

[1] https://shap.readthedocs.io/en/latest/
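For intuition about what SHAP estimates, here is an exact Shapley-value calculation on a toy token-scoring function. This is a from-scratch illustration of the principle only, not the shap library's API; the real library uses efficient approximations rather than brute-force enumeration over orderings:

```python
from itertools import permutations

def shapley_values(f, features):
    # Exact Shapley values: average each feature's marginal
    # contribution over all orderings of the features.
    phi = {x: 0.0 for x in features}
    perms = list(permutations(features))
    for order in perms:
        present = set()
        prev = f(present)
        for x in order:
            present.add(x)
            cur = f(present)
            phi[x] += cur - prev
            prev = cur
    return {x: v / len(perms) for x, v in phi.items()}

# Toy "model": sentiment score of a token set, with an interaction term
def score(tokens):
    s = 0.0
    if "not" in tokens: s -= 1.0
    if "good" in tokens: s += 2.0
    if "not" in tokens and "good" in tokens: s -= 3.0  # "not good"
    return s

print(shapley_values(score, ["not", "good", "movie"]))
```

The interaction term is what makes attribution non-trivial here: neither token alone carries the sentiment of the pair, and the Shapley values split that interaction between them.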

pbmango | today at 3:24 AM

This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it to either be solved, or to be too out of reach to bother stopping and thinking about.

rippeltippel | today at 1:03 PM

Also featured on TechCrunch: https://news.ycombinator.com/item?id=47129292

andy12_ | today at 9:53 AM

This seems really interesting. While Anthropic tried to use dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).

great_psy | today at 3:49 AM

Maybe I’m not creative enough to see the potential, but what value does this bring?

Given the example I saw about CRISPR, what does this model give over a different, non-explaining model in the output? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?

I find that LLM outputs are subtly wrong, not obviously wrong.

potato-peeler | today at 9:43 AM

Looks very interesting. Is there a published paper/article on your algorithm? I would like to take a stab at implementing this on my own.

I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)

[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...

schopra909 | today at 3:19 PM

This is very cool. Side note: I really dig the JavaScript animations on the causal block diffusion blog post. They made the concept immediately clear.

in-silico | today at 6:30 AM

Either I'm missing something or this is way overstated.

Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.

They also seem to use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and it could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.

1: https://thezvi.substack.com/p/the-most-forbidden-technique
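For context on what that bottleneck looks like mechanically, here is a toy sketch of an SAE layer sitting between a final hidden state and the LM head. Everything here (dimensions, TopK sparsity, random weights) is an assumption for illustration, not Steerling's actual architecture:

```python
import random

def sae_forward(h, W_enc, b_enc, W_dec, k=4):
    # Encode the hidden state into concept pre-activations.
    pre = [sum(w * x for w, x in zip(row, h)) + b
           for row, b in zip(W_enc, b_enc)]
    acts = [max(0.0, p) for p in pre]  # ReLU
    # TopK sparsity: keep only the k largest activations.
    top = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    sparse = [a if i in top else 0.0 for i, a in enumerate(acts)]
    # Decode back to model dimension; this reconstruction is what
    # the LM head would consume instead of the raw hidden state.
    recon = [sum(sparse[j] * W_dec[j][d] for j in range(len(sparse)))
             for d in range(len(h))]
    return sparse, recon

random.seed(0)
d_model, n_concepts = 8, 32
h = [random.gauss(0, 1) for _ in range(d_model)]
W_enc = [[random.gauss(0, 0.2) for _ in range(d_model)] for _ in range(n_concepts)]
b_enc = [0.0] * n_concepts
W_dec = [[random.gauss(0, 0.2) for _ in range(d_model)] for _ in range(n_concepts)]
sparse, recon = sae_forward(h, W_enc, b_enc, W_dec, k=4)
print(sum(1 for a in sparse if a > 0.0))  # active concepts, at most k
```

The concern in [1] is about the training signal: if a concept-alignment loss is applied to these activations, the labels shape what the dictionary reports without guaranteeing those concepts drive the decode path.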

MagicMoonlight | today at 1:03 PM

Seems pretty cool. You can simply block the concept of Tiananmen Square and it will be permanently removed from the brain. Ideal.

exabrial | today at 4:19 PM

Hilariously, I read this as "can't explain" for a second and was like, "Wait, isn't that what today's models do?"

whinvik | today at 12:09 PM

Looks very interesting. Can you comment on why you think this model can give comparable performance with less training data?

7777777phil | today at 8:21 AM

If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.

ZeroAurora | today at 1:00 PM

Always happy to see improvements on explainable LLMs. Congrats!

umairnadeem123 | today at 5:24 AM

The practical value here is for regulated domains. In healthcare and finance you often can't deploy a model at all unless you can explain why it made a specific decision. Token-level attribution that traces back to training data sources could satisfy audit requirements that currently block LLM adoption entirely.

Curious how the performance compares to a standard Llama 8B on benchmarks; interpretability usually comes with a quality tax.

aziis98 | today at 11:57 AM

Does anybody know if I can try this online?

michaelmrose | today at 8:38 AM

Can you use this to decrease hallucinations?

rvz | today at 3:29 AM

Now this is something very interesting to see, and it might be the answer to the explainability issue with LLMs, which could unlock a lot more use-cases that are currently off limits.

We'll see.
