Hacker News

Show HN: Steerling-8B, a language model that can explain any token it generates

307 points by adebayoj | today at 12:38 AM | 87 comments

Comments

crimsonnoodle58 | today at 12:12 PM

So maybe one day we'll see coding agents like Claude Code create and update an ATTRIBUTION.md, citing all the open source projects and their licenses used to generate code in your project?

msteffen | today at 2:53 PM

In the recent HN thread announcing the new Gemini coding agent (https://news.ycombinator.com/item?id=47074735), a lot of people complained about Gemini’s tendency to do unwanted refactors, not perform requested actions, etc.

It made me cautiously optimistic that all of Anthropic's work on alignment, which they did for AI safety, is actually the cause of Claude Code's comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I'm suddenly really interested in experiments like this.

ottah | today at 5:17 AM

It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.

gorme | today at 5:09 AM

Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.

killerstorm | today at 5:18 PM

This seems too coarse-grained to be useful: all science-y content will be "analytical" and associated with sources like arXiv.

But there might be bad, malicious articles on arXiv, so it doesn't really say anything about veracity.

Perhaps this might help detect some problems like prompt injection, but then it would be more interesting to see those examples.

kamranjon | today at 3:02 PM

I'm really interested in using this, but I wonder if the unique architecture means it won't be able to be converted to a GGUF and used by Ollama or llama.cpp. I certainly understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local AI server (basically just Ollama + Tailscale) and see how it works as a regular model.

pu_pe | today at 12:24 PM

Looks neat and original, congrats!

I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to arXiv.

Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
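One plausible reading of the first interpretation (purely an illustration with made-up numbers, not Steerling's actual mechanism): per-training-example influence scores get aggregated by source corpus and normalized into percentages, something like:

```python
from collections import defaultdict

def source_attribution(example_scores):
    # Aggregate hypothetical per-training-example influence scores
    # by source corpus, then normalize to percentages.
    totals = defaultdict(float)
    for source, score in example_scores:
        totals[source] += score
    grand = sum(totals.values())
    return {s: round(100 * v / grand, 1) for s, v in totals.items()}

# Made-up influence scores for one generated sentence
scores = [("wikipedia", 0.10), ("wikipedia", 0.14),
          ("arxiv", 0.23), ("common_crawl", 0.53)]
print(source_attribution(scores))
```

Under that reading, "24% Wikipedia" would mean Wikipedia examples account for 24% of the total measured influence on the sentence, not that the sentence was interpolated from specific passages.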

deepdarkforest | today at 11:58 AM

Just wanted to say I think most interpretability research is just a smoke show nowadays, but this is actually the first one that I think has very serious potential. I love that the SAE is actually constrained and not just slapped on unsupervised post hoc.

How granular can you get the source data attribution? Down to, say, individual Wikipedia topics? Probably not URLs?

Would be interested to see this scale to 30B/70B.

brendanashworth | today at 3:25 AM

Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.

[1] https://shap.readthedocs.io/en/latest/
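For intuition about what SHAP estimates, here is an exact Shapley-value calculation on a toy token-scoring function. This is a from-scratch illustration of the principle only, not the shap library's API; the real library uses efficient approximations rather than brute-force enumeration over orderings:

```python
from itertools import permutations

def shapley_values(f, features):
    # Exact Shapley values: average each feature's marginal
    # contribution over all orderings of the features.
    phi = {x: 0.0 for x in features}
    perms = list(permutations(features))
    for order in perms:
        present = set()
        prev = f(present)
        for x in order:
            present.add(x)
            cur = f(present)
            phi[x] += cur - prev
            prev = cur
    return {x: v / len(perms) for x, v in phi.items()}

# Toy "model": sentiment score of a token set, with an interaction term
def score(tokens):
    s = 0.0
    if "not" in tokens: s -= 1.0
    if "good" in tokens: s += 2.0
    if "not" in tokens and "good" in tokens: s -= 3.0  # "not good"
    return s

print(shapley_values(score, ["not", "good", "movie"]))
```

The interaction term is what makes attribution non-trivial here: neither token alone carries the sentiment of the pair, and the Shapley values split that interaction between them.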

pbmango | today at 3:24 AM

This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it to either be solved, or to be too out of reach to bother stopping and thinking about.

rippeltippel | today at 1:03 PM

Also featured on TechCrunch: https://news.ycombinator.com/item?id=47129292

andy12_ | today at 9:53 AM

This seems really interesting. While Anthropic tried to use dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).

great_psy | today at 3:49 AM

Maybe I’m not creative enough to see the potential, but what value does this bring?

Given the example I saw about CRISPR, what does this model give over a different, non-explaining model in the output? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia?

I find that LLM outputs are subtly wrong, not obviously wrong.

potato-peeler | today at 9:43 AM

Looks very interesting. Is there a published paper/article on your algorithm? I would like to take a stab at implementing this on my own.

I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)

[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...

schopra909 | today at 3:19 PM

This is very cool. Side note: I really dig the JavaScript animations on the causal block diffusion blog post. They made the concept immediately clear.

in-silico | today at 6:30 AM

Either I'm missing something or this is way overstated.

Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.

They also seem to use a loss that aligns the SAE's activations with labelled concepts. However, this is an example of "The Most Forbidden Technique" [1], and it could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.

1: https://thezvi.substack.com/p/the-most-forbidden-technique
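For context on what that bottleneck looks like mechanically, here is a toy sketch of an SAE layer sitting between a final hidden state and the LM head. Everything here (dimensions, TopK sparsity, random weights) is an assumption for illustration, not Steerling's actual architecture:

```python
import random

def sae_forward(h, W_enc, b_enc, W_dec, k=4):
    # Encode the hidden state into concept pre-activations.
    pre = [sum(w * x for w, x in zip(row, h)) + b
           for row, b in zip(W_enc, b_enc)]
    acts = [max(0.0, p) for p in pre]  # ReLU
    # TopK sparsity: keep only the k largest activations.
    top = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    sparse = [a if i in top else 0.0 for i, a in enumerate(acts)]
    # Decode back to model dimension; this reconstruction is what
    # the LM head would consume instead of the raw hidden state.
    recon = [sum(sparse[j] * W_dec[j][d] for j in range(len(sparse)))
             for d in range(len(h))]
    return sparse, recon

random.seed(0)
d_model, n_concepts = 8, 32
h = [random.gauss(0, 1) for _ in range(d_model)]
W_enc = [[random.gauss(0, 0.2) for _ in range(d_model)] for _ in range(n_concepts)]
b_enc = [0.0] * n_concepts
W_dec = [[random.gauss(0, 0.2) for _ in range(d_model)] for _ in range(n_concepts)]
sparse, recon = sae_forward(h, W_enc, b_enc, W_dec, k=4)
print(sum(1 for a in sparse if a > 0.0))  # active concepts, at most k
```

The concern in [1] is about the training signal: if a concept-alignment loss is applied to these activations, the labels shape what the dictionary reports without guaranteeing those concepts drive the decode path.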

MagicMoonlight | today at 1:03 PM

Seems pretty cool. You can simply block the concept of Tiananmen Square and it will be permanently removed from the brain. Ideal.

exabrial | today at 4:19 PM

Hilariously, I read this as "can't explain" for a second and was like, "Wait, isn't that what today's models do?"

whinvik | today at 12:09 PM

Looks very interesting. Can you comment on why you think this model can give comparable performance with less training data?

7777777phil | today at 8:21 AM

If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.

ZeroAurora | today at 1:00 PM

Always happy to see improvements on explainable LLMs. Congrats!

umairnadeem123 | today at 5:24 AM

The practical value here is for regulated domains. In healthcare and finance you often can't deploy a model at all unless you can explain why it made a specific decision. Token-level attribution that traces back to training data sources could satisfy audit requirements that currently block LLM adoption entirely.

Curious how the performance compares to a standard Llama 8B on benchmarks; interpretability usually comes with a quality tax.

aziis98 | today at 11:57 AM

Does anybody know if I can try this online?

michaelmrose | today at 8:38 AM

Can you use this to decrease hallucinations?

rvz | today at 3:29 AM

Now this is something very interesting to see, and it might be the answer to the explainability issue with LLMs, which could unlock a lot more use-cases that are currently off limits.

We'll see.
