Hacker News

adebayoj · yesterday at 8:02 AM

OP here. Important point, but I disagree. We see explainability/interpretability as a CORE need for AI safety. We believe you can't align/audit/debug/fix a system that you don't understand.

To give a few concrete examples of what we can do:

1) We can find the training data that causes a model to output toxic/unwanted text and correct it.

2) We know which high-level concepts the model relies on for any group of tokens it generates, so reducing that generation is as simple as toggling that concept's effect on the output.

Most AI safety techniques fall under fine-tuning. Our model allows you to do this without fine-tuning: you can toggle the presence of these concepts directly.
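The concept-toggling idea in (2) resembles activation steering / feature ablation from the interpretability literature. A minimal numpy sketch of the core operation, assuming a known concept direction in activation space (the function name, the direction, and the toy numbers are all illustrative, not the OP's actual API):

```python
import numpy as np

def toggle_concept(activation, concept_dir, strength=0.0):
    """Rescale the component of an activation along a concept direction.

    strength=0.0 removes the concept entirely; strength=1.0 is a no-op;
    strength>1.0 amplifies it.
    """
    d = concept_dir / np.linalg.norm(concept_dir)  # unit concept direction
    component = activation @ d                      # scalar projection
    return activation + (strength - 1.0) * component * d

# Toy example: a 3-d activation with a strong component along a
# hypothetical "toxicity" direction d.
d = np.array([1.0, 0.0, 0.0])
act = np.array([3.0, 2.0, -1.0])
clean = toggle_concept(act, d, strength=0.0)
# clean → [0., 2., -1.]  (the concept component is fully projected out)
```

In a real model the same operation would be applied to hidden states during the forward pass (e.g. via a hook on the relevant layer), with `concept_dir` learned by whatever interpretability method identifies the concept.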

For example, wouldn't you like to know why a model is being sycophantic? Or sandbagging? Is a particular kind of training data causing this, or some high-level part of the model's representations? For any of this, our model can tell you exactly why it generated that output. Over the coming weeks, we'll show exactly how you can do this!
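Tracing an output back to training data, as in (1), is often approximated by gradient similarity (TracIn-style influence): training examples whose loss gradients align with the query's gradient pushed the model toward that behaviour. A toy sketch on a linear model, assuming squared-error loss (everything here, including the weights and data, is illustrative, not the OP's method):

```python
import numpy as np

def grad(w, x, y):
    # gradient of squared error 0.5 * (w·x - y)^2 with respect to w
    return (w @ x - y) * x

w = np.array([0.5, -0.2])           # current model weights
train = [(np.array([1.0, 0.0]),  1.0),
         (np.array([0.0, 1.0]), -1.0),
         (np.array([1.0, 1.0]),  0.0)]
query_x, query_y = np.array([1.0, 0.0]), 1.0  # the output we want to explain

gq = grad(w, query_x, query_y)
# score each training example by gradient alignment with the query
scores = [grad(w, x, y) @ gq for x, y in train]
most_influential = int(np.argmax(scores))
# → 0: the first training example most reinforces the query behaviour
```

Real attribution methods sum these scores over checkpoints and use per-token gradients in a full network, but the ranking principle is the same.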


Replies

vintagedave · yesterday at 9:12 AM

This is fantastic to read. LLMs feel like black boxes and for the large ones especially I have a sense they genuinely form concepts. Yet the internals were opaque. I remember reading how LLMs cannot explain their own behaviour when asked.

I feel this would give insight into all that including the degree of true conceptualisation. I’m curious if this can demonstrate what else the model is aware of when answering, too.

ottah · yesterday at 9:02 PM

> wouldn't you like to know why a model is being sycophantic? Or sandbagging?

Actually, emphatically no. The only thing I care about is that I have recourse. The reason shouldn't matter; in fact, explainability can be an impediment to accountability. It's just another plausible barrier to a remedy that a bureaucracy can use to deny changing a decision.

0xdeadbeefbabe · yesterday at 3:59 PM

Hmm so like git blame?