logoalt Hacker News

aminerjyesterday at 8:06 AM1 replyview on HN

The context decay point is also underappreciated and directly relevant here. In my lab I used Qwen2.5-7B, which is on the smaller end, and the poisoning succeeded at temperature=0.1 where the model is most deterministic. Your point suggests that at higher temperatures or with denser, more complex documents, the attention budget gets consumed faster and contradiction detection degrades further. That would imply the 10% residual I measured at optimal conditions is a lower bound, not a typical case.

The "thinking" capability observation is interesting. I haven't tested a reasoning model against this attack pattern. The hypothesis would be that an explicit reasoning step forces the model to surface the contradiction between the legitimate $24.7M figure and the "corrected" $8.3M before committing to an answer. That seems worth testing.

On chain of custody: this connects to the provenance metadata discussion elsewhere in this thread. The most actionable version might be surfacing document metadata directly in the prompt context so the model's reasoning step has something concrete to work with, not just competing content.


Replies

ineedasernameyesterday at 6:22 PM

>That seems worth testing

I have-- I see your info via your HN profile. If I have a spare moment this weekend I'll reach out there, I'll dig up a few examples and take screenshots. I built an exploration tool for investigating a few things I was interested in, and surfacing potential reasoning paths exhibited in the tokens not chosen was one of them.

Part of my background is in Linguistics-- classical not just NLP/comp-- so the pragmatics involved with disfluencies made that "wait..." pattern stand out during just normal interactions with LLM's that showed thought traces. I'd see it not too infrequently eg by expanding the "thinking..." in various LLM chat interfaces.

In humans it's not a disfluency in the typical sense of difficulty with speech production, it's a pragmatic marker, let's the listener know a person is reevaluating something they were about to say. It of course carries over into writing, either in written dialog or less formal self-editing contexts, so it's well represented in any training corpora. As such, being a marker of "rethinking", it stood to reason models' "thinking" modes displayed it-- not unlikely it's specifically trained for.

So it's one of the things I went token-diving to see "close up", so to speak, in non-thinking models too. It's not hard to induce a reversal or at least diversion off whatever it would have said-- if close to a correct answer there's a reasonable chance it will get the correct one instead of pursuing a more likely of the top k. This wasn't with Qwen, it was gemma 3 1b where I did that particular exploration. It wasn't a systematic process I was doing for a study, but I found it pretty much any time I went looking-- I'd spot a decision point and perform the token injection.

If I have the time I'll mockup a simple RAG scenario, just inject the documents that would be retrieved from RAG result similar to your article, and screenshot that in particular. A bit of a toy setup but close enough to "live" that it could point the direction towards more refined testing, however the model responds, and putting aside the publishing side of these sorts of explorations there's a lot of practical value in assisting with debugging the error rates.