Hacker News

TeMPOraL · today at 10:18 AM · 28 replies

Oh boy. Someone didn't get the memo that for LLMs, tokens are units of thinking. I.e. whatever computation needs to happen to produce the result you seek has to fit in the tokens the LLM produces. Being a finite system, the LLM's internals can only do so much computation per token, so the more you force the model to be concise, the harder the task becomes for it. In the worst case, you can guarantee a bad answer, because the task requires more computation than the tokens produced can carry.

I.e. by demanding that the model be concise, you're literally making it dumber.

(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
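
The "fixed compute per token" premise above can be made concrete with the standard back-of-envelope estimate of roughly 2·N FLOPs per generated token for an N-parameter dense decoder. A rough sketch; the constant and the model size are illustrative assumptions, and attention's quadratic term and KV-cache effects are ignored:

```python
def generation_flops(n_params: float, n_tokens: int) -> float:
    """Rough forward-pass cost: ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params * n_tokens

# Hypothetical 7B-parameter model: the only way to spend more compute on
# an answer is to emit more tokens, so a forced-concise 100-token reply
# gets a quarter of the compute of a 400-token one.
verbose = generation_flops(7e9, 400)
concise = generation_flops(7e9, 100)
assert verbose / concise == 4.0
```

Under this toy model, cutting 75% of output tokens cuts 75% of the generation-time compute available to the task, which is exactly the top comment's worry.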


Replies

jstummbillig · today at 11:03 AM

What do you mean? The page explicitly states:

> cutting ~75% of tokens while keeping full technical accuracy.

I have no clue whether this claim holds, but pretending they did not address the obvious criticism, when they did, is at the very least pretty lazy.

An explanation that explains nothing is not very interesting.

show 4 replies
dTal · today at 3:31 PM

Yeah, but not all tokens are created equal. Some tokens are hard to predict and thus encode useful information; some are highly predictable and therefore don't. Spending an entire forward pass through the token-generation machine just to emit a very low-entropy token like "is" is wasteful. The LLM doesn't get to "remember" that thinking; it just gets to see a trivial grammar-filling token that a very dumb LLM could just as easily have produced. They aren't steganographically hiding useful computation state in words like "the" and "and".
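
The entropy point can be sketched numerically: a token's information content is its surprisal, −log₂ p, so a near-certain filler token carries almost nothing. The probabilities below are made-up illustrations, not measurements from any real model:

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content of a token the model assigned probability p."""
    return -math.log2(p)

# Hypothetical next-token probabilities:
filler = surprisal_bits(0.98)    # a grammar-filling "is" the model saw coming
content = surprisal_bits(0.02)   # a genuinely surprising content word
```

Here `filler` works out to about 0.03 bits while `content` is about 5.6 bits, so one forward pass spent on the filler token buys almost no information.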

show 3 replies
vova_hn2 · today at 11:20 AM

Yeah, I don't think that "I'd be happy to help you with that" or "Sure, let me take a look at that for you" carries much useful signal that can be used for the next tokens.

show 3 replies
andy99 · today at 1:14 PM

I’ve heard this, but I don’t automatically believe it, nor do I understand why it would need to be true. I’m still caught on the old-fashioned idea that the only “thinking” for autoregressive models happens during training.

But I assume this has been studied? Can anyone point to papers that show it? I’d particularly like to know what the curves look like. It’s clearly not linear, so if you cut out 75% of tokens, what do you expect to lose?

I do imagine there is not a lot of caveman speak in the training data, so results may be worse because they don’t fit the patterns that reinforcement learning has baked in.

show 2 replies
kubb · today at 10:46 AM

This is condescending and wrong at the same time (best combo).

LLMs do stumble into long prediction chains that don’t lead the inference in any useful direction, wasting tokens and compute.

show 1 reply
NiloCK · today at 10:37 AM

I agree with this take in general, but I think we need to be prepared for nuance when thinking about these things.

Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking things to the point of reaching a wrong answer when their "gut" response would have been better. I do not contend that this is the default mode, but it is both possible and more or less likely on one kind of problem than another, with the problem categories to be determined.

A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.

More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low-probability tokens in a row, and the likelihood of that happening in a given response scales ~linearly with the length of the response.
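
The ~linear scaling follows from a simple independence approximation: if each token position independently starts an unlucky low-probability run with probability p, the chance of at least one such run in n tokens is 1 − (1 − p)^n ≈ n·p for small p. This is a toy model with a made-up p, not a claim about any real decoder:

```python
def p_any_bad_run(p: float, n: int) -> float:
    """Chance that at least one of n token positions starts a
    low-probability run, assuming positions are independent."""
    return 1.0 - (1.0 - p) ** n

# With an assumed p = 0.001 per position, a 4x longer response is
# roughly (but slightly less than) 4x as risky:
short_resp = p_any_bad_run(0.001, 100)
long_resp = p_any_bad_run(0.001, 400)
```

Here `short_resp` is about 0.095 and `long_resp` about 0.33, so longer responses accumulate risk roughly in proportion to their length until p·n gets large.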

show 1 reply
avaer · today at 10:40 AM

That was my first thought too -- instead of talk like a caveman you could turn off reasoning, with probably better results.

Additionally, LLMs do not actually operate in text; much of the thinking happens in a far higher-dimensional space that just happens to be decoded as text.

So unless the LLM was trained otherwise, making it talk like a caveman is more than just a theoretical risk of turning it into one.

show 3 replies
strogonoff · today at 2:03 PM

A fundamental (but sadly common) error behind “tokens are units of thinking” is anthropomorphising the model as a thinking being. That’s a pretty wild claim that requires a lot of proof, and possibly solving the hard problem of consciousness, before it can be taken seriously.

There’s a less magical model of how LLMs work: they are essentially fancy autocomplete engines.

Most of us probably have an intuition that the more context you give an autocomplete, the better the results it will yield. But does this extend to the autocomplete's output, i.e. the more tokens it uses for the result, the better?

It could well be true in context of chain of thought[0] models, in the sense that the output of a preceding autocomplete step is then fed as input to the next autocomplete step, and therefore would yield better results in the end. In other words, with this intuition, if caveman speak is applied early enough in the chain, it would indeed hamper the quality of the end result; and if it is applied later, it would not really save that many tokens.
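
The chained-autocomplete intuition can be sketched as a loop where each step's output is appended to the next step's input; `generate` here is a hypothetical stand-in for one model call, not any real API:

```python
def chain_of_thought(prompt: str, generate, steps: int = 3) -> str:
    """Each autocomplete pass sees everything produced so far, so a
    degraded early step degrades the input of every later step."""
    context = prompt
    for _ in range(steps):
        context += "\n" + generate(context)  # output feeds the next step
    return context

# Mock stand-in for a model call: records how much context it was given.
trace = chain_of_thought("question", lambda ctx: f"[saw {len(ctx)} chars]")
```

The mock makes the dependency visible: each step's "output" reflects the full accumulated context, which is why caveman-speak applied early in the chain would propagate through everything after it.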

Willing to be corrected by someone more familiar with NN architecture, of course.

[0] I can see “thinking” used as a term of art, distinct from its regular meaning, when discussing “chain of thought” models; sort of like what “learning” is in “machine learning”.

show 1 reply
HarHarVeryFunny · today at 1:37 PM

That's going to depend on what model you're using with Claude Code. All of the more recent Anthropic models (4.5 and 4.6) support thinking, so the number of tokens generated ("units of thought") isn't directly tied to the verbosity of input and non-thought output.

However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.

It's a bit like asking an LLM to predict next move in a chess game - it's not going to predict the best move that it can, but rather predict the next move that would be played given what it can infer about the ELO rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move since that's the best prediction.

Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.

pxc · today at 2:51 PM

If this is true, shouldn't LLMs perform way worse when working in Chinese than in English? It seems like an easy thing to study, since there are so many Chinese LLMs that work in both Chinese and English.

Do LLMs generally perform better in verbose languages than they do in concise ones?

show 1 reply
baq · today at 10:21 AM

Do you know of evals with default Claude vs caveman Claude vs politician Claude solving the same tasks? The hypothesis is plausible, but I wouldn’t take it for granted.

marginalia_nu · today at 1:21 PM

I wonder if a language like Latin would be useful.

It's a significantly more succinct semantic encoding than English while being able to express all the same concepts, since it folds a lot of glue words into the grammar of the language and conventionally lets you drop many pronouns.

e.g.

"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).

show 2 replies
zozbot234 · today at 12:05 PM

Grug says you quite right, token unit thinking, but empty words not real thinking and should avoid. Instead must think problem step by step with good impactful words.

raincole · today at 10:43 AM

When it comes to LLMs, you really cannot draw conclusions from first principles like this. Yes, it sounds reasonable. But things in reality aren't always reasonable.

Benchmark or nothing.

show 1 reply
hackerInnen · today at 12:44 PM

You are absolutely right! That is exactly the reason why more lines of code always produce a better program. Straight on, m8!

show 1 reply
andai · today at 10:20 AM

I remember a while back they found that replacing reasoning tokens with placeholders ("....") also boosted results on benchies.

But does talk like caveman make number go down? Less token = less think?

I also wondered, due to the way LLMs work, if I ask AI a question using fancy language, does that make it pattern match to scientific literature, and therefore increase the probability that the output will be true?

afro88 · today at 11:19 AM

IIUC this doesn't make the LLM think in caveman (the thinking tokens). It just makes the final output come out in caveman.

Demiurg082 · today at 2:40 PM

CoT tokens are usually controlled via 'extended thinking' or 'adapted thinking', and are usually not affected by the system prompt. There is an effort parameter, though, which is documented to trade accuracy against overall token consumption.

https://platform.claude.com/docs/en/build-with-claude/extend...
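
A sketch of what that separation looks like as a request payload, based on the linked extended-thinking docs. The field names and the model id here are assumptions to verify against that page, and no network call is made:

```python
def build_request(user_prompt: str, thinking_budget: int = 4096,
                  max_tokens: int = 8192) -> dict:
    """Sketch of a Messages API payload: the CoT budget lives in its own
    `thinking` block, separate from the prompt and the visible output cap."""
    return {
        "model": "claude-sonnet-4-5",   # placeholder model id
        "max_tokens": max_tokens,       # cap on the visible reply
        "thinking": {
            "type": "enabled",
            "budget_tokens": thinking_budget,
        },
        "messages": [{"role": "user", "content": user_prompt}],
    }
```

If this shape is right, it illustrates the point above: a caveman-speak system prompt would shrink the visible reply without touching the thinking budget.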

show 1 reply
xgulfie · today at 1:14 PM

Ah, so obviously if we make the LLM repeat itself three times for every response, it will get smarter.

agumonkey · today at 10:40 AM

How do we know whether a token sits at an abstract level or just the textual level?

PufPufPuf · today at 12:17 PM

You mention thinking tokens as a side note, but their existence invalidates your whole point. Virtually all modern LLMs use thinking tokens.

cyanydeez · today at 10:57 AM

It's not "units of thinking", it's "units of reference"; as long as what it produces references the necessary probabilistic algorithms, it'll do just fine.

otabdeveloper4 · today at 11:36 AM

LLMs don't think at all.

Forcing it to be concise doesn't work because it wasn't trained on token strings that short.

show 2 replies
kogold · today at 11:46 AM

[flagged]

show 10 replies
taneq · today at 2:11 PM

More concise is dumber. Got it.

Rexxar · today at 10:36 AM

> Someone didn't get the memo that for LLMs, tokens are units of thinking.

Where do you get this memo? It seems completely wrong to me. More computation does not translate to more "thinking" if you compute the wrong things (i.e. things that contribute little to the final sentence meaning).

show 1 reply