I never understood why people want this in the first place. Sure, making this step more human-explainable would be nice, and it might even fix some very particular problems for particular languages, but it goes directly against the primary objective of a tokenizer: optimizing sequence length vs. vocabulary size. That is a pretty clear, hard optimization target, and the best you can do is make sure your tokenizer training set more closely mimics your training and, ultimately, your inference data. Forcing English or German grammar in there will only degrade every other language in the tokenizer, and we already know that limiting additional languages hurts overall model performance. And the belief that you can hand-encode a dataset of trillions of tokens into a more efficient vocabulary than a machine can is kind of weird tbh. People have also accepted since the early convnet days that the best encoding representation for images in machine learning is not a human-understandable one. Same goes for audio. So why should text be any different? If you really think so, you might also wanna have a go at feature engineering images. And it's not like people haven't tried that; they all eventually learned their lesson.
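To make the "sequence length vs. vocabulary size" trade-off concrete, here is a minimal greedy BPE sketch (a toy, not any production tokenizer): every merge you allow adds one symbol to the vocabulary and shrinks the encoded sequence, and that's the whole game the tokenizer is playing.

```python
from collections import Counter

def bpe_merges(corpus: str, num_merges: int):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent pair of symbols. Each merge grows the vocabulary by one
    entry and (usually) shortens the encoded sequence."""
    seq = list(corpus)
    vocab = set(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)  # apply the merge
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        vocab.add(a + b)
    return seq, vocab

corpus = "low lower lowest low low"
for n in (0, 4, 8):
    seq, vocab = bpe_merges(corpus, n)
    print(f"merges={n}: seq length={len(seq)}, vocab size={len(vocab)}")
```

Running this shows the sequence getting shorter as the vocabulary grows, which is exactly the dial real tokenizer training turns on a trillions-of-tokens corpus.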
If you want to make it more human-explainable, then ditch the entire tokenizer and just feed the models raw characters. Because now there is nothing to explain.
I feel it's a case of "This random word generator can't possibly be smarter than I?!"
We usually build the tokenizer by optimizing for one goal (space-efficient encoding of text), then use it in a model that is trained for an entirely different goal (producing good text, "reasoning", "coding", etc.). It is not immediately clear that the optimization goal for the tokenizer is actually the one that best serves the training of the LLM.
That's what all these attempts boil down to. They don't presume to be able to find a more space-efficient encoding by hand; they assume that the optimization goal for the tokenizer was wrong and that they can do better by adding some extra rules. And this isn't entirely without precedent: most tokenizers have a couple of "forced" tokens that were not organically discovered, and changing how digits are grouped in the tokenizer is another place where wins have been shown.
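As a toy illustration of what such an extra rule looks like (this is illustrative only, not any particular tokenizer's actual behavior), here is a hand-written pre-tokenization pass that forces digit runs into fixed-size groups before BPE ever sees them:

```python
import re

def group_digits(text: str, group: int = 3) -> list[str]:
    """Toy pre-tokenization rule: split runs of digits into fixed-size
    groups from the left, so '1234567' -> '123', '456', '7'.
    Real tokenizers differ in the details (grouping direction, size)."""
    out = []
    # re.split with a capturing group keeps the digit runs in the result
    for piece in re.split(r"(\d+)", text):
        if piece.isdigit():
            out.extend(piece[i:i + group] for i in range(0, len(piece), group))
        elif piece:  # skip the empty strings re.split can produce
            out.append(piece)
    return out

print(group_digits("pi is 3141592653"))
# -> ['pi is ', '314', '159', '265', '3']
```

The point is that rules like this overrule the compression objective on purpose, betting that the downstream model benefits even though the encoding gets slightly less space-efficient.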
This is where projects like nanochat are really valuable: they let you try out tweaks like this quickly and (relatively) cheaply.