Hacker News

kouteiheika, today at 7:26 AM

> This is almost certainly wrong.

So how would you explain the increase in token usage, given that tokenizers are conventionally trained to minimize token usage within a given vocabulary budget?

> Putting an inductive bias in your tokenizer seems just a terrible idea.

You're already effectively doing this by the sheer fact of using a BPE tokenizer, and especially with modern BPE-based LLM tokenizers[1]. I agree that trying to bake this into a tokenizer manually is most likely not a good idea, but I could see a world where a better tokenizer-training algorithm takes the natural morphology of the underlying text into account.

[1] Example from Qwen3.6 tokenizer:

    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      }
    ]
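To see what that pattern actually does, here is a sketch using Python's stdlib `re` with a simplified ASCII-only analogue of the regex above. (The real pattern uses Unicode property classes like `\p{L}`/`\p{M}`/`\p{N}` and a case-insensitive contraction group, which plain `re` doesn't fully support; this is an illustration, not the actual tokenizer.)

```python
import re

# Simplified ASCII-only analogue of the Qwen-style pre-tokenization regex.
# Differences from the original: no Unicode property classes, and the
# contraction alternatives are matched case-sensitively.
PAT = re.compile(
    r"(?:'s|'t|'re|'ve|'m|'ll|'d)"   # common English contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"   # a word, optionally with one leading non-letter (e.g. a space)
    r"|[0-9]"                        # each digit becomes its own chunk
    r"| ?[^\sA-Za-z0-9]+[\r\n]*"     # runs of punctuation/symbols
    r"|\s*[\r\n]+"                   # newlines
    r"|\s+(?!\S)|\s+"                # remaining whitespace
)

chunks = PAT.findall("Hello world, it's 2024!")
# Words carry their leading space, and each digit is split off separately:
# ['Hello', ' world', ',', ' it', "'s", ' ', '2', '0', '2', '4', '!']
```

BPE merges then run only within each chunk, never across chunk boundaries.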

Replies

nl, today at 7:42 AM

> So how would you explain the increase in token usage, given that tokenizers are conventionally trained to minimize token usage within a given vocabulary budget?

Just modeling whitespace as its own token would seem to explain the increase.
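A toy illustration of that claim (pure Python, not any real tokenizer's output): when the leading space is merged into the word chunk, a space-separated sentence costs one token per word, while modeling whitespace as its own token nearly doubles the count.

```python
words = ["the", "cat", "sat", "on", "the", "mat"]

# Scheme A: the leading space travels with the word (GPT/Qwen-style
# pre-tokenization), so 6 words -> 6 chunks.
merged = [words[0]] + [" " + w for w in words[1:]]

# Scheme B: whitespace is modeled as its own token,
# so 6 words -> 11 chunks for the same sentence.
separate = []
for i, w in enumerate(words):
    if i:
        separate.append(" ")
    separate.append(w)

print(len(merged), len(separate))  # 6 11
```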

> Qwen3.6 tokenizer: "pretokenizer"

That's the pre-tokenizer, not the tokenizer. It's mostly a performance optimization that keeps the memory requirements of the BPE tokenizer much lower.
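One way to see the memory angle (a hedged sketch, not how any particular library implements it): BPE training maintains statistics over adjacent symbol pairs, and pre-tokenization guarantees no pair spans a chunk boundary, so the candidate-merge table stays smaller.

```python
from collections import Counter

def pair_counts(chunks):
    # Count adjacent symbol pairs, never across chunk boundaries,
    # mirroring how pre-tokenization confines BPE merges.
    counts = Counter()
    for chunk in chunks:
        for a, b in zip(chunk, chunk[1:]):
            counts[(a, b)] += 1
    return counts

text = "low lower lowest"
no_pretok = pair_counts([text])            # one giant chunk
pretok = pair_counts(text.split(" "))      # toy whitespace pre-tokenizer

# Pre-tokenization drops cross-boundary pairs like ('w', ' ') and (' ', 'l'),
# shrinking the pair-statistics table BPE must keep in memory.
print(len(no_pretok), len(pretok))  # 9 6
```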

> I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.

Everyone moved to BPE because it was dramatically better than morphology-based tokenizers. See the BPE paper: https://arxiv.org/abs/1508.07909

BPE already learns morphology because it sees the raw bytes.
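You can watch this happen with a minimal BPE trainer on the word/frequency example from that paper: the suffix pieces "es"/"est" emerge as the very first merges, with no morphological supervision. (Toy sketch with greedy tie-breaking by insertion order, not a production implementation.)

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # word_freqs: dict word -> frequency; symbols start as single characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, f in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for sym, f in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges

# Word frequencies from the running example in Sennrich et al. (2016).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(corpus, 4)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

The "est" suffix is learned purely from pair frequencies, before the stem "low" is even assembled.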
