
kingstnap · yesterday at 7:29 PM

Runs of spaces of many different lengths are encoded as a single token. It's not actually inefficient.

In fact, everything from ' ' up to ' ' * 79 has a single token assigned to it in the OpenAI GPT-4 tokenizer. Sometimes ' ' * x + '\n' is also assigned a single token.
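
This is easy to check yourself. A minimal sketch, assuming the tiktoken package (which ships the GPT-4 "cl100k_base" encoding); the exact length cutoffs are whatever that vocabulary happens to contain:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer

    # Runs of spaces of various lengths: count the tokens each one takes.
    for n in (1, 4, 16, 79, 80):
        run = " " * n
        print(f"{n:>2} spaces -> {len(enc.encode(run))} token(s)")

    # Spaces followed by a newline can also collapse into a single token.
    print(enc.encode("    \n"))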

You might ask why they do this: it's to make programming work better by reducing token counts. All of the whitespace before a line of code gets jammed into a single token, and entire empty lines also get turned into a single token (see the sketch below).
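
For instance, here's a sketch (again assuming tiktoken) that decodes an indented line token by token so you can see how the leading whitespace gets grouped; the example line itself is made up:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    line = "        return x + 1\n"  # 8 spaces of indentation
    tokens = enc.encode(line)
    # Decode each token on its own to see how the whitespace was grouped.
    print([enc.decode([t]) for t in tokens])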

There are actually lots of interesting hand-crafted token features like this that don't get discussed much.