YetAnotherNick • today at 6:13 AM • 2 replies

Did you even try to verify your claims? I tested it on a few translations of Wikipedia articles using [1], and Norwegian takes 15-20% more tokens.

English performs the best because there is more data in English, and high-quality sources are either only in English or have a good English translation.

[1]: https://platform.openai.com/tokenizer

Replies

numpad0 • today at 11:44 AM

Tokenizer efficiency varies by language, by as much as 15x. This is very well known and established.

  https://www.google.com/search?q=tokenizer+efficiency+by+language

tecleandor • today at 11:16 AM

In tests I've done with NO and FI texts, for the same number of characters, the GPT5 tokenizer gives around 2x the tokens of EN. With the older tokenizers it's more like 2x or even 3x.
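For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the per-character measurement. The function names (`tokens_per_char`, `relative_cost`, `byte_encode`) are made up for illustration, and the UTF-8 byte "tokenizer" is only a stand-in so the snippet runs without any dependencies; it is not what any of the tests in this thread used. The sample strings are also just placeholders, not real test data.

```python
from typing import Callable, Sequence

# Any function that maps a string to a sequence of token ids.
Encoder = Callable[[str], Sequence[int]]


def tokens_per_char(text: str, encode: Encoder) -> float:
    """Average number of tokens the encoder spends per character."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / len(text)


def relative_cost(sample: str, baseline: str, encode: Encoder) -> float:
    """Tokens per character of `sample`, relative to `baseline`."""
    return tokens_per_char(sample, encode) / tokens_per_char(baseline, encode)


def byte_encode(text: str) -> list[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Placeholder samples; use real paragraphs for a meaningful number.
english = "blueberry jam"
norwegian = "blåbærsyltetøy"

ratio = relative_cost(norwegian, english, byte_encode)
print(f"{ratio:.2f}x tokens per character vs. English")
```

To measure a real tokenizer, pass its encode function in place of `byte_encode`, for example `tiktoken.get_encoding("o200k_base").encode` if you have the `tiktoken` package installed. Note that this normalizes by character count. Comparing translations of the same article is a different measurement: there you would divide the raw token counts instead, which may explain part of the gap between the numbers quoted above.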
