Hacker News

YetAnotherNick today at 6:13 AM (2 replies)

Did you even try to verify your claims? I tested it on a few translations of Wikipedia articles using [1], and Norwegian takes 15-20% more tokens than English.

English performs the best because there is more data in English, and high-quality sources are either available only in English or have a good English translation.

[1]: https://platform.openai.com/tokenizer


Replies

numpad0 today at 11:44 AM

Tokenizer efficiency varies by language, by as much as 15x, and this is very well known and established.

  https://www.google.com/search?q=tokenizer+efficiency+by+language

tecleandor today at 11:16 AM

In tests I've done with NO and FI texts, for the same number of characters, I get around 2x the tokens of EN with the GPT5 tokenizer. With the older tokenizers it's more like 2x or even 3x.
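
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the "same number of characters" measurement, not anyone's actual test script. The names `tokens_per_char`, `relative_cost`, and `utf8_byte_encode` are hypothetical, and the stand-in encoder (one token per UTF-8 byte) is only an assumption so the snippet runs without downloads. The sample sentences are made up for illustration.

```python
from typing import Callable, Sequence


def tokens_per_char(encode: Callable[[str], Sequence[int]], text: str) -> float:
    """Tokens emitted per character of input (0.0 for empty text)."""
    return len(encode(text)) / len(text) if text else 0.0


def relative_cost(
    encode: Callable[[str], Sequence[int]], text: str, baseline: str
) -> float:
    """How many times more tokens per character `text` needs than `baseline`."""
    base = tokens_per_char(encode, baseline)
    if base == 0:
        raise ValueError("baseline text produced no tokens")
    return tokens_per_char(encode, text) / base


def utf8_byte_encode(text: str) -> list[int]:
    """Stand-in encoder (assumption for illustration): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Illustrative sample sentences, not real benchmark data.
english = "The weather is nice today."
norwegian = "Været er fint i dag."

# Ratio of Norwegian to English tokens per character under the stand-in encoder.
print(relative_cost(utf8_byte_encode, norwegian, english))
```

To measure a real tokenizer, pass its encode function instead of the stand-in, e.g. `tiktoken.get_encoding("o200k_base").encode` if tiktoken is installed. Note that this normalizes per character; comparing translations of the same article, as in the parent comment, measures tokens per unit of content instead, so the two methods can give different ratios.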