
arjie, yesterday at 8:54 PM

Character density and token efficiency are different things. The latter is data- and therefore tokenizer-specific: e.g., take GPT-5's tokenizer, o200k_base, and run Mandarin text and its English translation through it. Some of the time en will beat zh. I just tested this with news articles and Wikipedia.

After all, `def func():` is only 3 tokens on o200k_base.