Hacker News

simonask, today at 8:34 AM

All of Europe outside of the UK and English-speaking Ireland needs characters outside of ASCII, but most letters are still ASCII. For example, the Danish string "blåbærgrød" (blueberry porridge) has about the densest occurrence of non-ASCII characters you will find, and that's still only 30% (3 of 10 letters). It takes 13 bytes in UTF-8, but 20 bytes in UTF-16.
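The arithmetic is easy to check. Here is a minimal Python sketch; the helper name `encoded_sizes` is made up for illustration, and it uses `utf-16-le` so that Python's 2-byte byte-order mark doesn't inflate the UTF-16 count. It normalizes to NFC first, on the assumption that we want precomposed letters (å as one code point rather than "a" plus a combining ring).

```python
import unicodedata


def encoded_sizes(text: str) -> dict:
    """Count characters, non-ASCII characters, and encoded byte sizes."""
    # Normalize to NFC so each accented letter is a single code point.
    text = unicodedata.normalize("NFC", text)
    return {
        "chars": len(text),
        "non_ascii": sum(1 for ch in text if ord(ch) > 127),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" omits the BOM that plain "utf-16" would prepend.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


print(encoded_sizes("blåbærgrød"))
# {'chars': 10, 'non_ascii': 3, 'utf8_bytes': 13, 'utf16_bytes': 20}
```

The 7 ASCII letters cost 1 byte each in UTF-8 and the 3 others cost 2 bytes each, which gives 13; UTF-16 spends 2 bytes on all 10.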

Spanish generally has at most one accented vowel (á, ó, ü, é, ...) per word, and at most one ñ. German rarely has more than two umlauts per word, and almost never more than one ß.

UTF-16 is a wild pessimization for European languages, and UTF-8 is only slightly wasteful for Asian languages: most CJK characters take 3 bytes in UTF-8 instead of 2 in UTF-16.
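To put rough numbers on both directions, here is a small Python sketch that computes the UTF-8 to UTF-16 size ratio. The function name `utf8_to_utf16_ratio` and the Japanese sample are my own picks for illustration, not a benchmark; real text mixes in ASCII punctuation, digits, and markup, which shifts the ratio toward UTF-8.

```python
import unicodedata


def utf8_to_utf16_ratio(text: str) -> float:
    """Return UTF-8 size divided by UTF-16 size (below 1.0 means UTF-8 is smaller)."""
    text = unicodedata.normalize("NFC", text)
    utf8_size = len(text.encode("utf-8"))
    # "utf-16-le" avoids counting the 2-byte BOM added by plain "utf-16".
    utf16_size = len(text.encode("utf-16-le"))
    return utf8_size / utf16_size


# Illustrative samples only: one Danish word, one Japanese word.
for sample in ["blåbærgrød", "日本語"]:
    print(sample, f"{utf8_to_utf16_ratio(sample):.2f}")
# blåbærgrød 0.65
# 日本語 1.50
```

So the worst case for UTF-8 on pure CJK text is 1.5x, while pure ASCII text is a flat 2x worse in UTF-16.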