logoalt Hacker News

dmz73today at 4:08 AM1 replyview on HN

UTF8 is a horrible design. The only reason it was widely adopted was backwards compatibility with ASCII. There are large number of invalid byte combinations that have to be discarded. Parsing forward is complex even before taking invalid byte combinations in account and parsing backwards is even worse. Compare that to UTF16 where parsing forward and backwards are simpler than UTF8 and if there is invalid surrogate combination, one can assume it is valid UCS2 char.


Replies

moefhtoday at 4:38 AM

UTF-16 is an abomination. It's only easy to parse because it's artificially limited to 1 or 2 code units. It's an ugly hack that requires reserving 2048 code points ("surrogates") from the Unicode table just for the encoding itself.

It's also the reason why Unicode has a limit of about 1.1 million code points: without UTF-16, we could have over 2 billion (which is the UTF-8 limit).