Hacker News

cyberax, yesterday at 7:08 PM

UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).
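(A quick Python illustration of the BOM point; note that Python's utf-16 codec writes the platform's native byte order, little-endian here:)

    text = "héllo"
    print(text.encode("utf-16"))  # b'\xff\xfeh\x00\xe9\x00l\x00l\x00o\x00' -- BOM first, then 16-bit units
    print(text.encode("utf-8"))   # b'h\xc3\xa9llo' -- a plain byte stream, same everywhere, no BOM needed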

The only problem with UTF-8 is that Windows and Java were developed without knowledge of UTF-8 and ended up with 16-bit characters.

Oh yes, and Python 3 should have known better when it went through the string-bytes split.
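(To make that split concrete, a minimal sketch: str indexes by code point, bytes is the raw encoded form:)

    s = "naïve"
    b = s.encode("utf-8")
    print(len(s), len(b))  # 5 6 -- five code points, six bytes (ï is two bytes in UTF-8)
    print(s[2], b[2])      # ï 195 -- indexing str gives code points, indexing bytes gives raw integers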


Replies

wrs, yesterday at 7:18 PM

UTF-16 (strictly, UCS-2 at first) made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody", and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.
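(A rough sketch of what code could rely on under that assumption, using a little-endian UTF-16 buffer in Python:)

    buf = "hello".encode("utf-16-le")  # two bytes per character, in the pre-surrogate world
    def unit(i: int) -> str:
        return buf[2 * i : 2 * i + 2].decode("utf-16-le")  # O(1): just scale the index
    print(unit(3))  # 'l' -- valid only while 1 code unit == 1 character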

As Unicode (quickly) evolved, it turned out that not only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.
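(Concretely, in Python, whose str is effectively UTF-32-indexed:)

    s = "e\u0301"                  # 'e' + combining acute accent, renders as é
    print(len(s), s[0])            # 2 e -- two code points, and str[0] isn't the character you see
    flag = "\U0001F1FA\U0001F1F8"  # two regional-indicator code points, renders as one flag
    print(len(flag))               # 2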

So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.
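(Grapheme-cluster iteration, for instance, doesn't even live in most standard libraries. A sketch using the third-party regex module, whose \X pattern matches one extended grapheme cluster:)

    import regex  # pip install regex -- the third-party module, not the stdlib re
    s = "e\u0301" + "\U0001F1FA\U0001F1F8"
    print(regex.findall(r"\X", s))  # ['é', '🇺🇸'] -- 2 graphemes from 4 code points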

UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.
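(What UTF-8 gives you instead is self-synchronization: continuation bytes always match 0b10xxxxxx, so you can find the nearest character boundary from any byte offset without decoding from the start. A minimal sketch:)

    def char_start(buf: bytes, i: int) -> int:
        # back up while sitting on a continuation byte (0b10xxxxxx)
        while i > 0 and (buf[i] & 0xC0) == 0x80:
            i -= 1
        return i

    b = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
    print(char_start(b, 2))      # 1 -- offset 2 is the \xa9 continuation byte; é starts at offset 1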

wongarsu, yesterday at 7:29 PM

Yeah, Java and Windows NT 3.1 had really bad timing. Both managed to include Unicode despite starting development before the Unicode 1.0 release, but both added Unicode back when it was a 16-bit encoding and the need for something like UTF-8 was less clear.

KerrAvon, yesterday at 10:50 PM

NeXTstep was also UTF-16 through OpenStep 4.0, IIRC. Apple was later able to fix this because the string abstraction in the standard library was complete enough that no one actually needed to care about the internal representation, but the API still retains some UTF-16-specific weirdness.
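(The classic example of that weirdness: lengths and indices in such APIs count UTF-16 code units, so anything outside the BMP counts as two. Illustrated here in Python by computing the code-unit count those APIs report:)

    s = "\U0001D11E clef"                   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
    print(len(s))                           # 6 code points
    print(len(s.encode("utf-16-le")) // 2)  # 7 UTF-16 code units -- the clef is a surrogate pair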