Hacker News

mort96 · today at 10:29 AM · 2 replies

UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the Latin alphabet, so most glyphs in European text fall within the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's far less bloat than you imply.
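For concreteness, here is a quick Python check of the byte counts (the sample words are just illustrative):

```python
# UTF-8 byte counts for typical European text: ASCII letters cost 1 byte
# each, and only the occasional accented letter costs 2.
for word in ("fjord", "fjörd", "Grüße"):
    encoded = word.encode("utf-8")
    print(f"{word}: {len(word)} chars, {len(encoded)} bytes")
# fjord: 5 chars, 5 bytes
# fjörd: 5 chars, 6 bytes
# Grüße: 5 chars, 7 bytes
```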

Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts?
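The UTF-16 trade-off in that last question is easy to measure (sample strings are mine):

```python
# UTF-8 vs UTF-16 payload sizes: ASCII doubles under UTF-16, while
# BMP Han characters shrink from 3 bytes each to 2.
for text in ("hello world", "世界和平"):
    u8 = text.encode("utf-8")
    u16 = text.encode("utf-16-le")  # -le so the BOM isn't counted
    print(f"{text!r}: utf-8 {len(u8)} bytes, utf-16 {len(u16)} bytes")
# 'hello world': utf-8 11 bytes, utf-16 22 bytes
# '世界和平': utf-8 12 bytes, utf-16 8 bytes
```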


Replies

otabdeveloper4 · today at 2:30 PM

Unicode could have just been encoded statefully, with a "current code page" marker byte.

With UTF and emoji we can't have random access to characters anyway, so why not go the whole way?

thaumasiotes · today at 10:33 AM

> Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information?

Yes. Note that this is already how Unicode is supposed to work. See e.g. https://en.wikipedia.org/wiki/Byte_order_mark .

A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats.
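That point is easy to demonstrate: the same bytes decode to entirely different text depending on which code page you assume, and a BOM is just such an out-of-band hint carried in-band (codec names are Python's):

```python
# One byte sequence, three readings, depending on the assumed code page.
raw = bytes([0xE1, 0xE2, 0xE3])
print(raw.decode("koi8_r"))     # three Cyrillic letters
print(raw.decode("latin-1"))    # three accented Latin letters
print(raw.decode("iso8859_7"))  # three Greek letters

# A byte order mark is metadata smuggled into the data stream: the
# encoded bytes of U+FEFF reveal which UTF-16 endianness follows.
print("\ufeff".encode("utf-16-le"))  # b'\xff\xfe'
print("\ufeff".encode("utf-16-be"))  # b'\xfe\xff'
```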

> Most European languages use variations of the latin alphabet

If you want to interpret "variations of Latin" really, really loosely, that's true.

Cyrillic and Greek characters get two bytes each, even the ones whose glyphs are visually identical to ASCII characters. Relatively speaking, this bloat is worse than what you get by using UTF-8 for Japanese (2x over a one-byte code page, versus 1.5x over two-byte Shift-JIS); Cyrillic and Greek easily fit into one byte.
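The size difference is easy to check against single-byte legacy encodings (KOI8-R and ISO 8859-7 chosen here as examples):

```python
# Cyrillic and Greek each fit a one-byte legacy code page, but cost
# two bytes per letter in UTF-8.
russian, greek = "Привет", "γεια"
print(len(russian.encode("koi8_r")), len(russian.encode("utf-8")))  # 6 12
print(len(greek.encode("iso8859_7")), len(greek.encode("utf-8")))   # 4 8
```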
