zzo38computer • yesterday at 6:55 PM • 1 reply

> A file isn't meaningful unless you know how to interpret it; that will always be true.

There are multiple levels of meaning, though; character encoding is just one of them. For example, a text file might be plain text, HTML, JSON, C source code, etc.; a binary file might be DER, IFF, ZIP, etc.; and then there is the question of what kind of data a given JSON, DER, or IFF file contains and how that level of the data is interpreted.
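A minimal sketch of those levels, assuming a made-up byte string that happens to be UTF-8 encoded JSON (the `name` field and its content are purely illustrative). Each step only works because of an assumption that the bytes themselves do not state:

```python
import json

# Level 0: raw bytes, as they would sit on disk (hypothetical example data).
raw = b'{"name": "\xd0\x96"}'

# Level 1: character encoding. We assume UTF-8; nothing in the bytes says so.
text = raw.decode("utf-8")

# Level 2: file format. We assume the text is JSON rather than plain text,
# HTML, C source, and so on.
doc = json.loads(text)

# Level 3: application meaning. We assume "name" holds a display name;
# JSON itself has no opinion on that.
name = doc["name"]
print(repr(name))
```

Swapping the assumption at any one level (say, decoding as Latin-1 instead of UTF-8) changes the result even though the bytes are unchanged.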

> Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters.

Whether or not they are identical to ASCII characters depends on the character set and on other things, such as what they are being used for; the definition of "identical" is not as simple as you make it seem. Unicode defines them as not identical, which is appropriate for some uses but wrong for others. (Unicode also defines some characters as identical even though in some uses it would be more appropriate to treat them as distinct. So Unicode is bad in both directions.)
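To make the "not identical" half concrete, here is a small sketch using Python's standard `unicodedata` module. The three letters were picked only because they look alike in most fonts:

```python
import unicodedata

latin_a = "A"           # U+0041
cyrillic_a = "\u0410"   # U+0410, renders like Latin A in most fonts
greek_alpha = "\u0391"  # U+0391, likewise

for ch in (latin_a, cyrillic_a, greek_alpha):
    # Code point, official Unicode character name, and UTF-8 bytes.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))

# Unicode treats the three as distinct characters, so comparisons fail
# even though the glyphs are usually indistinguishable.
all_distinct = len({latin_a, cyrillic_a, greek_alpha}) == 3
print(all_distinct)
```

Whether that distinction is the right one depends on the use: it is convenient for case mapping and sorting within one script, and inconvenient if all you care about is the shape on the page.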

> This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.

I agree with that (although I think UTF-8 should not be used for Japanese either), but it isn't because of which characters are considered "identical" or not. There are problems with Unicode in general regardless of which encoding you use.
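For reference, the size difference being discussed can be checked with a quick sketch like the one below. The sample strings and the choice of legacy encodings (ISO 8859-5, ISO 8859-7, Shift JIS) are my own picks for illustration; other single-byte or double-byte encodings would give the same counts for these strings:

```python
# Sample text per script, paired with one legacy encoding that covers it.
samples = {
    "Cyrillic": ("Привет", "iso8859_5"),
    "Greek": ("αβγ", "iso8859_7"),
    "Japanese": ("日本語", "shift_jis"),
}

sizes = {}
for label, (text, legacy) in samples.items():
    utf8_len = len(text.encode("utf-8"))
    legacy_len = len(text.encode(legacy))
    sizes[label] = (utf8_len, legacy_len)
    print(f"{label}: {len(text)} chars, UTF-8 {utf8_len} bytes, {legacy} {legacy_len} bytes")
```

For these strings, Cyrillic and Greek double in size under UTF-8 (two bytes per letter instead of one), while the Japanese sample grows by half (three bytes per character instead of two). Real documents also contain ASCII spaces, digits, and markup, which take one byte either way, so the overall ratio is usually smaller.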

Replies

rmunn • yesterday at 11:22 PM

> ... (although I think UTF-8 should not be used for Japanese either) ...

The people putting up websites in Japanese disagree with you, it would seem. According to Wikipedia (in the Shift JIS article), as of March 2026, 99% of websites in the .jp domain were in UTF-8, with only 1% in Shift JIS.

Japan used to have two different encodings in common use: Shift JIS (usually used on Windows) and EUC-JP (more common on Unix servers). This resulted in characters being misinterpreted often enough that Japanese speakers coined the word mojibake to describe text coming out completely garbled. These days, it seems Japanese website makers are more than happy to accept a slight inefficiency in encoding size, because what they gain from that is never having to see mojibake again.
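If you want to see the effect for yourself, here is a rough sketch using Python's built-in `shift_jis` and `euc_jp` codecs. The sample string is my own choice, and `errors="replace"` is there only so the wrong decode produces garbage instead of raising an exception:

```python
original = "文字化け"  # "mojibake" written in Japanese

# Encode the way a Shift JIS system would have written the file.
sjis_bytes = original.encode("shift_jis")

# Decode with the wrong assumption (EUC-JP), as a mismatched reader would.
garbled = sjis_bytes.decode("euc_jp", errors="replace")

# Decode with the right assumption for comparison.
recovered = sjis_bytes.decode("shift_jis")

print(repr(garbled))
print(recovered == original)
```

The bytes are identical in both decodes; only the reader's assumption differs. With a single encoding used everywhere, that assumption can no longer be wrong.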
