I think more effort should have been made to live with 65,536 characters. My understanding is that c...

rowls66 • yesterday at 7:44 PM • 6 replies • view on HN

I think more effort should have been made to live with 65,536 characters. My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to unicode is going to be seen a big mistake. We already have enough network bandwith to just send raster graphics for images in most cases. Cluttering the unicode codespace with emojis is pointless.

Replies

jasonwatkinspdx • yesterday at 9:21 PM

You are mistaken. Chinese Hanzi and the languages that derive from or incorporate them require way more than 65,536 code points. In particular a lot of these characters are formal family or place names. USC-2 failed because it couldn't represent these, and people using these languages justifiably objected to having to change how their family name is written to suit computers, vs computers handling it properly.

This "two bytes should be enough" mistake was one of the biggest blind spots in Unicode's original design, and is cited as an example of how standards groups can have cultural blind spots.

➕ show 1 reply

gred • yesterday at 8:03 PM

> My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis

This week's Unicode 17 announcement [1] mentions that of the ~160k existing codepoints, over 100k are CJK codepoints, so I don't think this can be true...

[1] https://blog.unicode.org/2025/09/unicode-170-release-announc...

duskwuff • yesterday at 8:01 PM

Your understanding is incorrect; a substantial number of the ranges allocated outside BMP (i.e. above U+FFFF) are used for CJK ideographs which are uncommon, but still in use, particularly in names and/or historical texts.

mort96 • yesterday at 7:57 PM

The silly thing is, lots of emoji these days aren't even a single code point. So many emoji these days are two other code points combined with a zero width joiner. Surely we could've introduced one code point which says "the next code point represents an emoji from a separate emoji set"?

daneel_w • yesterday at 8:29 PM

I entirely agree that we could've cared better for the leading 16 bit space. But protocol-wise adding a second component (images) to the concept of textual strings would've been a terrible choice.

The grande crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification, where we already had a whooping 1.1 million code points at our disposal.

➕ show 1 reply

dudeinjapan • yesterday at 7:53 PM

CJK unification (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) i.e. combining "almost same" Chinese/Japanese/Korean characters into the same codepoint, was done for this reason, and we are now living with the consequence that we need to load separate Traditional/Simplified Chinese, Japanese, and Korean fonts to render each language. Total PITA for apps that are multi-lingual.

➕ show 1 reply

alt Hacker News

Replies