Hacker News

twoodfin · yesterday at 7:37 PM · 10 replies

UTF-8 is indeed a genius design. But of course it’s crucially dependent on the decision for ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.
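A quick Python sketch of that dependence (just an illustration): ASCII bytes always have the top bit clear, so UTF-8 can reserve every byte value with the top bit set for its multi-byte sequences.

    # ASCII characters keep the high bit clear and encode to a single byte;
    # every byte of a multi-byte UTF-8 sequence has the high bit set.
    for ch in ("A", "é", "€"):
        print(ch, [f"{b:08b}" for b in ch.encode("utf-8")])

    # A ['01000001']
    # é ['11000011', '10101001']
    # € ['11100010', '10000010', '10101100']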

Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.


Replies

mort96 · yesterday at 7:43 PM

I don't know if this is the reason or if the causality goes the other way, but: it's worth noting that we didn't always have 8 general purpose bits. 7 bits + 1 parity bit or flag bit or something else was really common (enough so that e-mail to this day still uses quoted-printable [1] to encode octets with 7-bit bytes). A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and wasn't always a given.

In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...

[1] https://en.wikipedia.org/wiki/Quoted-printable

[2] https://en.wikipedia.org/wiki/8-bit_clean
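A rough sketch of that 7-bit constraint in practice, using Python's stdlib quopri module (which implements the quoted-printable scheme from [1]): any byte with the high bit set is rewritten as =XX, so the result passes through a transport that only preserves 7-bit ASCII.

    import quopri

    original = "naïve café".encode("utf-8")   # contains bytes >= 0x80
    encoded = quopri.encodestring(original)   # high bytes become =C3=AF, =C3=A9, ...

    assert all(b < 0x80 for b in encoded)            # now pure 7-bit ASCII
    assert quopri.decodestring(encoded) == original  # and it round-trips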

jasonwatkinspdx · yesterday at 8:13 PM

Not an expert but I happened to read about some of the history of this a while back.

ASCII has its roots in teletype codes, which were a development from telegraph codes like Morse.

Morse code is variable length, which made automatic telegraph machines and teletypes awkward to implement. The solution was the 5-bit Baudot code. Using a fixed-length code simplified the devices, and operators could type Baudot code with one hand on a 5-key keyboard. Part of the code's design was to minimize operator fatigue.

Baudot code is why we refer to the symbol rate of modems and the like in Baud btw.

Anyhow, the next change came when, instead of telegraph machines directly signaling on the wire, a typewriter was used to create a punched tape of code points, which would then be loaded into the telegraph machine for transmission. Since the keyboard was now decoupled from the wire code, there was more flexibility to add additional code points. This is where codes like "Carriage Return" and "Line Feed" originate. This got standardized by Western Union and then internationally.

By the time we get to ASCII, teleprinters are common, and the early computer industry adopted punched cards pervasively as an input format. And they initially did the straightforward thing of just using the telegraph codes. But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.

So zooming out here the story is that we started with binary codes, then adopted new schemes as technology developed. All this happened long before the digital computing world settled on 8 bit bytes as a convention. ASCII as bytes is just a practical compromise between the older teletype codes and the newer convention.

cryptonector · today at 5:17 AM

Fun fact: ASCII was a variable-length encoding. No really! It was designed so that one could use overstrike to implement accents and umlauts, and also underline (which still works like that in terminals). I.e., á would be written a BS ' (or ' BS a), à would be written as a BS ` (or ` BS a), ö would be written as o BS ", ø as o BS /, ¢ as c BS |, and so on. The typefaces were designed to make this possible.

This lives on in compose key sequences, so instead of a BS ' one types compose-' a and so on.
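A tiny Python sketch of what those sequences look like as bytes (the same overstrike trick nroff/man output still uses for underline and bold):

    BS = "\b"  # backspace, 0x08

    a_acute   = "a" + BS + "'"   # á: letter, back up, apostrophe
    o_umlaut  = "o" + BS + '"'   # ö: letter, back up, double quote
    underline = "_" + BS + "w"   # underlined w, as in nroff/man page output

    for s in (a_acute, o_umlaut, underline):
        print(s.encode("ascii"))  # e.g. b"a\x08'" -- plain 7-bit ASCII throughout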

And this all predates ASCII: it's how people did accents and such on typewriters.

This is also why Spanish used to not use accents on capitals, and still allows capitals to not have accents: that would require smaller capitals, but typewriters back then didn't have them.

layer8 · yesterday at 8:39 PM

The use of 8-bit extensions of ASCII (like the ISO 8859-x family) was ubiquitous for a few decades, and arguably still is to some extent on Windows (the standard Windows code pages). If ASCII had been 8-bit from the start, but with the most common characters all within the first 128 integers, which would seem likely as a design, then UTF-8 would still have worked out pretty well.

The accident of history is less that ASCII happens to be 7 bits, but that the relevant phase of computer development happened to primarily occur in an English-speaking country, and that English text happens to be well representable with 7-bit units.

toast0 · yesterday at 8:13 PM

7 bits isn't that odd. Baudot was 5 bits and found insufficient, so 6-bit codes were developed; those were found insufficient too, so 7-bit ASCII was developed.

IBM had standardized 8-bit bytes on their System/360, so they developed the 8-bit EBCDIC encoding. Other computing vendors didn't have consistent byte lengths... 7 bits was weird, but characters didn't necessarily fit nicely into system words anyway.

colejohnson66 · yesterday at 7:42 PM

The idea was that the free bit would be repurposed, likely for parity.
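Something like this, for instance, if the spare bit carried even parity (a sketch of the general idea, not any particular standard):

    def add_even_parity(byte7: int) -> int:
        """Set bit 7 so the total number of 1-bits in the byte is even."""
        assert 0 <= byte7 < 0x80               # must already fit in 7 bits
        parity = bin(byte7).count("1") & 1
        return byte7 | (parity << 7)

    print(hex(add_even_parity(ord("A"))))  # 'A' = 0x41, two 1-bits  -> 0x41
    print(hex(add_even_parity(ord("C"))))  # 'C' = 0x43, three 1-bits -> 0xc3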

spydum · yesterday at 8:08 PM

https://www.sensitiveresearch.com/Archive/CharCodeHist/X3.4-...

Looks to me like serendipity: they thought 8 bits would be wasteful, since they didn't have a need for that many characters.

KPGv2 · yesterday at 8:08 PM

Historical luck. Though "luck" is probably pushing it, in the way one might say certain math proofs are historically "lucky" based on the work that came before them. It's more of an almost natural consequence.

Before ASCII there was BCDIC, which was six bits and non-standardized (there were variants, just like technically there are a number of ASCII variants, with the common one just referred to as ASCII these days).

BCDIC was the capital English letters plus common punctuation plus numbers. 2^6 is 64, and capital letters plus numbers give you 36; a few common punctuation marks put you around 50. IIRC the original by IBM was around 45 or something. Slash, period, comma, etc.

So when there was a decision to support lowercase, they added a bit because that's all that was necessary, and I think the printers around at the time couldn't print more than 128 distinct characters anyway. There was no printable ó or ö or anything like that, so why support them?

But eventually that yielded to 8-bit encodings (various extended ASCIIs like Latin-1, etc., that had ñ and so on).

Crucially, UTF-8 is only compatible with 7-bit ASCII. All those 8-bit ASCIIs are incompatible with UTF-8 because they use the eighth bit.
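A quick Python illustration of that last point: the Latin-1 byte for ñ has the eighth bit set, which UTF-8 reads as the start of a multi-byte sequence, so it isn't valid UTF-8 on its own.

    latin1 = "ñ".encode("latin-1")   # b'\xf1' -- one byte, high bit set
    utf8   = "ñ".encode("utf-8")     # b'\xc3\xb1' -- two bytes

    print(latin1.decode("utf-8", errors="replace"))  # '�': not valid UTF-8
    print(utf8.decode("utf-8"))                      # 'ñ'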

michaelsshaw · yesterday at 8:09 PM

I'm not sure, but it does seem like a great bit of historical foresight. It stands as a lesson to anyone standardizing something: wanna use a 32-bit integer? Make it 31 bits, just in case. Obviously this isn't always applicable (e.g. sizes, etc.), but the idea of leaving even the smallest amount of space for future extensibility is crucial.