3pt14159 | yesterday at 7:00 PM | 3 replies

I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.


Replies

linguae | yesterday at 7:07 PM

I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

acdha | today at 1:04 AM

I worked on a site in the late 90s which had news in several Asian languages, including both simplified and traditional Chinese. We had a partner in Hong Kong sending articles and, being a stereotypical monolingual American, I took them at their word that they were sending us simplified Chinese and had it loaded into our PHP app, which dutifully served it with that encoding. It was clearly Chinese, so I figured we had that feed working.

A couple of days later, I got an email from someone explaining that it was gibberish. Apparently our content partner, who claimed to be sending GB2312 simplified Chinese, was in fact sending us Big5 traditional Chinese, so while many of the byte values mapped to valid characters, the result was nonsensical.
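
For anyone who hasn't hit this, here is a minimal sketch of that failure mode, assuming Python 3 and its built-in `big5` and `gb2312` codecs. The sample string is hypothetical (not from the original feed), and `errors="replace"` is my choice so the wrong decode doesn't raise on byte pairs that GB2312 can't map.

```python
# Hypothetical sample headline in traditional Chinese (not from the real feed).
original = "中文新聞"

# What the partner actually sent: the text encoded as Big5.
raw = original.encode("big5")

# What the site assumed: GB2312. Some byte pairs map to valid but unrelated
# characters; unmappable bytes become U+FFFD because of errors="replace".
misread = raw.decode("gb2312", errors="replace")

# Decoding the same bytes with the correct codec recovers the text.
recovered = raw.decode("big5")

print(repr(misread))
print(recovered == original)
```

The bytes never change. Only the label attached to them does, which is why nothing in the pipeline complained until a reader did.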

glxxyz | yesterday at 7:01 PM

I worked on an email client. Many, many character set headaches.