Hacker News

rswail today at 8:01 AM (5 replies)

Things that have bugged me for 40 years...

* NUL terminated strings (and now, non UTF-8 encoded strings on input/output)

* Using LF, CR, or CRLF as line terminators, and pipe- or comma-delimited fields, when other unambiguous ASCII characters (e.g., GS, FS, RS) could have been used. That would have made encoding and decoding line termination an I/O concern and kept HT/VT/CR/LF/FF as purely print-related codes.
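
To illustrate the second point, here is a minimal Python sketch of framing tabular data with ASCII separator characters instead of commas and newlines. The choice of US (0x1F) between fields and RS (0x1E) between records is an assumption made for this example, and `encode` and `decode` are hypothetical helper names, not part of any library.

```python
# ASCII control characters reserved for data framing.
# Assumed roles for this sketch: US between fields, RS between records.
US = "\x1f"  # Unit Separator
RS = "\x1e"  # Record Separator


def encode(rows):
    """Join fields with US and records with RS; no quoting or escaping."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def decode(text):
    """Split on RS, then on US. An empty string decodes to no records."""
    if text == "":
        return []
    return [record.split(US) for record in text.split(RS)]


rows = [
    ["name", "note"],
    ["Ann", 'said "hi", then left'],   # commas and quotes need no escaping
    ["Bob", "line one\nline two"],     # neither does an embedded newline
]
assert decode(encode(rows)) == rows
```

Commas, quotes, and newlines pass through as ordinary data, so the only thing left to reject is a field that itself contains a separator.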


Replies

EvanAnderson today at 11:04 AM

I did a project to translate data framed in the ASCII field/record separator characters, and it was gloriously easy. All the ugly escaping considerations that come with comma-delimited data went away.

brewmarche today at 2:39 PM

Now with Unicode we actually have even more:

NL Next line (from EBCDIC?)

LS Line separator (invented by Unicode)

PS Paragraph separator (same)

The Unicode standard says that in addition to CR, LF, CRLF and the above, vertical tabs and form feeds should also be treated as line separators.
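
As a rough illustration, Python's built-in `str.splitlines()` follows a similar broad definition of a line boundary. The sketch below checks each terminator mentioned in this comment; the `TERMINATORS` table and the `split_on` helper are names made up for the example.

```python
# Terminators mentioned above, keyed by their usual abbreviations.
TERMINATORS = {
    "LF": "\n",
    "CR": "\r",
    "CRLF": "\r\n",
    "VT": "\x0b",
    "FF": "\x0c",
    "NEL": "\x85",    # Next Line
    "LS": "\u2028",   # Line Separator
    "PS": "\u2029",   # Paragraph Separator
}


def split_on(terminator):
    """Split the two-line string 'a<terminator>b' with str.splitlines()."""
    return ("a" + terminator + "b").splitlines()


for name, terminator in TERMINATORS.items():
    # Every entry in the table is treated as a single line boundary.
    assert split_on(terminator) == ["a", "b"], name
```

Note that `str.splitlines()` also treats the ASCII FS, GS, and RS characters as line boundaries, which matters if those characters are used to frame data.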

flohofwoe today at 10:36 AM

> non UTF-8 encoded strings on input/output

UTF-8 on stdin/stdout works perfectly fine (unless you are on Windows, of course, which is stuck in the early 90s when it comes to international text encoding).

> Using LF or CR or CRLF as line terminators

This is also an operating system convention, and it would be better if programming languages didn't try to "guess" the correct line endings, since this causes more problems than it solves - but again, this is mostly a Windows-specific problem, and it's Microsoft's job to finally bring Windows into the current century.
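
One concrete case of such guessing is Python's text mode, which by default translates CR and CRLF to LF on read. The sketch below, with `read_both_ways` as a hypothetical helper name, reads the same bytes with and without that translation; passing `newline=""` to `open()` turns it off.

```python
import os
import tempfile


def read_both_ways(raw: bytes):
    """Write raw bytes to a temporary file, then read them back as text twice."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Default text mode: \r\n and \r are both translated to \n.
        with open(path, encoding="utf-8") as f:
            translated = f.read()
        # newline="" disables translation; terminators come back untouched.
        with open(path, encoding="utf-8", newline="") as f:
            untouched = f.read()
    finally:
        os.remove(path)
    return translated, untouched


translated, untouched = read_both_ways(b"a\r\nb\rc\n")
assert translated == "a\nb\nc\n"
assert untouched == "a\r\nb\rc\n"
```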

Parodper today at 11:37 AM

LF makes the most sense, but they're all fine for text files. The issue is that CSV isn't text.

Last time I had to handle CSV files in bash, I converted them internally to RS and FS.
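
A conversion along these lines can be sketched in Python rather than bash. This is not the commenter's script: it assumes "FS" means "field separator", uses ASCII US (0x1F) between fields and ASCII RS (0x1E) between records, and `csv_to_separated` is a hypothetical helper name.

```python
import csv
import io

US = "\x1f"  # assumed field separator (ASCII Unit Separator)
RS = "\x1e"  # assumed record separator (ASCII Record Separator)


def csv_to_separated(csv_text: str) -> str:
    """Parse CSV once, then re-emit it with unambiguous separators."""
    # newline="" leaves line endings alone so the csv module can handle them.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return RS.join(US.join(row) for row in reader)


source = 'a,"x, y"\r\nb,"line1\nline2"\r\n'
converted = csv_to_separated(source)

# After conversion, plain string splitting is enough to recover the fields.
records = [record.split(US) for record in converted.split(RS)]
assert records == [["a", "x, y"], ["b", "line1\nline2"]]
```

The CSV quoting rules are applied only once, at the boundary; everything downstream can split on single characters.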

codedokode today at 12:10 PM

> non UTF-8 encoded strings on input/output

I would just use UTF-8 everywhere.