So many parser combinators operate on bytes assuming ASCII input only. I'd be more interested i...

zombot • today at 9:52 AM • 4 replies

So many parser combinators operate on bytes assuming ASCII input only. I'd be more interested in a parser combinator lib that has UTF-8 decoding already abstracted away, operating on `wchar_t`, or even polymorphic input stream element types.

Replies

lokeg • today at 10:12 AM

Isn't working with the UTF-8 stream sufficient? Especially if you only have ASCII keywords/operators/brackets, I feel an ASCII parser should work with UTF-8 out of the box.
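
To make that concrete, here is a minimal sketch in C, under the assumption that the grammar's delimiters are all ASCII. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so such bytes can never be mistaken for an ASCII operator or bracket. The name `ident_len` and the choice to treat every non-ASCII byte as part of an identifier are hypothetical, not taken from any particular library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the identifier at the start of s.
   Bytes >= 0x80 only ever occur inside multi-byte UTF-8 sequences, so they
   are passed through as identifier bytes without being decoded. */
static size_t ident_len(const char *s, size_t n)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < n) {
        unsigned char b = (unsigned char)s[i];
        int ascii_ident = (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') ||
                          (b >= '0' && b <= '9') || b == '_';
        if (!ascii_ident && b < 0x80)
            break; /* an ASCII operator, bracket, or space ends the identifier */
        i++;
    }
    return i;
}
```

Given the bytes of `naïve = 1`, this returns 6 (the `ï` takes two bytes), and the parser never had to know what code point those two bytes encode. It does not check that the non-ASCII bytes are well-formed UTF-8.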

Joker_vD • today at 11:33 AM

I'd rather not. Most of the time, you don't need it, and when you do, it's for a very small part of the input. And `wchar_t` is an abomination (it's UTF-32 on Linux, UTF-16 on Windows, and the standard allows both); you probably really want `char32_t`, and again, not for the whole of the input; streaming such data a single rune/code point at a time is probably fine as well for most uses.

On the other hand, if your parser combinators process char-by-char, then maintaining a small "is this valid UTF-8 so far" context on the side should be pretty simple, so providing it would be a useful option, but actually decoding? Please don't.
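
A rough sketch of what such a side context could look like in C. All names here (`utf8_ctx`, `utf8_feed`, `utf8_valid`) are made up for illustration. The state is just a count of continuation bytes still expected plus the allowed range for the next one, which is enough to reject overlong forms, surrogates, and values above U+10FFFF without ever producing a code point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical validation context: no decoded value is kept. */
typedef struct {
    unsigned char need;   /* continuation bytes still expected */
    unsigned char lo, hi; /* allowed range for the next continuation byte */
} utf8_ctx;

static void utf8_init(utf8_ctx *c) { c->need = 0; c->lo = 0x80; c->hi = 0xBF; }

/* Feed one byte; returns 1 if the input is still valid so far, 0 otherwise. */
static int utf8_feed(utf8_ctx *c, unsigned char b)
{
    assert(c != NULL);
    if (c->need == 0) {
        if (b < 0x80) return 1;                          /* ASCII */
        if (b >= 0xC2 && b <= 0xDF) { c->need = 1; return 1; }
        if (b >= 0xE0 && b <= 0xEF) {
            c->need = 2;
            if (b == 0xE0) c->lo = 0xA0;                 /* no overlong forms */
            if (b == 0xED) c->hi = 0x9F;                 /* no surrogates */
            return 1;
        }
        if (b >= 0xF0 && b <= 0xF4) {
            c->need = 3;
            if (b == 0xF0) c->lo = 0x90;                 /* no overlong forms */
            if (b == 0xF4) c->hi = 0x8F;                 /* nothing above U+10FFFF */
            return 1;
        }
        return 0;                                        /* invalid lead byte */
    }
    if (b < c->lo || b > c->hi) { utf8_init(c); return 0; }
    c->lo = 0x80; c->hi = 0xBF;                          /* later bytes: full range */
    c->need--;
    return 1;
}

/* Convenience wrapper: validate a whole buffer, including truncation at the end. */
static int utf8_valid(const char *s, size_t n)
{
    utf8_ctx c;
    utf8_init(&c);
    for (size_t i = 0; i < n; i++)
        if (!utf8_feed(&c, (unsigned char)s[i])) return 0;
    return c.need == 0;
}
```

A byte-oriented combinator could call `utf8_feed` on each byte it consumes and only look at the result in the parts of the grammar that care, such as string literals.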

eska • today at 2:20 PM

I’d still use a byte slice for that. Some formats may mix encodings, or have a text header and binary payload. For those cases one would need to use memchr for the first byte, then compare the remaining few bytes. So I don’t think it would be a huge performance impact.
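
Roughly what I mean, as a sketch in C. `find_token` is a hypothetical helper, and the only library calls are the standard `memchr` and `memcmp`. The token can be any byte sequence, for example the three UTF-8 bytes of a non-ASCII delimiter, and the buffer can contain arbitrary binary data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the first occurrence of tok in buf, or -1. */
static ptrdiff_t find_token(const char *buf, size_t len,
                            const char *tok, size_t tok_len)
{
    assert(buf != NULL && tok != NULL && tok_len > 0);
    const char *p = buf;
    const char *end = buf + len;
    while ((size_t)(end - p) >= tok_len) {
        /* Scan for the first byte, but only where the whole token still fits. */
        p = memchr(p, tok[0], (size_t)(end - p) - tok_len + 1);
        if (p == NULL)
            return -1;
        /* Candidate found: compare the remaining few bytes. */
        if (memcmp(p + 1, tok + 1, tok_len - 1) == 0)
            return p - buf;
        p++; /* false start, resume the scan one byte further on */
    }
    return -1;
}
```

The bulk of the work is the `memchr` scan, which libc implementations typically optimize heavily, so the extra comparison only runs on candidate positions.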

RossBencina • today at 1:22 PM

I'm not familiar with parser combinators. The parser generators that I'm familiar with (YACC, ANTLR3,5) parse a stream of lexemes/tokens, not characters. Is there a reason why combinators don't operate on lexemes?
