So many parser combinators operate on bytes assuming ASCII input only. I'd be more interested in a parser combinator lib that has UTF-8 decoding already abstracted away, operating on `wchar_t`, or even polymorphic input stream element types.
I'd rather not. Most of the time you don't need it, and when you do, it's only for a very small part of the input. And `wchar_t` is an abomination: it's 32 bits (UTF-32) on Linux and 16 bits (UTF-16) on Windows, and the standard allows both. You probably really want `char32_t`, and again, not for the whole input; streaming that data a single rune/codepoint at a time is probably fine for most uses.
On the other hand, if your parser combinators process input char by char, then maintaining a small "is this valid UTF-8 so far" context on the side should be pretty simple, so providing it would be a useful option. But actually decoding? Please don't.
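To make the "valid so far" idea concrete, here is a rough sketch of such a side context. It only tracks how many continuation bytes are still owed and which range the next one must fall in; it never builds a code point. The names `Utf8Check`, `push`, `at_boundary` and `valid_utf8` are made up for illustration and don't come from any particular library.

```rust
/// Incremental "is this valid UTF-8 so far" context (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Utf8Check {
    /// Continuation bytes still expected for the current code point.
    pending: u8,
    /// Inclusive range allowed for the next continuation byte.
    lo: u8,
    hi: u8,
}

impl Utf8Check {
    pub fn new() -> Self {
        Utf8Check { pending: 0, lo: 0x80, hi: 0xBF }
    }

    /// Feed one byte. Returns false as soon as the input cannot be valid
    /// UTF-8; the caller is expected to stop feeding after that.
    pub fn push(&mut self, b: u8) -> bool {
        if self.pending > 0 {
            if b < self.lo || b > self.hi {
                return false;
            }
            self.pending -= 1;
            // Only the first continuation byte has a restricted range.
            self.lo = 0x80;
            self.hi = 0xBF;
            return true;
        }
        // Lead byte: decide how many continuation bytes follow and which
        // range the first one must be in. The narrowed ranges reject
        // overlong forms, surrogates, and values above U+10FFFF.
        let (pending, lo, hi) = match b {
            0x00..=0x7F => (0, 0x80, 0xBF),
            0xC2..=0xDF => (1, 0x80, 0xBF),
            0xE0 => (2, 0xA0, 0xBF),
            0xED => (2, 0x80, 0x9F),
            0xE1..=0xEC | 0xEE..=0xEF => (2, 0x80, 0xBF),
            0xF0 => (3, 0x90, 0xBF),
            0xF1..=0xF3 => (3, 0x80, 0xBF),
            0xF4 => (3, 0x80, 0x8F),
            _ => return false,
        };
        self.pending = pending;
        self.lo = lo;
        self.hi = hi;
        true
    }

    /// True when the bytes seen so far end on a code point boundary.
    pub fn at_boundary(&self) -> bool {
        self.pending == 0
    }
}

/// Convenience wrapper: validate a whole slice with the incremental checker.
pub fn valid_utf8(bytes: &[u8]) -> bool {
    let mut check = Utf8Check::new();
    bytes.iter().all(|&b| check.push(b)) && check.at_boundary()
}
```

A byte-level combinator could carry one `Utf8Check` alongside its position and call `push` on each byte it consumes; that is three bytes of state and no decoding.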
I’d still use a byte slice for that. Some formats mix encodings, or have a text header and a binary payload. In those cases, to find a multi-byte character you’d memchr for its first byte, then compare the remaining few bytes, so I don’t think the performance impact would be huge.
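Roughly what I mean, as a sketch: search for the first byte of the encoded character, then check the few bytes after it. `find_encoded` is a name I made up, and the `position` scan is just a standard-library stand-in for a real memchr call so the snippet has no dependencies.

```rust
/// Find the first occurrence of `needle` (the UTF-8 bytes of one character)
/// in `haystack`, which may contain arbitrary binary data.
/// Hypothetical helper; an empty needle yields None.
pub fn find_encoded(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let (&first, rest) = needle.split_first()?;
    let mut start = 0;
    while start < haystack.len() {
        // Stand-in for memchr: scan for the first byte only.
        let hit = start + haystack[start..].iter().position(|&b| b == first)?;
        // Then compare the remaining few bytes.
        if haystack[hit + 1..].starts_with(rest) {
            return Some(hit);
        }
        start = hit + 1;
    }
    None
}
```

For example, `find_encoded(b"price: \xE2\x82\xAC42", "€".as_bytes())` finds the euro sign at byte offset 7, and nothing requires the rest of the buffer to be valid UTF-8.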
I'm not familiar with parser combinators. The parser generators that I'm familiar with (YACC, ANTLR3,5) parse a stream of lexemes/tokens, not characters. Is there a reason why combinators don't operate on lexemes?
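Isn't working with the UTF-8 stream sufficient? Especially if you only have ASCII keywords/operators/brackets, I feel an ASCII parser should work with UTF-8 out of the box, since every byte of a multi-byte sequence has its high bit set and can't be mistaken for an ASCII character.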
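A quick sketch of that, assuming a toy grammar where the only syntax is ASCII brackets and whitespace. The `tokenize` function is hypothetical; it compares bytes against ASCII values only and treats everything else as opaque word bytes, so non-ASCII identifiers pass through intact.

```rust
/// Split input on ASCII brackets and ASCII whitespace. Every other byte,
/// including all bytes of multi-byte UTF-8 sequences, is part of a word.
pub fn tokenize(input: &[u8]) -> Vec<&[u8]> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;
    for (i, &b) in input.iter().enumerate() {
        let is_bracket = matches!(b, b'(' | b')' | b'[' | b']');
        if is_bracket || b.is_ascii_whitespace() {
            // A delimiter ends the word in progress, if any.
            if let Some(s) = word_start.take() {
                tokens.push(&input[s..i]);
            }
            if is_bracket {
                tokens.push(&input[i..i + 1]);
            }
        } else if word_start.is_none() {
            word_start = Some(i);
        }
    }
    // Flush a trailing word that runs to the end of the input.
    if let Some(s) = word_start {
        tokens.push(&input[s..]);
    }
    tokens
}
```

Given valid UTF-8 input, each word slice is itself valid UTF-8, because the split points are always at ASCII bytes and those never occur inside a multi-byte sequence.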