> It is not OK to stop at the first sign of trouble, and return whatever maybe is right. “123timm...

derefr • yesterday at 7:41 PM • 0 replies • view on HN

> It is not OK to stop at the first sign of trouble, and return whatever maybe is right. “123timmy” is not a number, nor is the empty string.

None of the C functions referenced (atol, strtol, sscanf) are number-parsing functions per se. Rather, they're numeric-lexeme scanning+extraction functions.

These functions are all designed to avoid making any assumptions about the syntax of the larger document the numeric lexeme might be embedded in. You might, after all, be using a syntax where numbers can come with units on the end. Or you might be reading numbers as comma-separated values.

And, as a key point the author might be missing: C, in being co-designed with UNIX, offers primitives tuned for the context of:

- writing UNIX CLI tools that work with unbounded streams of input (i.e. piped output from other UNIX CLI tools),

- where, crucially, the stream is just text, and so carries no TLV-esque framing protocol to tell you the definitive length of a thing;

- and nor (especially in early memory-constrained systems) are you able to perform allocations of heap memory in order to employ an unbounded growable buffer for retaining the current lexeme until you do reach the end of it (which, if you could, would let you use a scanner state-machine that doubles as a parser/validator, returning either a parsed value or an error)

- but instead, to deal with the 1. unbounded input, 2. of textual encoding, 3. in constant memory, you must eagerly scan the input stream (i.e. synchronously reduce over each received byte, or at most each fixed-length N-byte chunk using a static or stack-allocated fixed-length buffer, discarding the original string bytes once reduced-over) to produce lexically-decoded (but not parsed/validated) lexemes; and then do this again, on a higher level, feeding your stream of lexemes into a fixed-sized sum-typed ring-buffer (i.e. an array-of-union-typed-lexeme-struct-type-entries), where you can then invoke a function that attempts to scan over + consume them (but unlike the original stream-parsing function, doesn't consume the buffer unless successful, and so isn't functioning as a scanner per se, but rather as an LR parser.)

If you're not writing UNIX CLI tools, direct use of the C-stdlib numeric-lexeme scan functions is operating on the wrong abstraction layer. What you want, if you have pre-framed strings that are "either valid numbers or parse errors", is to implement an actual parsing function... that can then invoke these numeric-lexer functions to do the majority of its work.

And if you're writing C, and yet you're not in UNIX-pipeline unbounded-text-stream land, but rather are parsing well-defined bounded-length "documents" (like, say, C source files)... then you probably want to use a real lexer-generator (like flex) to feed a parser-generator (like yacc/bison). Where:

- you'd validate the token in context, in the parsing phaase;

- and your lexing rules would make certain classes of input invalid at lexing time. (E.g. you can write your lexeme matching rules such that multi-digit numbers with leading zeroes, or floating-point values with no digits before/after the decimal place, simply aren't "numbers" from your lexer's perspective.)

...which means that, once again, you can "get away with" invokeing the regular C numeric-lexeme scanner functions; i.e. `yylval = atoi(yytext);` in bison terms. (And you'd want to, since doing so saves memory vs. keeping the numbers around as strings.)

alt Hacker News