One of the first homework assignments when I learned C back in '83 was after a long lecture on how the string functions are fundamentally broken, and the class introduction to writing C was fixing all of them.
Why not look at how other languages attack this? E.g., how does "42".parse() work in Rust?
Edit: https://doc.rust-lang.org/src/core/num/mod.rs.html#1537
interesting! It boils down to this
pub const fn from_ascii_radix(src: &[u8], radix: u32) -> Result<u32, ParseIntError> {
    use self::IntErrorKind::*;
    use self::ParseIntError as PIE;
    // guard: radix must be 2..=36
    if 2 > radix || radix > 36 {
        from_ascii_radix_panic(radix);
    }
    if src.is_empty() {
        return Err(PIE { kind: Empty });
    }
    // Strip leading '+' or '-', detect sign
    // (a bare '+' or '-' with nothing after it is an error)
    // accumulate digits, checking for overflow
    Ok(result)
}
I remember an old project that ran into something like this. I think we just used atoi() or similar and the error check was a string comparison between the original input and a sprintf() of the converted value.
Ugly (and not performant if in a hot path) but it works.
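For what it's worth, the round-trip check described above could look roughly like this. It's only a sketch: parse_roundtrip is a made-up name, and it uses strtol() rather than atoi() so that out-of-range input clamps instead of being undefined.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 and stores the value in *out only if
   printing the converted number reproduces the input exactly. */
int parse_roundtrip(const char *s, long *out) {
    char buf[32];                   /* room for a 64-bit long, sign and NUL */
    long v = strtol(s, NULL, 10);   /* clamps on overflow, unlike atoi() */
    snprintf(buf, sizeof buf, "%ld", v);
    if (strcmp(buf, s) != 0)
        return 0;                   /* "", "12abc", "007", "+5", " 5", overflow */
    *out = v;
    return 1;
}
```

Note that this is stricter than most people expect: it also rejects "+5", "007" and "-0", because none of them print back the same way.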
I thought it was pretty well known that everything related to strings in C stdlib (including all str... functions) is bad. You just need to bring in your own string library.
Can't you just:
for (int i = 0; i < len(characters); i++)
{
    if (characters[i] - 48 <= 9 && characters[i] - 48 >= 0)
    {
        ret = ret * 10 + characters[i] - 48;
    }
    else
    {
        return ERROR;
    }
}
return ret;
Adjust until it actually works, but you get the picture.
This is not a hard thing to do without using a library. The code below is easily adapted to the unsigned case and/or arbitrary base rather than 10.
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: require one numeric argument\n");
        return 1;
    }
    char *nump = argv[1];
    unsigned neg = 0;
    unsigned long long ures = 0;
    if (*nump == '-') {
        neg = 1;
        nump = nump + 1;
    }
    if (!*nump) {
        fprintf(stderr, "require non empty string\n");
        return 1;
    }
    char b;
    while ((b = *nump++)) {
        if (b >= '0' && b <= '9') {
            unsigned long long digit = (unsigned long long) (b - '0');
            /* Check before multiplying: comparing the wrapped result
               against ures misses most overflows of ures * 10. */
            if (ures > (~0ULL - digit) / 10) {
                fprintf(stderr, "overflow in '%s'\n", argv[1]);
                return 1;
            }
            ures = ures * 10 + digit;
        } else {
            if (b >= ' ') {
                fprintf(stderr, "invalid char '%c' in '%s'\n", b, argv[1]);
            } else {
                fprintf(stderr, "invalid byte '%d' in '%s'\n", b, argv[1]);
            }
            return 1;
        }
    }
    long long res;
    if (neg) {
        if (ures <= 0x8000000000000000ULL) {
            /* Negate without ever forming -LLONG_MIN, which would overflow. */
            res = ures ? -(long long) (ures - 1) - 1 : 0;
        } else {
            fprintf(stderr, "underflow in '%s'\n", argv[1]);
            return 1;
        }
    } else if (ures > 0x7FFFFFFFFFFFFFFFULL) {
        fprintf(stderr, "overflow in '%s'\n", argv[1]);
        return 1;
    } else {
        res = (long long) ures;
    }
    fprintf(stdout, "result: %lld\n", res);
    return 0;
}
One of the great virtues of C is that this sort of thing is not part of the language ...
Another case many integer parsing functions get wrong is that they interpret a leading 0 as an octal indicator.
That should be opt-in via a flag, if it needs to be supported at all. Unix file permissions are the only deliberate use of octal I've ever seen.
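To make the octal surprise concrete: with strtol(), passing base 0 turns on prefix auto-detection, while base 10 treats leading zeros as padding. The two wrapper names below are only for illustration.

```c
#include <stdlib.h>

/* Base 0 asks strtol to guess the base from the prefix:
   "0x" means hexadecimal, a leading "0" means octal. */
long parse_autodetect(const char *s) {
    return strtol(s, NULL, 0);
}

/* Base 10 treats leading zeros as plain padding. */
long parse_decimal(const char *s) {
    return strtol(s, NULL, 10);
}
```

With these, "010" comes back as 8 from parse_autodetect and as 10 from parse_decimal. Worse, "08" auto-detects as octal, stops at the '8', and quietly returns 0.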
The problem is that float parsing is highly non-trivial if you want it to be correct for all edge cases.
For integers, you're faster (in both development time and runtime) to write your own parser than to try and assemble the pieces in this pile of shit into a half-working one.
C++17 from_chars excluded. Incidentally, 2022 seems about right for the year that ONE open source implementation finally actually implemented the float part of that. Or was it more like 2024?
Can't you use a regex to check that the given string contains only digits, and then use any of the provided methods? Then check whether the returned value is a number, to cover edge cases.
OK, having a method that does that for you would be nice, but the post reads as if the problem is that the standard library doesn't provide a method that behaves exactly as you want.
And yet, thousands and thousands of 'C' programs parse integers every hour successfully.
Perhaps the right title should be "No way to parse pathological edge cases in 'C'"
And then see how other languages do.
There's no one correct way to parse integers. Do you want to support 0x prefixes? Is a leading zero an indicator of octal, a zero-padded decimal, or a syntax error? Are you willing to accept a leading "+"? Is leading whitespace OK? Trailing whitespace? Is 0x0c whitespace? What about all the weird Unicode ones? Do you allow exponential notation (1e1)? Etc, etc.
In every language, the standard library makes some assumptions about this. In JavaScript, Number("") converts an empty string to zero.
The standard C library, which dates back to the stone age, does the simplest thing you can do without range checking, because, well, that's kinda the C paradigm. If you want parsing that handles edge cases in a specific way, you do it yourself. It's just digits.
> It is not OK to stop at the first sign of trouble, and return whatever maybe is right. “123timmy” is not a number, nor is the empty string.
None of the C functions referenced (atol, strtol, sscanf) are number-parsing functions per se. Rather, they're numeric-lexeme scanning+extraction functions.
These functions are all designed to avoid making any assumptions about the syntax of the larger document the numeric lexeme might be embedded in. You might, after all, be using a syntax where numbers can come with units on the end. Or you might be reading numbers as comma-separated values.
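A small sketch of that "scan a lexeme, leave the rest alone" usage, relying on strtol()'s end pointer. The names scan_number and sum_csv are invented for the example, and range errors are deliberately ignored here.

```c
#include <stdlib.h>

/* Hypothetical helper: scans one decimal lexeme from the front of s,
   stores its value in *value, and returns a pointer to the first
   unconsumed character (equal to s if no digits were found). */
const char *scan_number(const char *s, long *value) {
    char *end;
    *value = strtol(s, &end, 10);
    return end;
}

/* Hypothetical helper: sums comma-separated numbers, stopping at the
   first thing that is neither a number nor a comma. */
long sum_csv(const char *s) {
    long total = 0, v;
    for (;;) {
        const char *rest = scan_number(s, &v);
        if (rest == s)
            break;                  /* no number at this position */
        total += v;
        if (*rest != ',')
            break;                  /* end of the list */
        s = rest + 1;
    }
    return total;
}
```

Scanning "12px" gives the value 12 and a pointer to "px", so the caller decides what a unit suffix means. Likewise sum_csv("1,2,3") walks the separators itself.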
And, as a key point the author might be missing: C, in being co-designed with UNIX, offers primitives tuned for the context of:
- writing UNIX CLI tools that work with unbounded streams of input (i.e. piped output from other UNIX CLI tools),
- where, crucially, the stream is just text, and so carries no TLV-esque framing protocol to tell you the definitive length of a thing;
- nor (especially in early memory-constrained systems) are you able to perform allocations of heap memory in order to employ an unbounded growable buffer for retaining the current lexeme until you do reach the end of it (which, if you could, would let you use a scanner state-machine that doubles as a parser/validator, returning either a parsed value or an error)
- but instead, to deal with the 1. unbounded input, 2. of textual encoding, 3. in constant memory, you must eagerly scan the input stream (i.e. synchronously reduce over each received byte, or at most each fixed-length N-byte chunk using a static or stack-allocated fixed-length buffer, discarding the original string bytes once reduced-over) to produce lexically-decoded (but not parsed/validated) lexemes; and then do this again, on a higher level, feeding your stream of lexemes into a fixed-sized sum-typed ring-buffer (i.e. an array-of-union-typed-lexeme-struct-type-entries), where you can then invoke a function that attempts to scan over + consume them (but unlike the original stream-parsing function, doesn't consume the buffer unless successful, and so isn't functioning as a scanner per se, but rather as an LR parser.)
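As a rough illustration of the first level only (reducing over each byte in constant memory), a digit scanner over a stream might look like the sketch below. scan_ulong is a hypothetical name, and the overflow policy (keep consuming the digit run, then report failure) is just one possible choice.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: consumes a run of decimal digits from `in`, one
   byte at a time, keeping only the running value. Returns 1 on success,
   0 if no digit was found, -1 on overflow (the digit run is still
   consumed in full). */
int scan_ulong(FILE *in, unsigned long *out) {
    unsigned long acc = 0;
    int c, seen = 0, overflow = 0;
    while ((c = fgetc(in)) != EOF && c >= '0' && c <= '9') {
        unsigned long d = (unsigned long) (c - '0');
        if (overflow || acc > (ULONG_MAX - d) / 10)
            overflow = 1;           /* remember it, keep draining digits */
        else
            acc = acc * 10 + d;
        seen = 1;
    }
    if (c != EOF)
        ungetc(c, in);              /* leave the delimiter for the caller */
    if (overflow)
        return -1;
    if (!seen)
        return 0;
    *out = acc;
    return 1;
}
```

Memory use stays fixed no matter how many digits arrive, so a billion nines on stdin just yields -1 after the run has been drained.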
If you're not writing UNIX CLI tools, direct use of the C-stdlib numeric-lexeme scan functions is operating on the wrong abstraction layer. What you want, if you have pre-framed strings that are "either valid numbers or parse errors", is to implement an actual parsing function... that can then invoke these numeric-lexer functions to do the majority of its work.
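Such a parsing function can be quite short. Here is one possible sketch on top of strtol(); parse_long_strict is a made-up name, and accepting a leading "+" while rejecting leading whitespace is a policy assumption, not something the standard dictates.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 on success, -1 unless the whole string
   is an optional sign followed by decimal digits that fit in a long. */
int parse_long_strict(const char *s, long *out) {
    char *end;
    if (*s == '\0' || isspace((unsigned char) *s))
        return -1;                  /* strtol would silently skip whitespace */
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')
        return -1;                  /* no digits at all, or trailing junk */
    if (errno == ERANGE)
        return -1;                  /* value does not fit in a long */
    *out = v;
    return 0;
}
```

With this, "123timmy", the empty string and a bare "-" all fail, while "123" and "-123" succeed.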
And if you're writing C, and yet you're not in UNIX-pipeline unbounded-text-stream land, but rather are parsing well-defined bounded-length "documents" (like, say, C source files)... then you probably want to use a real lexer-generator (like flex) to feed a parser-generator (like yacc/bison). Where:
- you'd validate the token in context, in the parsing phase;
- and your lexing rules would make certain classes of input invalid at lexing time. (E.g. you can write your lexeme matching rules such that multi-digit numbers with leading zeroes, or floating-point values with no digits before/after the decimal place, simply aren't "numbers" from your lexer's perspective.)
...which means that, once again, you can "get away with" invoking the regular C numeric-lexeme scanner functions; i.e. `yylval = atoi(yytext);` in bison terms. (And you'd want to, since doing so saves memory vs. keeping the numbers around as strings.)
... say users of the only language with no way to parse integers.
:)
As a C programmer, I find this kind of bad faith article very irritating.
Yes, the standard library is bad. This is by far the worst part of the C legacy. But it is not that hard to write your own.
String functions like this are not difficult at all, and you can use better naming and semantics, write faster code etc.
C is not the C standard library, ffs.
I wasn't in this class myself, but one prof at my alma mater started his "Programming 201" class with the simplest assignment: write a C program that accepts two integers from the user and prints their sum. It actually was the only assignment for the rest of the semester, since he had a test suite that would humiliate the students gently at first, but would ultimately pipe a billion nines into stdin as the first argument.