Hacker News

Everything in C is undefined behavior

468 points by lycopodiopsida today at 6:07 AM | 612 comments

Comments

muvlon today at 8:33 AM

Yes there is tons of surprising and weird UB in C, but this article doesn't do a great job of showcasing it. It barely scratches the surface.

Here's a way weirder example:

  volatile int x = 5;
  printf("%d in hex is 0x%x.\n", x, x);
This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.

So in common parlance, a "data race" means concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!
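
A minimal sketch of the usual way around this, assuming you only need one consistent value: read the volatile object once into a plain local and pass the local twice. The helper name `format_hex` and its buffer parameters are hypothetical, made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: formats the current value of *src into buf.
 * The volatile object is read exactly once, so the snprintf call below
 * contains no volatile access at all and nothing is left unsequenced. */
int format_hex(char *buf, size_t size, const volatile int *src)
{
    int snapshot = *src;  /* the single volatile read */
    return snprintf(buf, size, "%d in hex is 0x%x.",
                    snapshot, (unsigned)snapshot);
}
```

The cast to `unsigned` is there because `%x` expects an `unsigned int` argument.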

beeforpork today at 8:22 AM

The UB in unaligned pointers is even worse: an unaligned pointer in itself is UB, not only an access through it. So even implicitly converting a void*v to an int*i (like 'i=v' in C or 'f(v)' when f() accepts an int*) is UB if the resulting pointer is not correctly aligned for int.

It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
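
One way to stay inside the rules, sketched under the assumption that you just want the `int` value stored at some arbitrary byte address: never form the `int*` at all and copy the bytes instead. `load_int` is a hypothetical name, not something from the article.

```c
#include <string.h>

/* Hypothetical helper: reads an int stored at an arbitrary byte address.
 * No int* is ever created from the possibly misaligned address; the bytes
 * are copied into a properly aligned local object instead. */
int load_int(const void *bytes)
{
    int value;
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

Mainstream compilers typically turn this `memcpy` into a plain load where the target allows it, so the defined version usually costs nothing.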

quelsolaar today at 7:36 AM

The 5 stages of learning about UB in C:

- Denial: "I know what signed overflow does on my machine."

- Anger: "This compiler is trash! Why doesn't it just do what I say!?"

- Bargaining: "I'm submitting this proposal to WG14 to fix C..."

- Depression: "Can you rely on C code for anything?"

- Acceptance: "Just don't write UB."

greysphere today at 7:24 AM

The examples aren't really undefined behavior. They are examples that could become UB based on input/circumstances. If you are going to be that generous, every function call is UB because it could exceed stack space, which is basically true in any language (up to the equivalent definition of UB in that language). I feel like C has enough actual rough edges that deserve attention that sensationalism like this muddies folks' attention (particularly novices') and can end up doing more harm than good.

bestouff today at 6:46 AM

The problem with UB is not really that it may crash on some architecture. The real problem is that the compiler expects UB code to NOT happen, so if you write UB code anyway the compiler (and especially the optimizer) is allowed to translate that to anything that's convenient for its happy path. And sometimes that "anything" can be really unexpected (like removing big chunks of code).
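
A small sketch of how that plays out, using the classic "check after dereference" pattern. Both function names are hypothetical, and whether a given compiler actually deletes the check depends on the compiler and optimization level.

```c
#include <stddef.h>

/* Risky: *p is evaluated before the test. Dereferencing a null pointer is
 * UB, so an optimizer may assume p != NULL and drop the check below. */
int first_or_default_risky(const int *p, int fallback)
{
    int v = *p;
    if (p == NULL)
        return fallback;
    return v;
}

/* Safer: the check comes before the dereference, so there is no path on
 * which UB occurs and nothing for the optimizer to "assume away". */
int first_or_default(const int *p, int fallback)
{
    if (p == NULL)
        return fallback;
    return *p;
}
```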

parasti today at 9:02 AM

I have never in my 20 years of writing C heard so much about undefined behavior as I have in the past 6 months on Hacker News. It has never entered the conversation. You write the code. If it doesn't work, you debug it and apply a fix or a workaround. Why does the idea of undefined behavior in C get to the front page so consistently?

jb1991 today at 10:30 AM

Some of the C++ code in this article has not been idiomatic in over a decade, and would be considered a code smell today. The language has evolved into quite a different language than when it was first created. As soon as I saw all of those raw pointers and direct pointer access, it was clear that at least part of this article should be taken with a grain of salt.

The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.

pizlonator today at 5:11 PM

The problem is incorrectly assuming that the spec is meaningful in some kind of rigorous way.

It’s not. All that matters is what C compilers actually do and what real C programs expect.

This is a good thing. It creates a culture where the two sides meet each other where they’re at.

debugnik today at 7:23 AM

As much as I agree with the intro, these examples aren't good and the overall article is just a veil for pushing LLM coding.

maple3142 today at 8:52 AM

Is this a correct understanding of UB in C? A program P has a set of inputs A that do not trigger UB, and a complementary set of inputs B that do trigger UB. A correct compiler compiles P into an executable P'. For all inputs in A, P' should behave the same as P. However, for any input in B, there are absolutely no requirements on the behavior of P'.
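
To make the A/B split concrete, here is a sketch for `int` division, where set B is small enough to enumerate. `checked_div` is a hypothetical helper that rejects inputs in B instead of dividing.

```c
#include <limits.h>
#include <stdbool.h>

/* For int division, the UB-triggering inputs (set B) are d == 0 and
 * INT_MIN / -1. Every other pair of operands is in set A. */
bool checked_div(int n, int d, int *out)
{
    if (d == 0 || (n == INT_MIN && d == -1))
        return false;  /* input is in B: refuse instead of dividing */
    *out = n / d;      /* input is in A: behavior is fully defined */
    return true;
}
```

With the guard in place the program has no inputs left in B for this operation, so the compiler's freedom never comes into play.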

rom1v today at 9:07 AM

A concrete example of undefined behavior caused by an unaligned pointer: https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on...

hunterpayne today at 9:10 PM

What all these C programmers are pointing out is twofold:

- Making a Turing machine have deterministic and predictable results is hard.

- Modern hardware is complex and getting all hardware to behave the same way requires a strong mathematical abstraction.

C was never intended to be a fully defined mathematical abstraction. It was a language which was easy to write a compiler for. That's its original strength. Trying to make it something it isn't is the problem. Either choose a language which does have such abstractions or understand the drawbacks of the tool you are using.

Right tool for the right job.

psim1 today at 6:54 PM

I like the ideas of this article but would not use SPARC as a main badguy in my examples. A naive and probably popular takeaway would be, "Thank goodness I am not writing for SPARC and don't need to worry about these SPARC architectural concerns!"

__0x01 today at 6:51 AM

> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.

The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.

LLM generated code will eventually contain UB.

EDIT: added "eventually"

rurban today at 7:42 AM

Very bad advice. Of course good new LLMs know about UB, but you still need to use UBSan (i.e. -fsanitize=undefined), and not your LLM.

mjs01 today at 9:02 AM

Integer promotion seems to be the source of a lot of signed integer overflow UB. Why does C have it? Does integer promotion ever have a good part?
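
A minimal sketch of the kind of case meant here, assuming a 32-bit `int`: two `uint16_t` operands look unsigned, but both are promoted to signed `int` before the multiplication. `mul_u16` is a hypothetical helper.

```c
#include <stdint.h>

/* Assuming a 32-bit int: plain a * b would promote both operands to
 * (signed) int, and 65535 * 65535 overflows it, which is UB. Converting
 * one operand to uint32_t first makes the multiplication unsigned. */
uint32_t mul_u16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}
```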

casey2 today at 11:32 PM

And that's a good thing. UB is another mechanism to speed up the development of compilers. Many other languages fall into the trap of over-defining while we lack the methods to solve such problems cleanly (believe me, the modern C++ people have tried). Usually this is the case because they believe strongly that their methods work despite evidence.

As for UB, the compiler has the final say. Nobody should write nontrivial C without understanding their compiler, the same as nobody should write C without understanding their text editor.

Code in other languages breaks between versions; in C there are projects with code from every version at once!

Looking at it another way, work put into a C compiler enables you to write nontrivial code.

JonChesterfield today at 11:45 AM

Well, you can't write malloc in conforming C, which hurts rather more than remembering to write bitcast as memcpy on char pointers.

Doesn't matter though because you aren't writing standards conforming C. You're writing whatever dialect your compilers support, and that's probably (modulo bugs) much better behaved than the spec suggests.

Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.

The type aliasing rules are the only ones that routinely cause me much annoyance in C and there's always a workaround, whether it's the launder intrinsic used to implement C++, the may_alias attribute or in extremis dropping into asm. So they're a nuisance not a blocker.
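
For readers who haven't seen the "bitcast as memcpy" idiom mentioned above, a minimal sketch. `float_bits` is a hypothetical name, and the code assumes `float` and `uint32_t` have the same size, which is common but not guaranteed by the standard.

```c
#include <stdint.h>
#include <string.h>

/* Copies the object representation of a float into a uint32_t instead of
 * reading the float through a uint32_t*, which would violate the aliasing
 * rules. Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t float_bits(float f)
{
    uint32_t bits;
    _Static_assert(sizeof bits == sizeof f, "float is assumed to be 32 bits");
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

What the resulting bits mean depends on the platform's floating-point format; the function only guarantees that the copy itself is defined.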

bkallus today at 3:45 PM

> the OpenBSD project has not been very receptive in the past for bug reports, my sense of “this is probably fine, in practice”, and that if OpenBSD wants to weed out UB from their code base, then that’s a major project that should be done in a better way than me just being the middle man between the LLM and them for a patch here and there.

Part of the reason for all the UB in OpenBSD is that UBSan doesn't run on that platform. When I ported OpenBSD's httpd to Linux, I found that UBSan tripped before the server even came up because the config flag parsing shifts into the MSB of a signed integer.

I tried to contribute back a patch (just make the flag bitfield unsigned), but it was ignored. I think if UBSan ran natively on OpenBSD, then there would be a lot more of these patches, and the maintainers would have to take an official stance on whether they think these bugs matter.
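
A sketch of the general shape of that fix (not the actual httpd code, which isn't shown here): build flag masks from an unsigned 32-bit one instead of a signed `int`. `flag_bit` is a hypothetical helper.

```c
#include <stdint.h>

/* Hypothetical flag helper. With a 32-bit int, 1 << 31 shifts into the
 * sign bit, which is UB; the same shift on an unsigned 32-bit value is
 * well defined. */
uint32_t flag_bit(unsigned n)
{
    return UINT32_C(1) << (n & 31u);  /* mask keeps the shift count in range */
}
```

The `& 31u` is an extra design choice in this sketch: shifting by 32 or more is also UB, so the count is wrapped rather than trusted.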

weinzierl today at 6:42 AM

A fun one that'd fit the list would be sequence point violations like

    i = i++
commandlinefan today at 6:32 PM

A lot of this stems from trying to insist that char just means "small" and not "8 bits" and that int means "bigger than that" and not "32 bits". In fairness, K&R dealt with an era where 9 bit architectures existed, but char is 8 bits now. Everywhere.

codeflo today at 8:50 AM

> The compiler, and really the underlying hardware too, is playing a game of telephone with your UB intentions.

The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.

lelanthran today at 9:10 AM

I read through this in detail... Is it just me, or are these things that are invoked by intentionally bypassing the typing?

I mean, you have to go out of your way and use a cast to get the UB in the first example.

For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.

For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.

> For all you know the compiler has no internal way to even express your intention here.

I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?

> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.

I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.

I think only the final one is of note (the 24-bit shift assigned to a uint64_t).
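
On the float -> int point: the UB arises when the truncated value does not fit in the target type, and a range check before the cast avoids it. A minimal sketch, assuming a two's complement `int`; `float_to_int` is a hypothetical name.

```c
#include <limits.h>
#include <stdbool.h>

/* Converts f to int only when the truncated value is representable.
 * Assumes a two's complement int, so -(float)INT_MIN is exactly
 * INT_MAX + 1 and the comparison is not affected by rounding.
 * NaN fails both comparisons and is rejected. */
bool float_to_int(float f, int *out)
{
    if (!(f >= (float)INT_MIN && f < -(float)INT_MIN))
        return false;
    *out = (int)f;  /* truncates toward zero */
    return true;
}
```

Comparing against `(float)INT_MAX` instead would be subtly wrong with a 32-bit `int`, because that value rounds up to 2^31 when converted to `float`.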

amiga386 today at 12:05 PM

Can anyone explain why this is undefined behaviour? UBSan calls it "indirect call of a function through a function pointer of the wrong type"

    struct foo {int i;};
    int func(struct foo *x) {return x->i;}
    int main() {
        int (*funcptr)(void*) = (int (*)(void*)) &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }
While this is all kosher per the language lawyers:

    struct foo {int i;};
    int func(void *x) {return ((struct foo *)x)->i;}
    int main() {
        int (*funcptr)(void*) = &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }
akiarie today at 8:30 AM

C is still, by far, the simplest language that we have.

Although many newer languages are safer (with the exclusion of Rust, primarily by being slower) the same kinds of issues that are there in C are there in these languages, their effects are just harder to see.

People complain about C as though they know how to fix it.

tomcam today at 3:38 PM

I fear I will be downvoted into oblivion but I also want to learn from this.

First let me state the case for C. It’s meant to be used as a systems language that’s as close to assembly as possible while remaining portable (compared to assembly). As such it’s the first high-level language developed for any new processor.

Given the above predicate: Isn’t everything described in the article as it should be?

Add too much to the language and it becomes less possible to implement on new architectures, right? Because the undefined behavior lets implementors stand up new compilers fairly quickly.

For less undefined behavior isn’t it better to use languages that have that in their DNA? D, Zig, Go, Java, etc?

wyldfire today at 11:42 AM

Maybe we should criminalize writing articles about Undefined Behavior that have a "So what do we do now?" subheader but omit any mention of UBSan.

keyle today at 9:46 AM

When talking UB, putting C and C++ in the same basket is basically like comparing drunk driving a car and riding a bicycle sober... Both means of transport, very different experience.

sltr today at 12:57 PM

For a deep dive on UB with printf, see https://srs.fyi/see-conversions/

> When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).

danborn26 today at 11:06 AM

The scariest part is how many production systems rely on undefined behavior without anyone knowing until a compiler update breaks everything.

1vuio0pswjnm7 today at 1:30 PM

"My point is that ALL nontrivial C and C++ code has UB."

Is "nontrivial" defined?

How would one identify "nontrivial" C code?

Is there an objective measure (defined)?

Or is it a matter of personal opinion that could vary from person to person (undefined)?

bvrmn today at 9:16 AM

I really like Zig's approach to UB, especially that alignment is part of the type, and all those wordy builtins for conversions. Staring at them makes you wonder what you are doing wrong with your data model if it now requires 3 lines of casting expressions.

elnatro today at 10:30 AM

Is there a way to avoid undefined behavior in C then? Could we write a new C compiler that adds some checks and fixes (e.g. raise documented exceptions) to each undefined behavior?

kajaktum today at 1:58 PM

I want a language that is a group of bits (0, 1) and the xor operator. Everything else is built on top of that.

fjfaase today at 9:19 AM

Is comparing a signed integer with an unsigned integer UB? I recently wrote some code and compiled it with gcc to x86_64 (without optimization) that returned an incorrect answer.
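
For context, a mixed comparison like this is not UB, but the usual arithmetic conversions turn the signed operand into an unsigned one first, which is a likely source of an "incorrect" answer. A minimal sketch with hypothetical function names, assuming `int` and `unsigned` operands:

```c
#include <stdbool.h>

/* What `a < b` actually computes when a is int and b is unsigned: a is
 * converted to unsigned first, so -1 becomes UINT_MAX. Defined behavior,
 * but the result is false for a == -1, b == 1. */
bool less_as_compiled(int a, unsigned b)
{
    return (unsigned)a < b;
}

/* Handles the negative case before the conversion can change the value. */
bool less_checked(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}
```

GCC and Clang can flag the implicit form with `-Wsign-compare`.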

0x20cowboy today at 7:43 PM

Life is undefined behaviour.

veltas today at 6:43 AM

From the ANSI C standard:

  3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements.  Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).
Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear, that undefined behavior means you're doing something not intended or specified by the language, but that the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.

By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.

raluk today at 6:49 AM

In C / C++ there are two kinds of undefined behaviour. One is where the standard explicitly says that something is UB. The other is everything else that the standard does not define.

y42 today at 10:02 AM

shameless plug, it's part of the Nerd Encyclopedia: it's also called "nasal demons".

https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...

QuiEgo today at 1:57 PM

C does not abstract differences in underlying hardware well. Systems programmers know if they have an architecture that can't handle unaligned accesses or that the address they are doing load/stores from is a mmio register. Systems programmers know the difference between a virtual address and a physical address and have debugged MPU faults or MMU table walks and page faults more times than they want to think about.

C is horrible for trying to write a portable user-mode program in 2026. There are lots of better options.

C is great for writing low-level system code where you need to optimize performance down to the last cycle. It not abstracting away the hardware is super important for some use cases. A classic example is all of the platform-specific flavors of memcpy in the Linux kernel that are C/assembly hybrids hand-optimized for the SIMD pipelines of some CPUs.

C is a tool, Rust is a tool, Java is a tool, Python is a tool. Use the right tool for the job ¯\_(ツ)_/¯.

el_pollo_diablo today at 3:59 PM

> probably meaning on an address that’s a multiple of sizeof(int), but who knows

Sigh. s/sizeof(int)/_Alignof(int)/.

There are good reasons for an implementation to have sizeof(int) = _Alignof(int) and not a mere multiple of it, but if you are going to discuss subtle points and UB, just stick to the language guarantees.

> But let’s say you have a modern machine, where NULL is a pointer to address zero, and you actually have an object there.

You don't program in C on such a machine. Or maybe memory is virtualized, and it does not matter that your object lives at physical address zero, as long as you can map a non-zero virtual address to it.

> So how do you print an uid_t?

    if ((uid_t)-1 < (uid_t)0) {
        // uid_t is signed
        printf("%" PRIdMAX, (intmax_t)id);
    } else {
        // uid_t is unsigned
        printf("%" PRIuMAX, (uintmax_t)id);
    }
> It’s not rare for the denominator to come from untrusted input.

It's not rare for the array index to come from untrusted input.

It's not rare for the supposedly valid UTF-8 string to come from untrusted input.

...

Why single out division? This problem affects every partially defined operation. In the case of division at least, everyone learned in school that thou shalt not divide by zero. Adding two untrusted integers and forgetting that signed overflow is UB, not defined as a modulo? Your average programmer is much less likely to see that coming.
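
For the addition case, a minimal portable sketch: test the operands against the limits before adding, so the addition itself can never overflow. `checked_add` is a hypothetical helper.

```c
#include <limits.h>
#include <stdbool.h>

/* Returns false when a + b would overflow int; otherwise stores the sum.
 * The range tests only use subtractions that cannot overflow themselves. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}
```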

    > unsigned char a = 0xff;
    > unsigned char b = 1;
    > unsigned char zero = 0;
    > bool overflowed = (a + b) == zero;
    >
    > unsigned char a = 0x80;
    > uint64_t b = a << 24;
Please. Convert your operands to wide enough types before the operation. Convert your results back to narrow enough types to compensate for integer promotion to wider types than you would have liked. Do that consistently, and you're good.

Here:

    unsigned char a = 0xff;
    unsigned char b = 1;
    unsigned char zero = 0;
    bool overflowed = (unsigned char)(a + b) == zero;

    unsigned char a = 0x80;
    uint64_t b = (uint32_t)a << 24;
justmarc today at 9:58 AM

The art is actually making sure it all stays defined behavior

alper today at 9:35 AM

Isn't the article mostly saying that SPARC sucks?

saltyoldman today at 5:51 PM

Probably not "everything": the vast, vast, vast majority of everything you are looking at on your screen right now is written in C.

DostLeFan today at 12:51 PM

Very interesting article. I'm in love with C++, and I cannot say that I'm a good developer, but it's interesting to discover where UB can be. (Sorry, I'm not a good English speaker)

dmitrygr today at 6:41 AM

I stopped reading about here:

    > bool parse_packet(const uint8_t* bytes) {
    >   const int* magic_intp = (const int*)bytes;   // UB!
Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but casting itself is not, since uint8_t is ~ char and char* can be cast to and from any type.

You might try to argue that uint8_t is not necessarily char. It is true that implementations of C can exist where CHAR_BIT > 8, but those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as I can tell. Of course CHAR_BIT is required to be >= 8, so if it is not > 8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid -- it is)

up2isomorphism today at 1:41 PM

You just need to read the title and 5 lines to know this must be a Rust guy.

stackedinserter today at 12:31 PM

How can it be a valid implementation of isxdigit?

    int isxdigit(int c) {
        if (c == EOF) {
            return false;
        }
        return some_array[c];
    }

If you write code like this, then everything in programming is UB.
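
For context, the standard puts the burden on the caller here: the `<ctype.h>` functions require an argument that is either `EOF` or representable as `unsigned char`, and any other value is UB, which is what lets a library get away with a bare table lookup. A minimal sketch of the caller side, with `is_hex_char` as a hypothetical wrapper:

```c
#include <ctype.h>
#include <stdbool.h>

/* A plain char may be negative on platforms where char is signed, so it
 * is converted to unsigned char before being passed to isxdigit. */
bool is_hex_char(char ch)
{
    return isxdigit((unsigned char)ch) != 0;
}
```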
