Systems programmers love to hate on unsigned integers. Generations have been infected with the Java world model that integers have to be pretend number lines centered on zero. Guess what, you still have boundary conditions to deal with. There are times when you really, really need the full word range without negative values. This happens more often in low-level programming and on machines with small word sizes, something fewer people are engaged in these days. It doesn't need to be the default: Ada sequesters them as modular types, but they're available when needed.
Finally a language doing the right thing :)
My two rules of thumb for C code are:
1. use signed integers for everything except bit-wise operations and modulo math (e.g. "almost always signed")
2. make implicit sign conversion an error via `-Werror -Wsign-conversion`
The problem with making sizes and indices unsigned (even if they can't be negative) is that you might want to add negative offsets to them, and that either requires explicit casting in languages without implicit signed/unsigned conversion (additional hassle, reduced readability), or is a footgun in languages with implicit sign conversion.
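In Rust, for example, the explicit route looks roughly like this (a sketch using `usize::checked_add_signed`; the names `idx` and `off` are just illustrative):

```rust
fn main() {
    // An unsigned index and a signed offset (names are just illustrative).
    let idx: usize = 5;
    let off: isize = -3;
    // No `idx + off`: mixing the types takes an explicit, fallible step.
    assert_eq!(idx.checked_add_signed(off), Some(2));
    // Going below zero is surfaced as None instead of wrapping around.
    assert_eq!(2usize.checked_add_signed(-3), None);
}
```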
>If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts. With C’s loose semantics, the problem is largely swept under the rug, but for Rust it meant that you’d regularly need to cast back and forth when dealing with sizes.
TBH I've had very little struggle with this at all. As long as you keep your values and types separate, the unsigned type that you got a number from originally feeds just fine into the unsigned type that you send it to next. Needing casting then becomes a very clear sign that you're mixing sources and there be dragons, back up and fix the types or stop using the wrong variable. It's a low-cost early bug detector.
Implicitly casting between integer types though... yeah, that's an absolute freaking nightmare.
It's not really signed vs unsigned that's the issue, IMO. It's (mostly, in C) undefined behavior and implicit conversions?
I'm not sure Go is saner just because len is an int. Well, maybe, depending on how you look at it. Defining len to be signed int, means the largest valid len is half your address space, which also means half of all possible indexes are always invalid; which makes some things easier.
But it's really that integer arithmetic is not undefined behavior regardless of signedness, that bounds are checked, and that even indexing your slice with an int64 on a 32-bit CPU does the full correct bounds check. In fact, you can use any integer type as an index.
Given all of the above, whether you index with a uint or an int makes little practical difference. Either way, the bounds check is a single unsigned `< len` compare (despite the fact that len is signed).
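That single-compare trick can be sketched in Rust (my illustration of the idea, not Go's actual implementation):

```rust
// One unsigned compare covers both bounds: a negative i becomes a huge
// u64 after the cast, so it fails `< len` just like an index past the
// end does.
fn in_bounds(i: i64, len: usize) -> bool {
    (i as u64) < (len as u64)
}

fn main() {
    assert!(in_bounds(0, 10));
    assert!(in_bounds(9, 10));
    assert!(!in_bounds(10, 10)); // past the end
    assert!(!in_bounds(-1, 10)); // negative: -1 as u64 is u64::MAX
}
```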
What's really painful, is trying to handle a full 32-bit address space with 32-bit addresses and sizes, like in Wasm; you need 33-bit math. So in a sense, limiting sizes to 31-bit (signed) does help. But at the language level, IMO, the rest matters more.
> But what about the range? While it’s true that you get twice the range, surprisingly often the code in the range above signed-int max is quite bug-ridden. Any code doing something like (2U * index) / 2U in this range will have quite the surprise coming.
Alas, (2S * signed_index) / 2S will similarly result in surprises the moment the signed_index hits half the signed-int max. There's no free lunch when trying to cheat the integer ranges.
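Here's a Rust sketch of both surprises, using explicit wrapping ops to stand in for C's wraparound (the values are chosen just above the respective halfway points):

```rust
fn main() {
    // Unsigned: pick an index just above half the u32 range.
    let index: u32 = u32::MAX / 2 + 1; // 2^31
    let roundtrip = 2u32.wrapping_mul(index) / 2; // 2*index wraps to 0
    assert_eq!(roundtrip, 0); // nowhere near `index`

    // Signed: the same trick breaks just above half of i32::MAX.
    let signed_index: i32 = i32::MAX / 2 + 1; // 2^30
    let roundtrip = 2i32.wrapping_mul(signed_index) / 2; // 2*x wraps to i32::MIN
    assert_eq!(roundtrip, i32::MIN / 2); // not `signed_index` either
}
```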
> The former is easier to define, but has the downside of essentially “silencing warnings”. Let’s say the code was originally written to cast an u16 to u32, but later the variable type changes from u16 to u64 and the cast is now actually silently truncating things. Here we have casts becoming a sort of “silence all warnings”.
Well … we even mention Rust in the paragraph right before this. In Rust, you can widen a u16 to a u32 this way:
`let bigger: u32 = x.into();`
or `let bigger = u32::from(x);`
The conversion `from` is infallible, because a u16 always fits in a u32. There is no `from(u64) -> u32`, because, as the article notes, that would truncate; so if we did change the type to u64, the code would now fail to compile, and we'd be forced to figure out what we want to do here. (There are fallible conversions too, in the form of `try_from`, which can do u64 → u32 but returns an error if the conversion fails.)
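A runnable sketch of both directions (assuming a plain `rustc`-compilable file):

```rust
use std::convert::TryFrom;

fn main() {
    let x: u16 = 500;
    let bigger: u32 = u32::from(x); // infallible: every u16 fits in a u32
    assert_eq!(bigger, 500);

    // The fallible direction: u64 -> u32 can truncate, so it's try_from.
    assert_eq!(u32::try_from(42u64), Ok(42u32));
    assert!(u32::try_from(5_000_000_000u64).is_err()); // doesn't fit
}
```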
Similarly, for,
for (uint x = 10; x >= 0; x--) // Infinite loop!
This is why I think implicit wrapping is a bad idea in language design. Even Rust went down the wrong path (in my mind) there, and I think has worked back towards something safer in recent years. But Rust provides a decent example here too; this is pseudo-code: for (uint x = 10; x.is_some(); x = x.checked_sub(1))
Where `checked_sub` returns `None` instead of wrapping, providing us a means to detect the stopping point. So, something like that. (Though you'd probably also want to destructure the option into the uint for use inside the loop.) Of course, higher-level stuff always wins out here, I think, and in Rust you wouldn't write the above; instead something like, for x in (0..=10).rev()
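Here's a runnable version of that sketch, with the range-based rewrite as the reference result:

```rust
fn main() {
    // Count 10 down to 0 without wrapping: checked_sub yields None at 0.
    let mut x: Option<u32> = Some(10);
    let mut visited = Vec::new();
    while let Some(v) = x {
        visited.push(v);
        x = v.checked_sub(1); // None instead of wrapping past zero
    }
    // Same result as the idiomatic iterator form:
    let idiomatic: Vec<u32> = (0..=10).rev().collect();
    assert_eq!(visited, idiomatic);
}
```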
(And even then, if we need indexes, usually one would prefer to iterate through a slice or something like that. The higher-level concept of iterators usually dispenses with most or all uses of indexes, and in the rare cases where they're needed, most languages provide something like `enumerate` to get them from the iterator.)

I might be a contrarian in that I actually like using unsigned integers for sizes and indexes. In my experience, most of their pitfalls can be prevented by treating any subtraction involving them as a `reinterpret_cast`: i.e.
* Do your utmost to rewrite the code to avoid the subtraction (e.g. reorder inequalities to transform subtractions into additions).
* If that's not possible, think very hard about every possible edge case: you almost certainly need an additional `if` to deal with them.
* When analyzing other people's code during troubleshooting or merge reviews, assume any formula involving an unsigned integer and a minus sign is wrong.
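A small Rust sketch of the first rule, turning `a - b > c` into the underflow-free `a > b + c` (hypothetical helper names; assumes `b + c` itself can't overflow):

```rust
// Hypothetical predicate, written with a subtraction: for unsigned
// values it underflows whenever b > a.
fn bad(a: u32, b: u32, c: u32) -> bool {
    a.wrapping_sub(b) > c // wraps to a huge value when b > a
}

// Reordered as an addition: algebraically the same inequality, but no
// underflow (assuming b + c itself fits in a u32).
fn good(a: u32, b: u32, c: u32) -> bool {
    a > b + c
}

fn main() {
    // Agreement in the normal case:
    assert!(bad(20, 5, 10) && good(20, 5, 10));
    // Divergence when the subtraction would go negative:
    assert!(bad(1, 5, 10));   // underflow makes this (wrongly) true
    assert!(!good(1, 5, 10)); // the reordered form is correct
}
```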
I am personally moving in the opposite direction. I haven't meaningfully used a signed integer in years, and I see signed integers as being for more niche use-cases. I mainly only use signed types when I want to do a "signed shift right". If there were a >>> operator in Zig I wouldn't even think of signed integers.
Given your examples, I think you'd have fewer issues if you were working with unsigned integers exclusively. Although I'm curious about what other code you were referencing with this: "But seeing how each change both made the code easier to reason about and more correct, I couldn’t deny the evidence."
With regards to modulo, in Zig if you try to use it with a signed integer it will tell you to specify whether you want `@mod` or `@rem` semantics. In my case, I'd almost never write `x % 2`, I'd write `x & 1`. I do use unsigned division but I'd pretty much never write code that would emit the `div` instruction.
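For comparison, Rust draws roughly the same line: `%` is `@rem`-like (the sign follows the dividend), while `rem_euclid` is `@mod`-like for a positive divisor:

```rust
fn main() {
    // `%` truncates toward zero, like @rem: the sign follows the dividend.
    assert_eq!(-7 % 2, -1);
    // `rem_euclid` is @mod-like for a positive divisor: never negative.
    assert_eq!((-7i32).rem_euclid(2), 1);
    // For unsigned operands the two agree, which is part of the appeal:
    assert_eq!(7u32 % 2, 7u32.rem_euclid(2));
}
```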
I'm not saying you're wrong though! Everyone has a different mind. If you attain higher correctness and understandability through using signed integers, that's great. I'm just saying I'm in the opposite camp.
I know language designers have a lot of trade-offs to consider... But I would say if you know a value will logically always be >= 0, better to have a type that reflects that.
The potential bugs listed would be prevented if, for example, "x--" wouldn't compile without explicitly supplying a case for x == 0, or if you had to use a more verbose method like "decrement_with_wrap".
The trade-off is losing C-like concision, but the code is safer and more explicit.
So his compiler cannot detect the unsigned overflows and instead chooses to call it a user mistake!
Sizes and indices of course need to be unsigned, and any self-respecting compiler should warn about dangerous usage.
Is the text on this page really #bbbdc3 on #ffffff? How is anyone supposed to be able to read that?
> If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts.
I don’t really get this claim. Indexing should just look up the element corresponding to the value provided. It’s easy to come up with semantics that are intuitive and sound, even if signed integers or ones smaller than size_t are used.
I don’t understand how dealing with numbers correctly is not a solved problem in computer engineering by now.
I don't get it. Is this a parody of poor design decisions?
Sure, it's possible to write bugs in C. And if you really want to, you can disable the compiler warnings which flag tautologous comparisons and mixed-sign comparisons (a common reason for doing this is to avoid spurious warnings in generic-type code).
But, uhh, "people can deliberately write bugs" has got to be the weakest justification I've ever seen for changing a language feature -- especially one as fundamental as "sizes of objects can't be negative".
It seems like they've identified common bug patterns in C that would have been ameliorated by using signed types, but come to the wrong conclusion: that signed is the correct answer, rather than that C is poorly designed for making the broken code the easy option.
Fix the language. Don't hack around it by using the wrong type.
I hate using languages that only have signed integers. Using integers that can’t be negative fits many problems nicely and avoids the edge case of having to check for negative.
Signed quantities are a good default, and are easier to deal with when doing subtractions and mixing integers of different widths. (And "integers" includes pointers here, so it's very hard not to end up with different widths.)
However unsigned integers are still very useful, I'd say essential, in low-level programming: for example when doing buffer management and memory allocation, etc.

Don't forget that the signed vs. unsigned distinction is in some sense artificial. Machines put the distinction in the CPU instructions themselves; they don't track a "signed" property as part of values, and it can make sense to use the same value in different ways. However, C and many other languages decided to put the tag on the type, so operator syntax can be agnostic to signedness and the compiler will choose the appropriate CPU instruction.
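A tiny Rust illustration of that point: the bits are identical, and only the type decides which conversion and comparison semantics apply:

```rust
fn main() {
    // The machine stores the same 32 bits either way; only the type
    // decides whether they mean 4294967295 or -1.
    let bits: u32 = 0xFFFF_FFFF;
    assert_eq!(bits as i32, -1);
    assert_eq!((-1i32) as u32, u32::MAX);
    // Where the choice of CPU instruction shows: the comparisons differ.
    assert!(bits > 0);          // unsigned compare: a huge value
    assert!((bits as i32) < 0); // signed compare: a negative value
}
```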