jp1016 • yesterday at 2:22 PM • 5 replies

Banning "length" from the codebase and splitting the concept into count vs size is one of those things that sounds pedantic until you've spent an hour debugging an off-by-one in serialization code where someone mixed up "number of elements" and "number of bytes." After that you become a true believer.

The big-endian naming convention (source_index, target_index instead of index_source, index_target) is also interesting. It means related variables sort together lexicographically, which helps with grep and IDE autocomplete. Small thing but it adds up when you're reading unfamiliar code.
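
A quick way to see the sorting effect, as a sketch. Only `source_index` and `target_index` come from the comment above; the `_offset` names are made up for illustration.

```python
# Hypothetical variable names; only source_index/target_index appear in the thread.
big_endian_names = ["target_offset", "source_index", "target_index", "source_offset"]
little_endian_names = ["offset_target", "index_source", "index_target", "offset_source"]

# A lexicographic sort groups names by whichever word comes first.
big_endian_sorted = sorted(big_endian_names)
little_endian_sorted = sorted(little_endian_names)

print(big_endian_sorted)     # everything about "source" ends up adjacent
print(little_endian_sorted)  # grouped by "index"/"offset" instead
```

Whichever word leads the name is the one that sorting, grep prefixes, and autocomplete will group on.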

One thing I'd add: this convention is especially valuable during code review. When every variable that represents a byte quantity ends in _size and every item count ends in _count, a reviewer can spot dimensional mismatches almost mechanically without having to load the full algorithm into their head.
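
To make the count vs size distinction concrete, here is a minimal sketch of the kind of serialization code where the two get mixed up. The wire format (a 4-byte little-endian element count followed by packed 32-bit values) and all names are assumptions for illustration, not anything from the linked article.

```python
import struct

values = [10, 20, 30]  # three unsigned 32-bit integers

value_count = len(values)                # number of elements
value_size = struct.calcsize("<I")       # bytes per element
payload_size = value_count * value_size  # number of bytes

# The header stores the element count; the body holds the packed elements.
header = struct.pack("<I", value_count)
payload = struct.pack(f"<{value_count}I", *values)
message = header + payload


def decode(message: bytes) -> list[int]:
    """Read the count from the header, then unpack that many elements."""
    header_size = struct.calcsize("<I")
    (value_count,) = struct.unpack_from("<I", message, 0)
    return list(struct.unpack_from(f"<{value_count}I", message, header_size))
```

With the suffixes in place, a line like `struct.pack("<I", payload_size)` in the header stands out in review, because a `_size` is going where the format expects a `_count`.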

Replies

Shish2k • yesterday at 6:50 PM

> When every variable that represents a byte quantity ends in _size and every item count ends in _count, a reviewer can spot dimensional mismatches almost mechanically

At that point I'd rather make them separate data types, and have the compiler spot mismatches actually-mechanically o.o
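
A rough sketch of the separate-types idea. Python has no compiler to reject the mismatch, so this version fails at runtime instead; in a statically typed language the same mistake would not build. `ByteSize`, `ItemCount`, and `times` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteSize:
    """A quantity measured in bytes (hypothetical wrapper type)."""
    value: int

    def __add__(self, other):
        if not isinstance(other, ByteSize):
            return NotImplemented  # refuse to add anything that is not bytes
        return ByteSize(self.value + other.value)


@dataclass(frozen=True)
class ItemCount:
    """A number of elements (hypothetical wrapper type)."""
    value: int

    def __add__(self, other):
        if not isinstance(other, ItemCount):
            return NotImplemented  # refuse to add anything that is not a count
        return ItemCount(self.value + other.value)

    def times(self, item_size: ByteSize) -> ByteSize:
        # Count multiplied by per-item size is the one sanctioned conversion.
        return ByteSize(self.value * item_size.value)


header_size = ByteSize(4)
item_count = ItemCount(3)
total_size = header_size + item_count.times(ByteSize(4))

try:
    header_size + item_count  # bytes + count: dimensional mismatch
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
```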

jkaptur • yesterday at 7:39 PM

Canonical essay on this sort of technique: https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...

layer8 • yesterday at 6:32 PM

> big-endian naming

I would call it “English naming” [0], it’s just more readable to start with, in an anglophone environment.

[0] as opposed to “naming, English”, I suppose ;)

zahlman • yesterday at 7:56 PM

I've always understood "length" to mean what the author calls "count", and would never expect it to refer to byte size; as far as I can tell, it never did. Size is a design-time consideration; caring about it in the code is an exceptional case, for applications like (as you mention) serialization. So that's what deserves the dedicated term. "Length" refers specifically to a total number of elements in many languages preceding Rust.

For that matter, many languages, especially "object-oriented" ones, treat heterogeneous containers as the default. They might not even offer native containers that can store everything inline in a single contiguous allocation, except perhaps for strings. In which case, "number of bytes" is itself ambiguous; are you including the indirected objects or not?

"Count" is also overloaded — it commonly means, and I normally only understand it to mean, the number of elements in a collection meeting some condition. Hence the `.count` method of Python sequences, as well as the jargon "population count" referring to the number of set bits in an integer. Today, Python's integers have both a `.bit_count` and a `.bit_length`, and it's obvious what both of them do; calling either `.bit_size` would be confusing in my mental framework, and a contradiction in terms in the OP's.

I would disagree that even C's `strlen` refers to byte size. C comes from a pre-Unicode world; the type is called `char` because that was naively considered sufficient at the time to represent a text character. (Unicode is still in that sense naive, but it at least allows for systems that are acutely aware of the distinction between "characters" and graphemes.) But notice: C's "strings" aren't proper objects; they're null-terminated sequences, i.e. their length is signaled in-band. So that metadata is also just part of the data, in a single allocation with no indirection; the "size" of a string could only reasonably be interpreted to include that null terminator. Yet the result of `strlen` excludes it! Further, if `strlen` is used on a string that was placed within some allocated buffer, it knows nothing about that buffer.
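
The three quantities in play here can be seen side by side from Python via `ctypes`, as a small sketch. The 16-byte buffer is an arbitrary choice, and `len(text_buffer.value)` stands in for what `strlen` would report, since `.value` stops at the first null byte.

```python
import ctypes

# A 16-byte buffer holding a null-terminated C string (buffer size is arbitrary).
text_buffer = ctypes.create_string_buffer(b"hello", 16)

text_length = len(text_buffer.value)       # bytes before the terminator, like strlen
stored_size = text_length + 1              # the string including its null terminator
buffer_size = ctypes.sizeof(text_buffer)   # the allocation the string lives in

terminator = text_buffer.raw[text_length]  # the in-band end marker

print(text_length, stored_size, buffer_size)
```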

(Similarly, Rust `str::len` is properly named by this scheme. It gives the number of valid 1-byte-sized elements in a collection, not the byte size of the buffer they're stored within. It's still ambiguous in a sense, but that's because of the convention of using UTF-8 to create an abstraction of "character" elements of non-uniform size. This kind of ambiguity is properly resolved either with iterators, like the `Chars` iterator in Rust, or with views.)
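
The same ambiguity, sketched in Python for comparison. Python's `len` on a `str` counts code points, so the byte figure that Rust's `str::len` would report only shows up after explicitly encoding to UTF-8.

```python
text = "na\u00efve"  # "naïve", with a precomposed i-diaeresis (U+00EF)

code_point_count = len(text)    # elements of the "character" abstraction
encoded = text.encode("utf-8")
byte_count = len(encoded)       # 1-byte elements of the UTF-8 encoding

print(code_point_count, byte_count)
```

Both numbers are "lengths" in the element-counting sense; they differ only in which element is being counted.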

Also consider: C has a `sizeof` operator, influencing Python's `.__sizeof__()` methods. That's because the concept of "size" equally makes sense for non-sequences; neither "count" nor "length" does. So of course "length" cannot mean what the author calls "size".
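
A small illustration of that last point in Python: an integer has a size but no length. The exact number of bytes reported depends on the interpreter build, so it is printed rather than assumed.

```python
import sys

number = 42

number_size = sys.getsizeof(number)  # size in bytes of the int object

try:
    len(number)  # an int is not a sequence, so this raises TypeError
    has_length = True
except TypeError:
    has_length = False

print(number_size, has_length)
```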

maleldil • yesterday at 2:44 PM

Big-endian naming is great. I've adopted it since I first read about it in matklad's blog.
