I'm somewhat dubious about anything that discusses low-level performance programming at the instruction level without distinguishing between latency and throughput, never mind mentioning the heavily out-of-order nature of modern desktop/server-class CPU cores.
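To make the latency/throughput distinction concrete, here is a minimal sketch (the function names are made up for illustration). The first loop is one long dependency chain, so each add has to wait for the previous one and the loop is bound by add latency. The second keeps four independent chains in flight, which an out-of-order core can overlap, so it is closer to being bound by add throughput. Whether you actually see a speedup depends on the CPU, the compiler, and the flags, and for floating point the reassociation can change the rounded result in general.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every add depends on the previous add (latency-bound).
double sum_one_chain(const std::vector<double>& v) {
    double acc = 0.0;
    for (double x : v) {
        acc += x;
    }
    return acc;
}

// Four accumulators: four independent dependency chains that the core can
// overlap, so the adds are limited more by throughput than by latency.
double sum_four_chains(const std::vector<double>& v) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        a0 += v[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

Both functions do the same number of adds; only the shape of the dependency graph differs, which is exactly what an instruction-count view of performance misses.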
Article title should be "Efficient C++ Programming for Modern 64-bit CPUs...".
See also the 3-part article "Advanced C++ Optimization Techniques for High-Performance Applications", discussed here: https://news.ycombinator.com/item?id=48265690
Virtual functions cost a lot less here than I expected.
C++ CPUs?
If people are interested in this stuff, this is the house style guide that I've ended up with as of mid-2026. Its great-great-great-grandparents were at Google, which informed Greg Badros, Mark Rabkin, and Andrei Alexandrescu when they did the one at FB, which informed a bunch of trading work, which in turn informed a bunch of GPU work.
It's opinionated but it has served me well.
https://gist.github.com/b7r6/5dde648f5dc1dea1e9039f2211f5d40...
This looks like something that every serious C++ programmer should be reading.
what if a language allowed you to elegantly pack Optional values?
so the physical layout has a bit vector with one bit for each optional, and a popcnt over that bit vector (masked up to the field we're interested in) gives the actual slot to look into?
would also make sense to reorder / bucket fields by (byte) size
if you want to do that in any low-level language (rust, c++) you have to deviate from its standard syntax for optionals, and you have to manually keep track of slot order. but for domains with many optional/default values, this may really reduce cache pressure, no?
In higher-level languages you can fake the effect (with flyweight facades), so from python such a packed "dataclass"-like class can look neat and clean. however, at the low level there is no abstraction that lets you define your own data layout.
at least I didn't find anything yet.
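for what it's worth, a hand-rolled version of the bit vector + popcount idea might look roughly like this in C++. this is only a sketch under a few assumptions: `PackedOptionals` is a made-up name, all fields share one value type (`int32_t`), there are at most 64 of them, and the values live in a `std::vector`, which adds a heap indirection that a real packed layout would want to avoid. `std::bitset::count` stands in for popcnt here; C++20's `std::popcount` would do the same job.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical container: up to 64 optional int32 fields, storing only the
// ones that are present. Bit i of present_ says whether field i has a value.
class PackedOptionals {
public:
    bool has(unsigned field) const {
        assert(field < 64);
        return ((present_ >> field) & 1u) != 0;
    }

    std::optional<std::int32_t> get(unsigned field) const {
        if (!has(field)) return std::nullopt;
        return values_[slot(field)];
    }

    void set(unsigned field, std::int32_t value) {
        const std::size_t s = slot(field);
        if (has(field)) {
            values_[s] = value;  // overwrite in place
            return;
        }
        // Insert so that values_ stays ordered by field index.
        values_.insert(values_.begin() + static_cast<std::ptrdiff_t>(s), value);
        present_ |= (std::uint64_t{1} << field);
    }

    std::size_t stored() const { return values_.size(); }

private:
    // Slot index = number of present fields with a smaller index, i.e. a
    // popcount over the presence bits masked to everything below `field`.
    std::size_t slot(unsigned field) const {
        assert(field < 64);
        const std::uint64_t below = present_ & ((std::uint64_t{1} << field) - 1);
        return std::bitset<64>(below).count();
    }

    std::uint64_t present_ = 0;
    std::vector<std::int32_t> values_;
};
```

setting fields 5, 2, and 9 stores exactly three values in field order, and `get(3)` comes back empty. it also shows the pain point: field indices are bare integers that you have to keep in sync by hand, and mixed value types would need one such bucket per size.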