Article title should be "Efficient C++ Programming for Modern 64-bit CPUs...".
A CPU implementing C++ as a microarchitecture…? Finally, incontrovertible proof of the prophecy. We really are living in a Cthulhu nightmare.
Simulation theory is dead.
That title got me:
Modern C++ CPUs as in LISP CPUs or as in Verilog CPUs?
Came here to say exactly that.
Many of the cost relationships in TFA were already more or less true for the 32-bit CPUs launched after 1990, and all of them became true for the high-end 32-bit CPUs launched after 2000 (like the Intel Pentium 4 and AMD Athlon XP), when the gap between the CPU clock cycle time and the DRAM latency became almost as large as it is today.
The cost differences between many kinds of operations may collapse only for the 32-bit CPUs used in microcontrollers, which may have clock frequencies under 100 MHz and may lack a cache hierarchy.
For instance, even for fairly recent 32-bit CPUs it is reasonable to classify the instructions into the following groups, from cheapest to most expensive, based on their cost in clock cycles:
1. Simple integer operations with operands in registers
2. Loads from the L1 cache memory and simple floating-point operations, like addition and multiplication
3. Loads from the L2 cache memory, division (integer or floating-point), square root and mispredicted branches
4. Loads from the L3 cache memory and atomic read-modify-write operations (like atomic exchange, atomic fetch-and-add, atomic compare-and-swap)
5. Loads from the main memory
This classification matches the chart from TFA.
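If anyone wants to see the load-latency steps (L1, L2, L3, main memory) on their own machine, a pointer-chasing loop is the usual trick: every load depends on the previous one, so the CPU cannot overlap them. What follows is only a rough sketch, not anything from TFA. The names `make_cycle`, `chase` and `ns_per_load` are made up for this example, the seed is arbitrary, and the numbers you get will also include TLB effects and timer overhead.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random permutation that forms one single cycle (Sattolo's
// algorithm), so following next[i] visits every slot before returning to 0.
std::vector<std::uint32_t> make_cycle(std::size_t n, std::uint64_t seed = 42) {
    std::vector<std::uint32_t> next(n);
    std::iota(next.begin(), next.end(), 0u);
    if (n < 2) return next;  // nothing to shuffle
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Swap with a strictly earlier slot; this is what guarantees one cycle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Perform `steps` dependent loads: the address of each load is the
// result of the previous one, so the loads are serialized.
std::uint32_t chase(const std::vector<std::uint32_t>& next, std::size_t steps) {
    assert(!next.empty());
    std::uint32_t i = 0;
    for (std::size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}

// Average time per dependent load, in nanoseconds, for a working set
// of roughly `bytes` bytes (hypothetical helper, not a calibrated benchmark).
double ns_per_load(std::size_t bytes, std::size_t steps) {
    const auto next = make_cycle(bytes / sizeof(std::uint32_t));
    // Warm-up pass; the volatile sink keeps the compiler from dropping the loop.
    volatile std::uint32_t sink = chase(next, next.size());
    const auto t0 = std::chrono::steady_clock::now();
    sink = chase(next, steps);
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count()
           / static_cast<double>(steps);
}
```

Calling `ns_per_load` with working sets of, say, 16 KiB, 256 KiB, 8 MiB and 256 MiB (sizes picked on the assumption of a typical desktop cache hierarchy, adjust for your CPU) and a few million steps should show the time per load going up in steps as the working set falls out of each cache level.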