> How much performance can a human still squeeze out at the assembly level versus today’s state of the art compilers?
Most of the squeezing is to be had in the parts where the compiler can’t help. (Which I guess is logically equivalent to saying that you can’t often do meaningfully better than the compiler on the things that the compiler is concerned with, but you have to admit it reads very differently.) Two important widely-applicable examples are data layout (locality, in particular getting rid of pointers, which are large and costly to traverse) and vectorization; what they have in common is that you may well have to redesign the entire flow of data in your program around the issue before you get meaningful improvements. (And there is often an order-of-magnitude improvement to be had on a CPU-bound task, if you are willing to spend the time and effort to optimize.)
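To make the data-layout point concrete, here is a minimal sketch (the particle example and all field names are mine, purely illustrative) of the classic array-of-structs versus struct-of-arrays trade-off: same data, same loop, but the second layout makes every fetched cache line fully useful and lets the loop vectorize trivially.

```c
#include <stddef.h>

/* Array-of-structs: summing one field drags the other seven
   through the cache, one 64-byte record per element. */
struct particle_aos {
    double x, y, z;
    double vx, vy, vz;
    double mass;
    double charge;
};

double sum_mass_aos(const struct particle_aos *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p[i].mass;          /* touches 8 of every 64 bytes fetched */
    return s;
}

/* Struct-of-arrays: the masses are contiguous, so every cache line
   fetched is fully used and the loop is SIMD-friendly. */
struct particles_soa {
    double *x, *y, *z;
    double *vx, *vy, *vz;
    double *mass;
    double *charge;
};

double sum_mass_soa(const struct particles_soa *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p->mass[i];         /* sequential, cache- and vector-friendly */
    return s;
}
```

The point of the article stands: getting from the first layout to the second is not a local tweak, it means changing every piece of code that allocates or walks the data.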
There are also specific situations where the approaches used by modern compilers work badly. The straightforward switch-based interpreter is a well-known example: modern Clang essentially turns into Clippy and goes “looks like you’re writing an interpreter, would you like me to duplicate your dispatch for you” so branch prediction works out as well as in manual assembly, but it still allocates registers a function at a time, so when the function in question is the entirety of the interpreter including the slowpaths, the regalloc sucks. Tail-call interpreters and __attribute__((cold, noinline, preserve_most)) amount to expressing the exact same control-flow graph in such a way that the compiler can digest it better, ironically by understanding less of it at any given time. This is one way that the dumb fundamental nature of the admittedly quite smart modern compiler shines through.
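To illustrate the tail-call shape (the bytecode, opcodes, and names here are hypothetical, and the cold-slowpath attribute machinery is omitted): each opcode becomes its own small function, and dispatch is a tail call through a table, so the compiler only ever allocates registers for one opcode's body at a time instead of for the whole interpreter.

```c
#include <stdint.h>

/* Toy stack-machine bytecode, purely illustrative. */
enum { OP_PUSH, OP_ADD, OP_HALT };

typedef int64_t (*op_fn)(const uint8_t *pc, int64_t *sp);

static int64_t dispatch(const uint8_t *pc, int64_t *sp);

static int64_t op_push(const uint8_t *pc, int64_t *sp) {
    *++sp = pc[1];                    /* one-byte immediate operand */
    return dispatch(pc + 2, sp);      /* tail call back into dispatch */
}

static int64_t op_add(const uint8_t *pc, int64_t *sp) {
    sp[-1] += sp[0];
    return dispatch(pc + 1, sp - 1);
}

static int64_t op_halt(const uint8_t *pc, int64_t *sp) {
    (void)pc;
    return *sp;                       /* result is on top of the stack */
}

static const op_fn table[] = { op_push, op_add, op_halt };

static int64_t dispatch(const uint8_t *pc, int64_t *sp) {
    /* With tail calls applied this compiles to an indirect jump,
       giving each opcode its own dispatch site for the predictor.
       Production interpreters pin this down with Clang's
       __attribute__((musttail)) rather than hoping the optimizer
       cooperates. */
    return table[*pc](pc, sp);
}

int64_t run(const uint8_t *code) {
    int64_t stack[64];                /* stack[0] is a dummy slot; sp points at the top */
    return dispatch(code, stack);
}
```

The control-flow graph is exactly the one a switch-based interpreter has; it is just chopped into pieces small enough that per-function register allocation stops being a liability.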
And in very tight loops there are still places where doing things by hand can help. For instance, when computing a histogram of byte values over a large block (for which I’m not aware of any public vectorized code that would go faster than the best scalar options) I’ve seen Clang lose as much as 20% to (contemporary) GCC on the best C implementation[1] or its straightforward manual translation to assembly, because Clang had decided it knew better which order the instructions should go in. As a less exotic case, I’ve seen GCC lose out by about 20% to (contemporary) Clang in vectorized loops because it had decided that having half the loop body be MOVs (or rather VMOVDQAs) would be a better idea than taking advantage of AVX’s ability to not overwrite either of the input arguments, and though MOVs are basically free on a superscalar they’re not that free. I’ve even seen both GCC and Clang ignore an explicit __builtin_expect() and compile a very predictable (but unavoidable) inner-loop branch into a CMOV, once again costing me about 20% in performance.
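For reference, the multi-table trick the fast scalar histograms rely on looks roughly like this (a sketch of the general technique, not a reconstruction of the code behind [1]): several interleaved count arrays break the store-to-load dependency that builds up when the same byte value repeats, at the cost of a merge pass at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar byte histogram with four sub-tables. A single-table version
   serializes on repeated byte values: each increment must wait for the
   previous store to the same slot to forward. Striping the counts
   across four tables lets four increments proceed independently. */
void histogram(const uint8_t *buf, size_t n, uint32_t out[256]) {
    uint32_t h[4][256] = {{0}};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        h[0][buf[i + 0]]++;
        h[1][buf[i + 1]]++;
        h[2][buf[i + 2]]++;
        h[3][buf[i + 3]]++;
    }
    for (; i < n; i++)                /* tail */
        h[0][buf[i]]++;
    for (int v = 0; v < 256; v++)     /* merge the sub-tables */
        out[v] = h[0][v] + h[1][v] + h[2][v] + h[3][v];
}
```

This is exactly the kind of loop where the instruction scheduling quibble above bites: the C is simple enough that the compiler's reordering, not the source, decides whether you land at 1.1 or 1.3 cycles/byte.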
So if you do in fact care about the difference between 1.1 cycles/byte and 1.3 cycles/byte, yes you can beat a compiler even on a micro level. You just probably don’t have the, depending on your point of view, fortune or misfortune of working on code like that.