Looks good! There's an important thing missing from the benchmarks though:
- cpu usage under concurrency: many of these spin-lock or use atomics, which can use up to 100% cpu time just spinning.
- latency under concurrency: atomics cause cache-line bouncing which kills latency, especially p99 latency