MrCroxx • last Tuesday at 2:28 PM • 1 reply

Author here. This post is a write-up of a performance-debugging rabbit hole I hit while trying to saturate NICs with NVMe reads using io_uring and RDMA.

The short version: READ_FIXED fixed the obvious per-I/O GUP overhead in a small demo, but the larger deployment still got stuck at roughly half of line rate. After ruling out io-wq backlog, request splitting, fd lookup, and CRC arithmetic, the actual wall turned out to be dTLB misses from scanning 1,028 KiB buffers backed by 4 KiB pages. Moving the read arena to hugepages brought the system close to NIC saturation.
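As a rough back-of-the-envelope illustration of why page size matters here, the sketch below counts how many distinct pages (and so how many address translations) one buffer scan touches. The 2 MiB hugepage size and the alignment of each buffer to a page boundary are assumptions for illustration; neither is stated above, and `pages_touched` is a hypothetical helper name.

```python
# Back-of-the-envelope: distinct pages covered by one buffer scan.
# Assumptions (not from the post): 2 MiB hugepages, and each buffer
# starts on a boundary of the page size being considered.

KIB = 1024


def pages_touched(buffer_bytes: int, page_bytes: int) -> int:
    """Number of distinct pages covered by a buffer that starts on a page boundary."""
    return -(-buffer_bytes // page_bytes)  # ceiling division


buffer_bytes = 1028 * KIB
small_pages = pages_touched(buffer_bytes, 4 * KIB)     # 4 KiB base pages
huge_pages = pages_touched(buffer_bytes, 2048 * KIB)   # assumed 2 MiB hugepages

print(small_pages, huge_pages)  # prints "257 1"
```

Under these assumptions each buffer scan needs 257 translations with 4 KiB pages but only one with a hugepage, which is consistent with dTLB pressure dropping once the arena moves to hugepages. Buffers packed back to back inside an arena could straddle two hugepages, so the real count may be 2 rather than 1.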

The funny part is that an AI agent suggested hugepages early and got the optimization right, but its explanation was wrong. This post is mostly about reconstructing the evidence for why it worked.

I’d be very interested in feedback from people who have used AI to debug performance issues in a complex system.

Replies

ozgrakkurt • today at 3:27 PM

I disagree with the AI part: hugepages are one of the things anyone could guess would improve performance when working with a substantial amount of data.

So anyone familiar with the space could have suggested something like that without knowing the details of the problem. Hence it is not useful advice IMO.

That aside, the blog post was really cool to read and an instant favorite. I wish there were more English posts on the blog.

I especially like the hardware-limit-based expectations, the detailed measurements, and the writing style.
