> By having a contiguous array of indices to look at, that array can be prefetched as it goes
Does x86 64 actually do this data dependent single deref prefetech? Because in that case I have a some design assumptions I have to reevaluate.
Hardware definitely supports this but it might need compiler support, as in adding instructions to do prefetching. Which might be done automatically or requires a pragma or calling a builtin. So it can be implemented in any case.
On modern cpus? Most likely. Those kinds of optimizations are done by the core with no compiler magic needed.
CPU implementation has become too complex to grasp. The only sure way to know how a CPU will behave for a given workload is to run the workload. It's good to have some basic expectations of performance, instructions/cycle, memory bandwidth, to detect if something is off. I guess I'm trying to say it's hard to keep in your head all the details of what ~1B transistors are doing together to run your code. It's just too big.