Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especi...

foltik • today at 3:28 AM • 6 replies • view on HN

Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.

The hedging technique is a cool demo too, but I’m not sure it’s practical.

At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.

I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.

Especially HFT is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It’s just far better to work with what you can fit in cache. And to shrink what doesn’t as much as possible.

Replies

Lramseyer • today at 7:23 AM

Another point about HFT - They're mostly using FPGAs (some use custom silicon) which means that they have much tighter control over how DRAM is accessed and how the memory controller is configured. They could implement this in hardware if they really need to, but it wouldn't be at the OS level.

strongpigeon • today at 2:54 PM

> At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.

That’s my main hang up as well. On one hand this is undeniably cool work, but on the other, efficient cache usage is how you maximize throughput.

This optimizes for (narrow) tail latency, but I do wonder at what performance cost. I would be super interested in hearing about real world use cases.

➕ show 1 reply

josephg • today at 9:25 AM

It could be massively improved with a special CPU instruction for racing dram reads. That might make it actually useful for real applications. As it is, the threading model she used here would make it incredibly difficult to use this in a real program.

➕ show 2 replies

zozbot234 • today at 11:16 AM

> clear spikes from 70ns to 330ns

Isn't that rather trivial though as a source of tail latency? There's much worse spikes coming from other sources, e.g. power management states within the CPU and possibly other hardware. At the end of the day, this is why simple microcontrollers are still preferred for hard RT workloads. This work doesn't change that in any way.

➕ show 1 reply

formerly_proven • today at 7:09 AM

On most RAM tREF can be increased a lot from the default, at least if kept somewhat cool.

jeffbee • today at 3:08 PM

It is not only not practical, it is a completely useless technique. I got downvoted to negative infinity for mentioning this, but I guess I am the only person who actually read the benchmark. The reason the technique "works" in the benchmark is that all the threads run free and just record their timestamps. The winner is decided post hoc. This behavior is utterly pointless for real systems. In a real system you need to decide the winner online, which means the winner needs to signal somehow that it has won, and suppress the side effects of the losers, a multi-core coordination problem that wipes out most of the benefit of the tail improvement but, more importantly, also massively worsens the median latency.

➕ show 2 replies

alt Hacker News

Replies