Related: Tailslayer: Library for reducing tail latency in RAM reads - https://news.ycombinator.com/item?id=47680023 - April 2026 (23 comments)
A more accurate but less inspiring title would be:
RAM Has a Design Tradeoff from 1966. I made another one on top.
The first tradeoff, trading 6x fewer transistors for some extra latency, is immensely beneficial. The second, trading extra copies of static data for a reduction in some of that latency, benefits only a few extremely niche applications. Still, it's a very educational video about modern memory architecture.
This is very much worth watching. It is a tour de force.
Laurie does an amazing job of reimagining Google's strange job-hedging technique (originally for jobs backed by hard-disk storage), which runs the same job on two machines, takes the result of whichever finishes first, and discards the slower one's results. It seems expensive in resources, but it works and lets high-priority tasks run optimally.
Laurie reimagines this process but for RAM!! In doing so she needs to deal with cores, RAM channels, and other relatively undocumented CPU memory-management features.
She was even able to work out various undocumented CPU/RAM settings by using her tool to find where timing differences exposed various CPU settings.
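For anyone who hasn't seen the video yet, the hedging idea itself is simple enough to sketch in a few lines. This is a hypothetical illustration of the general technique (duplicate the data, race two readers, take the first result), not Tailslayer's actual API:

```python
# Request hedging, sketched: issue the same read against two independent
# copies of the data and keep whichever finishes first. The occasional
# slow replica no longer determines your latency.
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def read_replica(data, index, max_jitter_s=0.005):
    """Simulate a read whose latency occasionally spikes."""
    time.sleep(random.random() * max_jitter_s)  # stand-in for tail latency
    return data[index]

def hedged_read(replica_a, replica_b, index):
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(read_replica, r, index) for r in (replica_a, replica_b)]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()
    pool.shutdown(wait=False, cancel_futures=True)  # don't wait for the straggler
    return result

data = list(range(1000))
copy_a, copy_b = data, list(data)  # two copies: memory traded for latency
print(hedged_read(copy_a, copy_b, 42))  # prints 42
```

The cost is visible right in the sketch: double the storage and double the issued work, in exchange for taking the min of two latency samples instead of one.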
She's turned "Tailslayer" into a library now, available on GitHub: https://github.com/LaurieWired/tailslayer
You can see her having so much fun, doing cool victory dances as she works out ways of getting around each of the issues that she finds.
The experimentation, explanation and graphing of results is fantastic. Amazing stuff. Perhaps someone will use this somewhere?
As mentioned in the YT comments, the work done here is probably a Master's degree's worth of work, experimentation, and documentation.
Go Laurie!
LaurieWired is so incredibly smart, and so incredibly nerdy :-D
Really enjoyed this video, and I'm pretty picky. I learned a lot, even though I already knew (or thought I knew) quite a bit about this subject; it was a particular interest of mine in Comp Sci school. Highly recommended. You can skip forward through chunks of the train part where she's messing around, but it gets more informative later, so don't skip all of it.
Halfway through this great video and I have two questions:
1) Can we take this library and turn it into a generic driver or something that applies the technique to all software (kernel and userspace) running on the system? i.e., if I want to halve my effective memory in order to completely eliminate the tail-latency problem, without having to rewrite legacy software to implement this invention.
2) What model miniature smoke machine is that? I instruct volunteer firefighters and occasionally do scale model demos to teach ventilation concepts. Some research years back led me to the "Tiny FX" fogger which works great, but it's expensive and this thing looks even more convenient.
This is a cool idea, very well presented so that everyone can understand such an esoteric concept.
However, I wonder whether the core idea itself is useful in practice. With modern memory there are two main aspects it makes worse. The first is cost: it doubles the memory used for the same compute, which isn't great with memory prices already soaring. The other is throughput; I haven't put enough thought into that yet, but it feels like it requires more orchestration and increases costs there too.
Voxel Space[1] could have used this, had multicore been prevalent at the time. I recall being fascinated that simply facing the camera north or south would knock 2 fps off an already slow frame rate.
Many of our maps' routes were laid out along predominantly east- or west-facing tracks to maximize how long we stayed within cache lines as we marched our rays up the screen.
So we needed as much main-memory bandwidth as we could get. I remember experimenting with cache-line warming to try to keep the memory controllers saturated with work, with measurable success. But it would have been difficult in Voxel Space to predict which lines to warm (and when), so nothing came of it.
Tailslayer would have given us an edge by just splitting up the scene with multiprocessing, at the cost of a lot more RAM usage but without any other code changes. Alas, hardware like that was about 15 years in the future. Le sigh.
Doesn't doing this halve the computing power? I don't know this world at all, is that acceptable?
I haven't had time to watch the whole thing yet, but I'm quite surprised this yielded good results. If this works, I would have expected CPU implementations to do some optimization around it by default, given that memory latency has been the bottleneck for the last decade and a half. What am I missing here?
She could probably have gotten stinking rich on this work alone, but instead she just put it up on GitHub. Kudos to Laurie.
Am I the only one who feels the comments here don't sound organic at all?
This is an unreasonably good video. Hopefully, it inspires others to see we can still think hard and critically about technical things.
Probably will get a lot of views from guys who have no idea what she is talking about.
Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them by reverse-engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.
The hedging technique is a cool demo too, but I’m not sure it’s practical.
At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.
HFT in particular is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It's just far better to work with what you can fit in cache, and to shrink what doesn't fit as much as possible.