Hacker News

anematode, yesterday at 3:35 PM

Definitely. As an extreme but fun example... in one project I had a massive hash map (~700 GB or so) that was concurrently read from and written to by 256 threads. The entries themselves were only 16 bytes, so I could use atomic cmpxchg, but the problem I hit was that even with 1 GB huge pages, I was running out of dTLB entries. So I assigned each thread to a subregion of the hash table, then used channels between each pair of threads to handle the reads and writes (and restructured the program a bit to allow this). Since the dTLB budget is per core, this gave me essentially 0 dTLB misses, and ultimately sped up the program by ~2x.
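
To make the shape of that concrete, here is a minimal sketch of the ownership scheme. It is not the original code: it is in Python for brevity, and every name in it (`DelegatedMap`, `owner_of`, `shard_worker`, `NUM_SHARDS`) is hypothetical. It also simplifies by giving each owning thread a single inbox instead of a channel per pair of threads, and it only shows the routing, not the dTLB effect.

```python
import queue
import threading

NUM_SHARDS = 4  # hypothetical; the comment above describes 256 threads


def owner_of(key: int) -> int:
    """Map a key to the shard (and thread) that owns it."""
    return key % NUM_SHARDS


def shard_worker(inbox: queue.Queue) -> None:
    """Sole owner of one shard, so its table needs no locks or atomics."""
    table = {}
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown sentinel
            return
        op, key, value, reply = msg
        if op == "put":
            table[key] = value
            reply.put(None)  # acknowledge the write
        else:  # "get"
            reply.put(table.get(key))


class DelegatedMap:
    """Routes every read and write to the thread that owns the key's shard."""

    def __init__(self) -> None:
        self.inboxes = [queue.Queue() for _ in range(NUM_SHARDS)]
        self.threads = [
            threading.Thread(target=shard_worker, args=(q,), daemon=True)
            for q in self.inboxes
        ]
        for t in self.threads:
            t.start()

    def _call(self, op: str, key: int, value=None):
        reply = queue.Queue(maxsize=1)  # one-shot reply channel
        self.inboxes[owner_of(key)].put((op, key, value, reply))
        return reply.get()

    def put(self, key: int, value) -> None:
        self._call("put", key, value)

    def get(self, key: int):
        return self._call("get", key)

    def close(self) -> None:
        for q in self.inboxes:
            q.put(None)
        for t in self.threads:
            t.join()
```

The point of the design is that only one thread ever touches a given subregion, so each core's working set (and its dTLB footprint) is limited to its own slice of the table. The cost moves into message passing, which is why the program had to be restructured to tolerate it.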


Replies

senderista, yesterday at 4:57 PM

The "delegation pattern" for datastructures:

https://timharris.uk/papers/2013-opodis.pdf
