Hacker News

joha4270 · yesterday at 3:18 PM

I'm sorry, I'm clearly missing something but why would page size impact L1 cache size?


Replies

aseipp · yesterday at 4:12 PM

When you do a cache lookup, a few bits of the address form an "index" that selects a set (a "bucket") of cache lines, and the remaining upper bits form a "tag" that is compared against each entry in that set to find the matching cache line. The number of entries you compare against is the associativity of the cache, e.g. 8-way or 12-way associativity means there are 8 or 12 entries in that bucket. Higher associativity lets you grow the cache without adding buckets, but it worsens latency and power, as you have to check every entry in the bucket. These are the two points you can trade off: do you want more total buckets, or do you want each bucket to have more entries?

To do this lookup in the first place, you pull a number of bits from the address to decide which bucket to start at, and those bits have to be ones that are identical between the virtual and physical address, i.e. bits inside the page offset. The minimum page size therefore determines how many bits you can use to refer to unique buckets. If you don't have a lot of bits, then you can't count very high (with 64-byte lines and 4k pages, that's 12 - 6 = 6 bits = 2^6 = 64 buckets) -- so to increase the size of the cache, you instead have to increase the associativity, which makes latency worse. For L1 cache, you basically never want to make latency worse, so you are practically capped here.

Platforms like Apple Silicon instead set the minimum page size to 16k, so you get more bits to count buckets (14 - 6 = 8 bits = 256 buckets). Thus you can increase the size of the cache while keeping associativity low; L1 cache on Apple Silicon is something crazy like 192KB, and L2 (for the same reasons) is 16MB or more. x86 machines and software, for legacy reasons, are very much tied to a 4k page size, which puts something of a practical limit on the size of their downstream caches.

Look up "Virtually Indexed, Physically Tagged" (VIPT) caches for more info if you want it.

adgjlsfhk1 · yesterday at 3:42 PM

This is the most cursed part of modern CPU design, but the TL;DR is that programs use virtual addresses while the memory system uses physical addresses, so every cache access involves a virtual-to-physical translation. The problem is that for L1 cache, the latency requirement of 3-4 cycles is too strict to first do a TLB lookup and then an L1 cache lookup, so the L1 can only be indexed by the address bits which are identical between physical and virtual addresses. With a 4k page size, you only have 6 such bits between the size of your cache line (64 bytes) and the size of your page, which means that with an 8-way associative L1D, you only get 64 buckets * 8 ways * 64 bytes = 32KB of L1 cache. If you want to increase that while keeping the 4k page size, you need to up the associativity, but that has massive power draw and area costs, which is why x86 L1D sizes have barely moved since the Core 2 Duo in 2006.
