logoalt Hacker News

bootsmannyesterday at 9:45 AM4 repliesview on HN

Really? Doesn’t v4 locally make the inserts into the B-Tree pretty messy? I was taught to use v7 because it allows writes to be a lot faster due to memory efficient paging by the kernel (something you lose with v4 because the page of a subsequent write is entirely random).


Replies

sintaxyesterday at 10:44 AM

https://www.thenile.dev/blog/uuidv7#why-uuidv7 has some details: " UUID versions that are not time ordered, such as UUIDv4 (described in Section 5.4), have poor database-index locality. This means that new values created in succession are not close to each other in the index; thus, they require inserts to be performed at random locations. The resulting negative performance effects on the common structures used for this (B-tree and its variants) can be dramatic. ".

Also mentioned on HN https://news.ycombinator.com/item?id=45323008

show 1 reply
da_chickenyesterday at 1:37 PM

It's memory and disk paging both.

There's also a hot spot problem with databases. That's the performance problem with autoincrement integers. If you are always writing to the same page on disk, then every write has to lock the same page.

Uuidv7 is a trade off between a messy b-tree (page splits) and a write page hot spot (latch contention). It's always on the right side of the b-tree, but it's spread out more to avoid hot spots.

That still doesn't mean you should always use v7. It does reversibly encode a timestamp, and it could be used to determine the rate that ids are generated (analogous to the German tank problem). If the uuidv7 is monotonic, then it's worse for this issue.

out_of_protocolyesterday at 10:32 AM

v7 exposes creation date, and maybe you don't want that. So, depends on use-case

show 1 reply
matjayesterday at 10:01 AM

In distributed databases I've worked with, there's usually something like a B-tree per key range, but there can be thousands of key ranges distributed over all the nodes in the cluster in parallel, each handling modifications in a LSM. The goal there is to distribute the storage and processing over all nodes equally, and that's why predictable/clustered IDs fail to do so well. That's different to the Postgres/MySQL scenario where you have one large B-tree per index.