
ahachete, today at 3:49 AM

> You definitely don't want to run a production system at saturation! But it's worthwhile to measure a complex system like Postgres at saturation, see when it gets there and how it behaves there, and then run at a slightly lower throughput.

I disagree. A number measured at saturation is worthless, because "a slightly lower throughput" is at best unqualified hand-waving. The real operating point can be quite far from that saturation point.

Quote real production numbers instead. You can define them clearly; it's not that hard. E.g.: p95 latency below 10ms. That's it. Measure and report that number.
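To make that concrete, here is a minimal sketch of how such a number could be derived, assuming you already have per-request latencies (in ms) recorded at several offered rates. The `runs` structure and both function names are hypothetical, not part of any benchmark tool, and the nearest-rank percentile is just one common definition.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct (1..100).

    Returns the smallest sample such that at least pct% of all
    samples are at or below it.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Ceiling of pct * n / 100 using integer arithmetic only.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def max_rate_within_slo(runs, pct=95, limit_ms=10.0):
    """Highest offered rate whose pct-th percentile latency is below limit_ms.

    runs: hypothetical mapping of offered rate (writes/sec) to the list
    of latencies in ms observed at that rate. Returns None if no run
    meets the SLO.
    """
    passing = [rate for rate, latencies in runs.items()
               if percentile(latencies, pct) < limit_ms]
    return max(passing) if passing else None
```

The reported figure is then "N writes/sec at p95 below 10ms", which may sit well below the saturation throughput.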

> I've done some testing (not in the blog post)--doubling instance size/IOPS doesn't improve performance significantly because it doesn't affect the WAL bottleneck. Local NVMe should have a significant impact in theory, but I haven't tested this myself.

But those would be interesting numbers to share! "Doesn't improve performance significantly": sorry, I'm not a big fan of unqualified data points. Is it 10%, 20%, 50%? And of course, when you measure at saturation, you won't see improvements. But measured in an operational regime, you should probably see notable improvements (unless other scaling factors start to dominate, in which case your benchmark becomes much more meaningful, because then you are finding Postgres scaling limits and not just the limits of the disk it's running on). That changes the picture dramatically.

> Those are usage examples (notice the 1000 rps)--actual benchmarks were run at and were stable at much longer duration.

Sorry, but using that as an example gives me little confidence about the real intent. But glad to hear you ran at longer durations; add that information to the post! Again, though, that's not enough. Show the bloat and demonstrate how stable it is, given the tuning required to keep it contained, of course. Also show the tps over time (I'm sure it drops notably in the presence of checkpoints), and then the "under 10ms latency at p95" will become dominated by write performance during checkpoints.

Because when you determine your SLOs, you don't do it on the happy path, but the opposite. And saying "Postgres can do 144K writes/sec on this machine" is beyond the happy path, so it's not meaningful to me.
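For the tps-over-time view, a rough sketch, assuming the load generator logs one completion timestamp (in seconds) per committed write. The function name and input format are made up for illustration. Empty buckets are kept as zeros so that stalls, for example around checkpoints, show up instead of disappearing.

```python
from collections import Counter


def tps_series(commit_times_s, bucket_s=1.0):
    """Throughput per time bucket, in transactions per second.

    commit_times_s: hypothetical list of commit completion timestamps
    in seconds, in any order. Buckets with no commits are reported as
    0.0 rather than skipped.
    """
    if not commit_times_s:
        return []
    start = min(commit_times_s)
    # Count commits per bucket index, measured from the first commit.
    counts = Counter(int((t - start) // bucket_s) for t in commit_times_s)
    return [counts.get(i, 0) / bucket_s for i in range(max(counts) + 1)]
```

Plotting this series, or just reporting its minimum next to the average, shows how far the worst window falls from the headline number.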