8K QPS is probably quite trivial on their setup with a 10M dataset. I rarely use instances and datasets that small in my benchmarks, but on 100M-1B datasets on a larger dual-socket server, 100K QPS was easily achievable back in 2023: https://www.unum.cloud/blog/2023-11-07-scaling-vector-search... ;)
Typically, the recipe is to keep the hot parts of the data structure in CPU caches (SRAM) and to lean heavily on SIMD. At the time of those measurements, USearch used ~100 custom kernels covering different data types, similarity metrics, and hardware platforms. The upcoming release of the underlying SimSIMD micro-kernels project will push that number beyond 1000, so we should be able to squeeze out a lot more performance later this year.
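To make that concrete, here's a rough sketch of the shape of one such kernel: an f32 dot product for AVX2+FMA. This is a simplified illustration, not SimSIMD's actual code, and the function name is mine; the real kernels also cover f16/bf16/i8/b1 inputs, more metrics, and NEON/SVE/AVX-512 targets.

    #include <immintrin.h>
    #include <stddef.h>

    /* Simplified f32 dot-product kernel for AVX2+FMA.
       Illustrative name; compile with -mavx2 -mfma. */
    float dot_f32_avx2(float const *a, float const *b, size_t n) {
        __m256 sum = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            sum = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                  _mm256_loadu_ps(b + i), sum);
        /* Reduce the 8 partial sums to a scalar. */
        __m128 half = _mm_add_ps(_mm256_castps256_ps128(sum),
                                 _mm256_extractf128_ps(sum, 1));
        half = _mm_hadd_ps(half, half);
        half = _mm_hadd_ps(half, half);
        float result = _mm_cvtss_f32(half);
        for (; i < n; ++i) /* Scalar tail for odd lengths. */
            result += a[i] * b[i];
        return result;
    }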
Author here. Appreciate the context! Just wanted to add some perspective on the 8K QPS figure: in the VectorDBBench setting we used (10M vectors, 768 dimensions, hardware comparable to the previous leader's), we're seeing double their throughput, so it's far from trivial on that playing field.
That said, self-reported numbers only go so far. It'd be great to see USearch in more third-party benchmarks like VectorDBBench or ANN-Benchmarks; those would make for a much more interesting comparison!
On the technical side, there's some impressive work in USearch, and you're right that SIMD and cache optimization are well-established techniques (definitely part of our toolbox too). Curious about your setup, though: vector search has a fairly uniform compute pattern, so while 100+ custom kernels are great for adapting to different hardware (something we're also pursuing), I suspect most of the gain usually comes from a core set of techniques, especially when you're optimizing for peak QPS on a given machine and index type. Looking forward to seeing what your upcoming release brings!
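For readers wondering what "adapting to different hardware" amounts to in practice, here's a minimal sketch of runtime kernel dispatch, assuming a portable fallback plus an AVX2 variant like the one sketched above. The names and structure are illustrative, not any particular library's API:

    #include <stddef.h>

    typedef float (*dot_kernel_t)(float const *, float const *, size_t);

    /* Portable fallback that works on any CPU. */
    static float dot_f32_serial(float const *a, float const *b, size_t n) {
        float sum = 0.f;
        for (size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    /* e.g. the AVX2+FMA kernel sketched earlier. */
    float dot_f32_avx2(float const *a, float const *b, size_t n);

    /* Pick the best kernel once at startup, based on CPUID feature
       flags (GCC/Clang builtin, x86 only). */
    dot_kernel_t choose_dot_kernel(void) {
        if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
            return dot_f32_avx2;
        return dot_f32_serial;
    }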