Hacker News

weitendorf, yesterday at 5:54 PM

Agree with this, and I have been thinking about it recently as well. I think you could implement a cord-like vocabulary to identify large duplicated substrings for exact deduplication, and use pairwise correlations or vocabulary profiles/small classifiers for forward-looking or speculative deduplication. A clear example is the GPL license: it’s a large substring you might encounter often, and it’s highly likely to be accompanied by lots of C code.
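To make the exact-deduplication half concrete, here is a minimal sketch, assuming documents are plain text and that hashing every run of `k` consecutive non-blank lines is a good enough stand-in for a real cord-like structure. The helper names (`window_hashes`, `duplicated_windows`), the window size, and the placeholder license text are all made up for illustration; this is not how any particular provider does it.

```python
import hashlib
from collections import defaultdict


def window_hashes(text, k=3):
    """Hash every run of k consecutive non-blank lines (whitespace-stripped)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    hashes = set()
    for i in range(len(lines) - k + 1):
        window = "\n".join(lines[i:i + k])
        hashes.add(hashlib.sha256(window.encode("utf-8")).hexdigest())
    return hashes


def duplicated_windows(docs, k=3):
    """Return {window_hash: set of doc ids} for windows seen in 2+ documents."""
    seen = defaultdict(set)
    for doc_id, text in docs.items():
        for h in window_hashes(text, k):
            seen[h].add(doc_id)
    return {h: ids for h, ids in seen.items() if len(ids) > 1}


# Placeholder standing in for a long shared license header (assumption).
LICENSE = "license line 1\nlicense line 2\nlicense line 3\nlicense line 4"

docs = {
    "a.c": LICENSE + "\nint main(void) { return 0; }",
    "b.c": "#include <stdio.h>\n" + LICENSE + "\nvoid f(void) {}",
    "c.py": "print('hello')\nprint('world')\nprint('!')",
}

dups = duplicated_windows(docs)
for h, ids in dups.items():
    print(h[:12], sorted(ids))
```

Only the windows that fall entirely inside the shared header show up as duplicates, and they point at the two C files. The speculative half (guessing that C code follows once you see the license) would sit on top of something like this and is not sketched here.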

This is probably something that you’d be doing on the CPU before sending anything to the GPU, though the GPU is definitely the sensitive surface since it’s hardware without good multitenancy. I assume the interface between the CPU and GPU is where you would be most likely to make a mistake: you start decoding data from one fd that was meant for another, or from the wrong position, and get someone else’s data.

I wouldn’t be confident that these are active exploits deliberately abusing KV cache optimizations, though; they’re possibly just the kind of bugs you get from active low-level performance tuning/systems work. Since this is something I have seen across providers lately, I personally suspect it to be a driver issue.


Replies

27183, today at 1:49 AM

Given the size of the datacenter-class GPUs they're running these models on, don't they need to be processing multiple tenants concurrently per GPU to extract the full potential of the hardware?

I agree: shuffling the data between the CPU and GPU is itself fraught with peril. It's all the hairiest distributed systems problems combined with the sketchiest memory safety issues, all in one place.