Given the size of the datacenter-class GPUs they're running these models on, don't they need to be processing multiple tenants concurrently per GPU to extract the full potential of the hardware?
I agree, shuffling the data between the CPU and GPU is itself fraught with peril. It's the hairiest distributed systems problems combined with the sketchiest memory safety issues, all in one place.