A discrete consumer GPU card doesn't have enough fast RAM to run a very large model that hasn't been quantized to hell.
That's why all the projects streaming models into the GPU from an SSD popped up recently.
Yes. There’s just no way to get above 1 t/s that way with a large model.
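Rough back-of-envelope for why: if the weights don't fit in VRAM and have to be re-read from the SSD for every token, the read speed divided by the model size caps the token rate. A quick sketch, where the function name and the example numbers (70B params, 4-bit, ~7 GB/s NVMe) are my own illustrative assumptions, not measurements. It also assumes a dense model with nothing cached in VRAM and ignores compute time entirely.

```python
def max_tokens_per_second(params_billions, bits_per_weight, read_gb_per_s):
    """Upper bound on decode speed when all weights are streamed per token.

    Assumes a dense model (every weight is read once per generated token)
    and that none of the weights stay resident in VRAM between tokens.
    """
    # 1e9 params * (bits / 8) bytes per param = size in GB
    model_gb = params_billions * bits_per_weight / 8
    return read_gb_per_s / model_gb


# Hypothetical example: 70B params at 4 bits over a ~7 GB/s NVMe drive
print(max_tokens_per_second(70, 4, 7.0))
```

Keeping part of the model resident in VRAM, or using an MoE model that only touches some of the weights per token, raises that ceiling, but for a big dense model the bound stays well under 1 t/s.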