How do you split the model between multiple GPUs?

dev_l1x_be • today at 1:41 PM • 1 reply

Replies

With "only" 32B active params, you don't necessarily need to. We're straying from common home users to serious enthusiasts and professionals, but this seems like it would run OK on a workstation with half a terabyte of RAM and a single RTX 6000.

But to answer your question directly: tensor parallelism, where each layer's weight matrices are sharded across the GPUs. https://github.com/ggml-org/llama.cpp/discussions/8735 https://docs.vllm.ai/en/latest/configuration/conserving_memo...
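
To make "tensor parallelism" a bit more concrete, here is a rough sketch of the column-parallel case in plain Python. It is only an illustration: the "devices" are just list shards in one process, the helper names (`split_columns`, `column_parallel_linear`) are made up for this example, and none of it is how llama.cpp or vLLM implement it internally.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w, used as the single-device reference."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w: Matrix, num_devices: int) -> List[Matrix]:
    """Split w column-wise into num_devices contiguous shards (hypothetical helper)."""
    n_cols = len(w[0])
    if n_cols % num_devices != 0:
        raise ValueError("column count must divide evenly across devices")
    step = n_cols // num_devices
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_devices)]


def column_parallel_linear(x: Matrix, shards: List[Matrix]) -> Matrix:
    """Every 'device' sees the full input but only its own slice of the weights."""
    # On real hardware these partial products would run concurrently, one per GPU.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenate the per-device output columns back into full rows (the "gather" step).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

shards = split_columns(w, num_devices=2)
# The sharded result matches the unsharded product.
assert column_parallel_linear(x, shards) == matmul(x, w)
```

The point is that no single device ever holds all of `w`, only its slice, at the cost of a gather step per layer. In practice you wouldn't write this yourself; vLLM, for example, takes a `tensor_parallel_size` argument that sets how many GPUs to shard across.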
