True, but a cluster built on pipeline parallelism can naturally stream from multiple SSDs in parallel. That probably makes offload somewhat more effective. And you also have RAM caching available as a natural possibility.
You won't be RAM caching much of anything when each expert is 220B parameters' worth of layers.
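
For a rough sense of scale, here is a back-of-the-envelope sketch of what one such expert would weigh in memory. The 220B figure is the one quoted above; the bit widths are just assumed example precisions, and the helper name `weight_footprint_gb` is made up for illustration. It counts raw weights only, not activations or KV cache.

```python
def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Raw weight size in gigabytes (1 GB = 1e9 bytes), ignoring any overhead."""
    return n_params * bits_per_param / 8 / 1e9


EXPERT_PARAMS = 220e9  # parameter count quoted in the comment above

# Assumed example precisions; the thread does not say which one is in use.
for bits in (16, 8, 4):
    size = weight_footprint_gb(EXPERT_PARAMS, bits)
    print(f"{bits}-bit weights: {size:.0f} GB per expert")
```

Even at an assumed 4-bit quantization that is on the order of 110 GB for a single expert, so a RAM cache would have to be very large before it could hold more than a fraction of one.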