Data bandwidth limits distributed training under current architectures. Really interesting implications if we can make progress on that.
Limits, but doesn't prohibit. See https://www.primeintellect.ai/blog/intellect-3: it's still useful and can scale enormously. It takes a particular shape and relies heavily on RL, but it's still big.
What bandwidth limits? I'm assuming the forward and backward passes have to be done sequentially?
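For a rough sense of the bottleneck being discussed: in plain data-parallel training, every optimizer step requires synchronizing gradients across workers (typically a ring all-reduce), and that traffic scales with model size. The sketch below is a back-of-envelope calculation under assumed numbers (fp16 gradients, a 70B-parameter model, illustrative link speeds); the function names are made up for the example, not from any library.

```python
# Back-of-envelope: gradient sync traffic per optimizer step in plain
# data-parallel training. All numbers here are illustrative assumptions.

def allreduce_bytes_per_step(n_params: float, bytes_per_grad: int = 2) -> float:
    """A ring all-reduce moves roughly 2x the gradient buffer per worker per step."""
    return 2 * n_params * bytes_per_grad

def step_comm_seconds(n_params: float, bandwidth_gbps: float) -> float:
    """Seconds spent just moving gradients at a given link bandwidth."""
    bytes_moved = allreduce_bytes_per_step(n_params)
    bytes_per_second = bandwidth_gbps * 1e9 / 8  # Gb/s -> bytes/s
    return bytes_moved / bytes_per_second

params = 70e9  # a 70B-parameter model, fp16 gradients
print(step_comm_seconds(params, 400))  # datacenter-class 400 Gb/s link -> 5.6 s
print(step_comm_seconds(params, 1))    # 1 Gb/s internet link -> 2240.0 s
```

The gap between those two numbers is why naive data parallelism over the internet doesn't work, and why approaches like the one in the linked post change the algorithm (infrequent sync, RL-style rollouts) rather than just adding more machines. Forward/backward being sequential within one worker is a separate issue from this cross-worker sync cost.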