Follow-up reading for the more technical and research-minded people here:
Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...
Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...
To try the speed on the playground: http://playground.kog.ai
It looks like DTP is a distinct architectural choice that would require training new models accordingly? If so, it couldn't just be used to run inference on existing models.