You can even train in 4 & 8 bits with newer microscaled formats! From https://arxiv.org/pdf/2310.10537 to gpt-oss being trained (partially) natively in MXFP4 - https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-...
To Nemotron 3 Super, which had 25T of nvfp4 native pretraining! https://docs.nvidia.com/nemotron/0.1.0/nemotron/super3/pretr...
Newer quantization approaches are even better, 4-bits gets you no meaningful loss relative to FP16: https://github.com/z-lab/paroquant
Hopefully Microsoft keeps pushing BitNet too, so only "1.58" bits are needed.
I think fractional representations are only relevant for training at this point, and bf16 is sufficient, no need for fp4 and such.