> If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare.
cf. the hardware lottery: https://arxiv.org/abs/2009.06489