MoE isn't inherently better, but I do think it's still an under explored space. When your sparse model can do 5 runs on the same prompt in the same time as a dense model takes to generate one, there opens up all sorts of interesting possibilities.