Isn't the Sora video model a ViT with spatiotemporal inputs (so they've found a way to compress that down), but at the same time LeCunn wouldn't consider that a world model?
VideoGen models have to have decoder output heads that reproduce pixel level frames. The loss function involes producing plausible image frames that requires a lot of detailed reconstruction.
I assume that when you get out of bed in the morning, the first thing you dont do is paint 1000 1080p pictures of what your breakfast looks like.
LeCunns models predict purely in representation space and output no pixel scale detailed frames. Instead you train a model to generate a dower dimension representation of the same thing from different views, penalizing if the representation is different ehen looking at the same thing
VideoGen models have to have decoder output heads that reproduce pixel level frames. The loss function involes producing plausible image frames that requires a lot of detailed reconstruction.
I assume that when you get out of bed in the morning, the first thing you dont do is paint 1000 1080p pictures of what your breakfast looks like.
LeCunns models predict purely in representation space and output no pixel scale detailed frames. Instead you train a model to generate a dower dimension representation of the same thing from different views, penalizing if the representation is different ehen looking at the same thing