energy123 • yesterday at 5:38 PM • 1 reply

Isn't the Sora video model a ViT with spatiotemporal inputs (so they've found a way to compress that down), but at the same time LeCun wouldn't consider that a world model?

Replies

LarsDu88 • yesterday at 11:25 PM

Video generation models have to have decoder output heads that reproduce pixel-level frames. The loss function involves producing plausible image frames, which requires a lot of detailed reconstruction.

I assume that when you get out of bed in the morning, the first thing you do isn't to paint 1000 1080p pictures of what your breakfast looks like.

LeCun's models predict purely in representation space and output no pixel-scale detailed frames. Instead, you train a model to generate a lower-dimensional representation of the same thing from different views, penalizing it if the representation is different when looking at the same thing.
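To make the contrast concrete, here is a rough sketch of the two kinds of loss. This is only an illustration under my own assumptions, not Sora's or LeCun's actual code: the frame size, the 8-dimensional representation, the random linear `encode` function, and the names `pixel_reconstruction_loss` and `representation_loss` are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a tiny 16x16 grayscale "frame" and an 8-dim representation.
FRAME_DIM = 16 * 16
REPR_DIM = 8

# Stand-in linear encoder/decoder with random weights (illustration only,
# nothing here is trained).
W_enc = rng.normal(scale=0.1, size=(FRAME_DIM, REPR_DIM))
W_dec = rng.normal(scale=0.1, size=(REPR_DIM, FRAME_DIM))


def encode(frame):
    """Map a flattened frame to a low-dimensional representation."""
    return frame @ W_enc


def pixel_reconstruction_loss(frame, target_frame):
    """Generative-style loss: decode back to pixels and compare every pixel."""
    predicted_pixels = encode(frame) @ W_dec
    return float(np.mean((predicted_pixels - target_frame) ** 2))


def representation_loss(view_a, view_b):
    """Representation-space loss: compare only the embeddings of two views."""
    return float(np.mean((encode(view_a) - encode(view_b)) ** 2))


# Two slightly different views of the same scene, plus an unrelated scene.
scene = rng.normal(size=FRAME_DIM)
view_a = scene + 0.01 * rng.normal(size=FRAME_DIM)
view_b = scene + 0.01 * rng.normal(size=FRAME_DIM)
other_scene = rng.normal(size=FRAME_DIM)

same_scene = representation_loss(view_a, view_b)
different_scene = representation_loss(view_a, other_scene)
pixel_loss = pixel_reconstruction_loss(view_a, view_b)

print("representation loss, same scene:     ", same_scene)
print("representation loss, different scene:", different_scene)
print("pixel loss compares", FRAME_DIM, "values; representation loss compares", REPR_DIM)
```

The pixel loss has to account for every one of the 256 output values, while the representation loss only compares 8 numbers per view. One caveat: a loss like this on its own can be trivially minimized by mapping every input to the same vector, so real methods add something extra to prevent that collapse.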
