spott • today at 5:28 PM

This is just early fusion basically.

FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818

I've been waiting for something like this to be released since then.

The annoying thing is that Chameleon did multi-modal output based on the same principles, but this model is multi-modal on the input side only... (I'm curious how they did pre-training without multi-modal outputs as well. I wonder if they just chopped off the image outputs rather than support them.)
