randomNumber7 • today at 4:49 PM • 1 reply

> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog).

Replies

spott • today at 5:34 PM

https://newsletter.maartengrootendorst.com/p/a-visual-guide-... (in a link from here: https://developers.googleblog.com/gemma-4-12b-the-developer-..., which was linked in the text of the post, but not the linkdump at the end).
