mchinen • today at 5:21 PM • 1 reply

The audio side is even more interesting: it seems they got rid of positional embeddings entirely and are just doing a single linear transform to match the LLM input dimension, and that's it.

> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
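For anyone trying to picture what "a single linear transform" on raw audio could look like, here is a rough sketch. It is only a guess at the general shape, not the announced model's code: the framing step, `FRAME_SIZE`, `D_MODEL`, and the function names are all hypothetical, and the quoted post gives no such details.

```python
import random

def frame_audio(samples, frame_size):
    """Split raw samples into non-overlapping frames, dropping any remainder."""
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]

def linear_project(frames, weight, bias):
    """Map each frame to a d_model-sized vector: y = W x + b (no encoder, no nonlinearity)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]

FRAME_SIZE = 4  # hypothetical: samples per audio "token"
D_MODEL = 3     # hypothetical: LLM input dimension (tiny, for illustration)

rng = random.Random(0)
# weight has shape (D_MODEL, FRAME_SIZE); in a real model it would be learned.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(FRAME_SIZE)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

samples = [rng.uniform(-1.0, 1.0) for _ in range(10)]  # stand-in for a raw waveform
audio_tokens = linear_project(frame_audio(samples, FRAME_SIZE), weight, bias)
```

Each resulting vector has the same width as a text token embedding, so it could be placed in the same input sequence. Nothing in this sketch encodes where a frame sits in time, which is the point the comment is raising.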

Replies

make3 • today at 5:33 PM

I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally and not worth mentioning.
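To put a rough number on "extremely cheap": a minimal sketch, assuming the classic sinusoidal scheme purely for illustration. The thread does not say which mechanism is used, and position could just as well come from something like rotary embeddings applied inside the LLM's attention layers. The function names here are hypothetical.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding: sin on even indices, cos on odd, sharing a frequency per pair."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

def add_positions(tokens):
    """Add a position vector to each token embedding: one addition per element."""
    d_model = len(tokens[0])
    return [
        [x + p for x, p in zip(tok, sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(tokens)
    ]

# Two all-zero "tokens" make the added position signal directly visible.
with_positions = add_positions([[0.0] * 4, [0.0] * 4])
```

Per token this costs about `d_model` additions, while a linear projection from a frame of `frame_size` samples costs about `frame_size * d_model` multiply-adds, so the positional step is a small fraction of the input-side work.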
