Hacker News

earthnail · today at 12:07 PM · 1 reply

I don’t understand the approach

> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

So basically just concatenating the audio vectors without compression or discretization?

I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.


Replies

yorwba · today at 1:11 PM

It's a variable-rate codec. The audio is still compressed, but by how much depends on the duration of the segment corresponding to a particular text token. The TTS model predicts one audio token per text token along with that token's duration, and the audio decoder fills in a waveform of the appropriate length.
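
In other words, each text token carries one acoustic vector plus a predicted duration, and the decoder expands each pair into a waveform segment of that length. A minimal sketch of that expansion step (all names and the stand-in noise "decoder" are invented for illustration; the real system would use a neural vocoder):

```python
import random

SAMPLE_RATE = 16_000  # assumed sample rate for the sketch

def decode(acoustic_vectors, durations_sec, sample_rate=SAMPLE_RATE):
    """Expand (acoustic vector, duration) pairs into one waveform.

    Stand-in decoder: each vector just seeds a deterministic noise
    segment of the predicted length, so the output length is driven
    entirely by the per-token durations.
    """
    waveform = []
    for vec, dur in zip(acoustic_vectors, durations_sec):
        n_samples = round(dur * sample_rate)
        rng = random.Random(sum(vec))  # per-vector "content" seed
        waveform.extend(rng.uniform(-1.0, 1.0) for _ in range(n_samples))
    return waveform

# Three text tokens -> three acoustic vectors with different predicted
# durations, so the effective compression rate varies per token.
vectors = [[0.1] * 8, [0.2] * 8, [0.3] * 8]
durations = [0.12, 0.30, 0.05]  # seconds per token
wav = decode(vectors, durations)
print(len(wav))  # (0.12 + 0.30 + 0.05) * 16000 = 7520 samples
```

The point of the sketch is the bookkeeping, not the synthesis: a token spoken for 300 ms and one spoken for 50 ms are both represented by a single vector, which is where the variable compression rate comes from.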