I don’t understand the approach
> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
So basically just concatenating the audio vectors without compression or discretization?
I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
It's a variable-rate codec. The audio is still compressed, but the compression ratio depends on the duration of the speech segment corresponding to each text token. The TTS model predicts one continuous audio vector per text token along with that token's duration, and the audio decoder fills in a waveform of the appropriate length.
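To make the variable-rate idea concrete, here's a minimal sketch of the upsampling step: one vector per text token gets repeated for its predicted duration to produce a frame-level sequence (a length-regulator-style expansion — the function name and repeat-based upsampling are my assumptions for illustration, not TADA's actual decoder):

```python
def length_regulate(token_vectors, durations):
    # Repeat each per-token acoustic vector for its predicted number
    # of frames, yielding a variable-length frame-level sequence.
    # (Illustrative sketch only, not the paper's actual architecture.)
    frames = []
    for vec, dur in zip(token_vectors, durations):
        frames.extend([vec] * dur)
    return frames

# Three text tokens, each paired with one 2-dim acoustic vector
vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durs = [5, 2, 9]  # predicted frame counts per token (variable rate)
out = length_regulate(vecs, durs)
print(len(out))  # 16 frames total = sum of durations
```

So the LM only ever sees one vector per text token, and the "compression ratio" falls out of how many frames each token's duration maps to.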