I don’t understand the approach
> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
So basically just concatenating the audio vectors without compression or discretization?
I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
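For what it's worth, here is one way to read the quoted passage, as a toy sketch only. I am assuming frame-level acoustic features get reduced to one vector per text token by mean-pooling over an alignment; the pooling rule, the `pool_frames_per_token` helper, and the span format are all hypothetical and not taken from the paper. If something like this is going on, then it is not "no compression": the audio sequence is shortened to the text length, it just stays continuous instead of being quantized.

```python
from typing import List, Tuple

Vector = List[float]


def pool_frames_per_token(
    frames: List[Vector], spans: List[Tuple[int, int]]
) -> List[Vector]:
    """Mean-pool the frames inside each [start, end) span.

    Hypothetical illustration: returns one continuous vector per text token,
    so the output length equals the number of tokens, not the number of frames.
    """
    pooled: List[Vector] = []
    for start, end in spans:
        if not 0 <= start < end <= len(frames):
            raise ValueError(f"bad span ({start}, {end})")
        chunk = frames[start:end]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Toy data: 6 audio frames with 2-D features, aligned to 3 text tokens.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 4.0], [4.0, 6.0], [1.0, 1.0], [3.0, 5.0]]
tokens = ["Hel", "lo", "!"]
spans = [(0, 2), (2, 4), (4, 6)]  # assumed alignment: frame range per token

token_vectors = pool_frames_per_token(frames, spans)

# One (text token, acoustic vector) pair per step: the "lockstep" stream.
stream = list(zip(tokens, token_vectors))
```

The point of the sketch is only the shapes: 6 frames go in, 3 vectors come out, one per token, and nothing is rounded to a codebook entry.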
MIT license, supported languages beyond English: ar, ch, de, es, fr, it, ja, pl, pt.
the 0.09 RTF is wild but i wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. i use cartesia sonic for TTS in a video pipeline and the thing that actually matters for content creation isn't raw speed, it's whether you can get consistent emotional delivery across like 50+ scenes without it drifting. the 1:1 text-acoustic alignment should help with hallucinations for sure but does it handle things like mid-sentence pauses or emphasis on specific words? that's where most open source TTS falls apart IMO
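For anyone unfamiliar with the number: assuming RTF here means the usual real-time factor (synthesis time divided by the duration of the audio produced), a rough back-of-the-envelope looks like this. The 60-second clip length is just an example value.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


rtf = 0.09            # figure quoted in the thread
audio_seconds = 60.0  # example clip length (assumption)

synthesis_seconds = rtf * audio_seconds  # time needed to generate the clip
speedup = 1.0 / rtf                      # how many times faster than real time

print(f"{audio_seconds:.0f}s of audio in about {synthesis_seconds:.1f}s "
      f"(roughly {speedup:.0f}x faster than real time)")
```

So under that definition a minute of speech would take a bit over five seconds to generate, before any extra cost from cloning or prosody control.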
Okay, so they say it only does text continuation without fine-tuning. I assume that means we can't use it as a replacement for TTS in an AI agent chat, because it won't work without enough context?
Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?
I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.
"Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.
Could it run on a MacBook? Or only on a GPU device?
To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on.
The "Anger Speech" has an obvious lisp (maybe a homage to Elmer Fudd?). I hear a similar, though more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.