TADA: Speech generation through text-acoustic synchronization

91 points • by smusamashah • today at 5:42 AM • 23 comments

Comments

To me, the speech sounds impressively expressive, but there is something off about the audio quality that I can't quite put my finger on.

The "Anger Speech" has an obvious lisp (maybe a homage to Elmer Fudd?), and I hear a similar but more subtle speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.

earthnail • today at 12:07 PM

I don’t understand the approach

> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.

So basically just concatenating the audio vectors without compression or discretization?

I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
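
For what it's worth, here is a toy sketch of how I read the quoted passage: instead of feeding the model one item per fixed-rate audio frame, you end up with exactly one continuous vector per text token. The mean pooling, the `spans` alignment, and all the shapes are my own assumptions for illustration, not TADA's actual encoder or aligner.

```python
import numpy as np

def pool_frames_per_token(frames, spans):
    """Collapse fixed-rate frames into one continuous vector per text token.

    frames: (num_frames, dim) array of continuous acoustic features.
    spans:  list of (start, end) frame indices per token, end exclusive.
    Mean pooling is an assumption made for this sketch only.
    """
    return np.stack([frames[start:end].mean(axis=0) for start, end in spans])

text_tokens = ["Hello", ",", " world"]

# Toy numbers: 10 fixed-rate frames with 4-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))

# Hypothetical alignment of frames to tokens (a real system would learn or compute this).
spans = [(0, 4), (4, 5), (5, 10)]

acoustic_per_token = pool_frames_per_token(frames, spans)

# The acoustic stream now has the same length as the text stream.
print(frames.shape[0], "frames ->", acoustic_per_token.shape[0], "vectors for", len(text_tokens), "tokens")
```

The point is only the change in sequence length: the acoustic stream is as long as the text, so the two can advance together one step at a time.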

kavalg • today at 1:25 PM

MIT license. Supported languages beyond English: ar, ch, de, es, fr, it, ja, pl, pt.

https://huggingface.co/HumeAI/tada-3b-ml

https://github.com/HumeAI/tada

tcbrah • today at 12:56 PM

the 0.09 RTF is wild but i wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. i use cartesia sonic for TTS in a video pipeline and the thing that actually matters for content creation isn't raw speed - it's whether you can get consistent emotional delivery across like 50+ scenes without it drifting. the 1:1 text-acoustic alignment should help with hallucinations for sure but does it handle things like mid-sentence pauses or emphasis on specific words? that's where most open source TTS falls apart IMO
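
for anyone unfamiliar with the number: RTF (real-time factor) is usually defined as generation time divided by the duration of the audio produced, so values below 1 are faster than real time. a quick back-of-the-envelope sketch, where the 30 second scene length is a made-up figure and only the 0.09 and the 50 scenes come from above:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return generation_seconds / audio_seconds

def generation_time(rtf, audio_seconds):
    """Estimated wall-clock time to synthesize audio_seconds of speech."""
    return rtf * audio_seconds

rtf = 0.09              # figure quoted for TADA
scenes = 50             # scene count mentioned above
scene_seconds = 30.0    # hypothetical average scene length

total_audio = scenes * scene_seconds
total_generation = generation_time(rtf, total_audio)

print(f"{total_audio:.0f} s of audio in about {total_generation:.0f} s of compute")
```

under those assumptions that is 25 minutes of audio in a bit over two minutes, which is why the consistency question matters more to me than shaving the RTF further.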

ilaksh • today at 1:23 PM

Okay, so they say it only does text continuation without fine-tuning. I assume that means we can't use it as a replacement for TTS in an AI agent chat, because it will not work without enough context?

Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?

I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.

mpalmer • today at 12:19 PM

"Long speech" is a faithful synthesis of a fairly irritating modern American English speech pattern.

qinqiang201 • today at 8:53 AM

Could it run on a MacBook? Or only on a GPU device?

OutOfHere • today at 7:41 AM

Will this run on CPU? (as opposed to GPU)
