Hacker News

tcbrah, today at 12:56 PM

The 0.09 RTF is wild, but I wonder how much of that speed advantage disappears once you need voice cloning or fine-grained prosody control. I use Cartesia Sonic for TTS in a video pipeline, and the thing that actually matters for content creation isn't raw speed; it's whether you can get consistent emotional delivery across 50+ scenes without it drifting. The 1:1 text-acoustic alignment should help with hallucinations for sure, but does it handle things like mid-sentence pauses or emphasis on specific words? That's where most open-source TTS falls apart, IMO.
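
For anyone unfamiliar with the number: RTF (real-time factor) is usually defined as synthesis time divided by the duration of the audio produced, so 0.09 works out to roughly 11x faster than real time. A rough sketch of that arithmetic follows; the 50 scenes at 20 seconds each are made-up numbers for illustration, not measurements from any model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Estimated wall-clock seconds needed to synthesize the given audio."""
    return rtf * audio_seconds


# Hypothetical workload (assumed, not from the thread): 50 scenes, 20 s each.
SCENES = 50
SECONDS_PER_SCENE = 20.0
RTF = 0.09  # the figure quoted above

total_audio = SCENES * SECONDS_PER_SCENE          # seconds of speech to generate
total_synthesis = synthesis_time(RTF, total_audio)  # estimated compute time
speedup = 1.0 / RTF                                 # times faster than real time

if __name__ == "__main__":
    print(f"audio: {total_audio:.0f} s, synthesis: {total_synthesis:.0f} s, "
          f"speedup: {speedup:.1f}x")
```

At that rate the raw synthesis time for a batch like this is small either way, which is why consistency across scenes ends up mattering more than speed.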


Replies

regularfry, today at 3:58 PM

Given that it's one-to-one audio and text tokens, you'd get mid-sentence pauses if you just stopped feeding it.