You can test it by asking it to change the pitch of its voice, make specific sounds (like laughter), differentiate between words that are spelled the same but pronounced differently (the noun “record” vs. the verb “record”), etc.
Good idea, but an external “bolted on” LLM-based TTS would still pass that in many cases, right?