I’m a big fan of whisperKit for this, and they just added TTS. Great because they support features like speaker diarization (“who spoke when”) and custom dictionaries.
Here’s a load test where they run 4 models in realtime on same device:
- Qwen3-TTS - text to speech
- Parakeet v2 - Nvidia speech to text model
- Canary v2 - multilingual / translation STT
- Sortformer - speaker diarization (“who spoke when”)