Hacker News

Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift

352 points | by ipotapov | today at 7:43 AM | 114 comments

Comments

armcat | today at 9:23 AM

I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my Blackwell device. I will try this on Mac now as well.

There are a few caveats here, for those of you venturing into this, since I've spent considerable time looking at these voice agents. First, a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).

Another aspect, after talking to people about PersonaPlex, is that this full-duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite difficult to train. On the other hand, ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API-based endpoints.
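The composability argument is easy to make concrete: each stage of the cascaded pipeline is just a function, so any stage can be swapped out independently. A toy sketch (the stand-in lambdas are hypothetical placeholders, not real model wrappers):

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Each stage is just a callable, so swapping a tiny local model for an
# API-backed one is a one-field change.
@dataclass
class VoicePipeline:
    vad: Callable[[bytes], bool]   # is there speech in this frame?
    asr: Callable[[bytes], str]    # audio -> transcript
    llm: Callable[[str], str]      # transcript -> reply text
    tts: Callable[[str], bytes]    # reply text -> audio

    def step(self, frame: bytes) -> Optional[bytes]:
        if not self.vad(frame):
            return None            # silence: nothing to do
        return self.tts(self.llm(self.asr(frame)))

# Toy stand-ins; in practice these would wrap e.g. a VAD model,
# Parakeet/Whisper, a local or API LLM, and a TTS model.
pipe = VoicePipeline(
    vad=lambda f: len(f) > 0,
    asr=lambda f: "hello",
    llm=lambda t: t.upper(),
    tts=lambda t: t.encode(),
)
print(pipe.step(b"\x00\x01"))  # b'HELLO'
```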

vessenes | today at 8:32 AM

This is cool. It makes me want an Unsloth quant though! A 7B local model with tool calling would be genuinely useful, although I understand this is not that.

UPDATE: I'd skip this for now. It does not allow any kind of interactive conversation, as I learned after downloading 5 GB of models; it's a proof of concept that takes a WAV file in.

KaiserPister | today at 1:31 PM

I am strongly put off by the LLM writing in this piece. It makes me question the quality of the project before even attempting a download.

Who would put effort into building this only to compose a low effort puff piece?

d4rkp4ttern | today at 1:16 PM

Built out the demo on my M1 Max MacBook and it was absolutely terrible. Around 10 seconds for each reply, and even then it was saying something totally unrelated.

4dregress | today at 8:39 AM

This sounds quite dangerous https://www.theguardian.com/technology/2026/mar/04/gemini-ch...

d4rkp4ttern | today at 3:20 PM

Sesame was the best full-duplex voice demo I ever came across; I wonder what's up with them now. https://app.sesame.com/

scosman | today at 9:02 AM

I’m a big fan of WhisperKit for this, and they just added TTS. Great because they support features like speaker diarization (“who spoke when”) and custom dictionaries.

Here’s a load test where they run four models in real time on the same device:

- Qwen3-TTS - text to speech

- Parakeet v2 - Nvidia speech to text model

- Canary v2 - multilingual / translation STT

- Sortformer - speaker diarization (“who spoke when”)

https://x.com/atiorh/status/2027135463371530695

sowbug | today at 4:05 PM

I would like my phone to forward spam calls to this, with a system prompt to slowly provide fake personal and financial information intermingled with chatter about sports and the weather.

ilaksh | today at 11:57 AM

Does anyone have working code for fine-tuning PersonaPlex for outgoing calls? I have tried to take the fine-tuning LoRA stuff from Kyutai/moshi-finetune and apply it to the PersonaPlex code. Or, more accurately, various LLMs have worked on that.

I have something that seems to work in a rough way, but only if I turn the LoRA scaling factor up to 5, and that generally screws it up in other ways.

And then of course when GPT-5.3 Codex looked at it, it said that speaker A and speaker B were switched in the LoRA code. So that is now completely changed and I am going to do another dataset generation and training run.

If anyone is curious, it's a bit of a mess, but it's on my GitHub (runvnc) under moshi-finetune and personaplex. It even has a Gradio app to generate data and train. But so far no usable results.
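For anyone unfamiliar with the scaling factor mentioned above: in LoRA the low-rank update is multiplied by a scale before being added to the frozen weights, so turning it up to 5 makes whatever the adapter learned dominate the base model, which is consistent with it breaking behavior elsewhere. A rough NumPy sketch (dimensions and values illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4                      # hidden dim, LoRA rank

W = rng.normal(size=(d, d))       # frozen base weight
A = rng.normal(size=(d, r))       # trained low-rank factors
B = rng.normal(size=(r, d))

def effective_weight(scale: float) -> np.ndarray:
    # Standard LoRA: W' = W + scale * (A @ B),
    # where scale is typically alpha / rank.
    return W + scale * (A @ B)

# A larger scale makes the update overwhelm the base weights.
for s in (0.5, 1.0, 5.0):
    ratio = np.linalg.norm(s * (A @ B)) / np.linalg.norm(W)
    print(f"scale={s}: update/base norm ratio = {ratio:.2f}")
```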

sarmike31 | today at 5:24 PM

If you're interested in demos without installing the thing, the author has a site here: https://research.nvidia.com/labs/adlr/personaplex/

jwr | today at 8:42 AM

As a heavy user of MacWhisper (for dictation), I'm looking forward to better speech-to-text models. MacWhisper with the Whisper Large v3 Turbo model works fine, but latency adds up quickly, especially if you use online LLMs for post-processing (which really improves things a lot).

sgt | today at 8:51 AM

My problem with TTS is that I've been struggling to find models that support less common use cases like mixed bilingual Spanish/English and also in non-ideal audio conditions. Still haven't found anything great, to be honest.

dubeye | today at 11:16 AM

It doesn't feel like speech recognition has been improving at the same rate as other generative AI. It had a big jump up to about 6% WER a year or two ago, but it seems to have plateaued. Am I just using the wrong model? Or is human-level error rate, which I estimate to be about 5%, some kind of limit?
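For reference, WER is just word-level edit distance divided by the reference word count, which is why it bottoms out quickly once transcripts are nearly right. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over 6 reference words: 1/6 ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```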

Krisso | today at 1:12 PM

Awesome, but given the Apple Silicon population and typical configurations, how does this fare on an M1 with 8 GB of total RAM? I'd imagine this makes running another LLM for tool calls and inference tough to impossible.
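A rough back-of-envelope suggests 8 GB is indeed tight (weights only; KV cache and runtime overhead add more on top):

```python
# Approximate weight memory for a 7B-parameter model at common
# quantization levels: params * bits / 8 bytes.
params = 7e9
for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit: ~{gb:.1f} GB")
# Even 4-bit weights (~3.5 GB) claim close to half of an 8 GB machine's
# unified memory, leaving little headroom for a second LLM.
```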

ruhith | today at 2:37 PM

Cool demo but without tool calling this is basically a fast parrot. The traditional pipeline is slower but at least you can plug in a real brain.

michelsedgh | today at 8:39 AM

It's really cool, but for real-life use cases I think it lacks the ability to emit a silent text stream (for JSON and other structured output, for example) so it can run commands for you as it talks. Right now it can only listen and talk back, which limits what you can make with this a lot.

WeaselsWin | today at 8:27 AM

This full-duplex speech thing has already been in use for quite a long time by the big players in whatever "conversation mode" their apps offer, right? Those modes always seemed fast enough that they surely weren't going through the STT->LLM->TTS pipeline.

ricardobeat | today at 12:24 PM

No mention of tool use. If the model cannot emit both text and audio at the same time, to enable tools, it’s not really useful at all for voice agents.

Serenacula | today at 8:34 AM

This is really cool. I think what I really wanna see though is a full multimodal Text and Speech model, that can dynamically handle tasks like looking up facts or using text-based tools while maintaining the conversation with you.

nerdsniper | today at 9:34 AM

Do we have real-time (or close-enough) face-to-face models as well? I'd like to gracefully prove a point to my boss that some of our IAM procedures need to be updated.

pothamk | today at 8:41 AM

What’s interesting about full-duplex speech systems isn’t just the model itself, but the pipeline latency.

Even if each component is fast individually, the chain of audio capture → feature extraction → inference → decoding → synthesis can quickly add noticeable delay.

Getting that entire loop under ~200–300ms is usually what makes the interaction start to feel conversational instead of “assistant-like”.
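A simple budget makes the point concrete. The per-stage numbers below are illustrative assumptions, not measurements, but they show how individually "fast" stages still sum past a conversational target:

```python
# Illustrative per-stage latencies (ms) for a cascaded voice loop.
# Each stage looks fast in isolation, but the serial sum is what the
# user perceives as response delay.
budget_ms = {
    "audio capture buffer": 40,
    "feature extraction":   10,
    "ASR inference":        80,
    "LLM first token":     120,
    "TTS first audio":      60,
}
total = sum(budget_ms.values())
print(f"end-to-end: {total} ms")  # 310 ms: already past a ~300 ms target
```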

Tepix | today at 8:26 AM

It's cool tech and I will give it a try. I will probably make an 8-bit quant instead of the 4-bit one, which should be easy with the provided script.

That said, I found the example telling:

Input: “Can you guarantee that the replacement part will be shipped tomorrow?”

Response with prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”

It's not surprising that people have little interest in talking to AI if they're being lied to.

PS: Is it just me, or are we seeing AI-generated copy everywhere? I just hope the general talking style will not drift towards this style. I don't like it one bit.

nicktikhonov | today at 10:28 AM

From what I've seen, it's really easy to get PersonaPlex stuck in a death spiral: talking to itself, stuttering, and descending deeper and deeper into total nonsense. Useless for any production use case. But I think this kind of end-to-end model is needed to correctly model conversations. An STT/TTS pipeline compresses a lot of information (tone, timing, emotion) out of the model's input data, so it seems obvious that the results will always be somewhat robotic. Excited to see the next iteration of these models!

khalic | today at 9:27 AM

Ugh, Qwen. I wish they'd use an open-data model for these kinds of projects.

api | today at 11:35 AM

How close are we to the Star Trek universal translator?
