vessenes • today at 11:54 AM

I agree that full duplex is the amazing bit. For instance, the three engineers shouting trivia questions while a timer is running — that’s extremely novel as far as I can tell.

I’d like to believe from the demos that this ability to wait falls out of the model as an emergent property, perhaps coming out of a small RL loop, rather than being a specifically trained behavior, à la a VAD component in a stack. Either way, I would guess that VAD absolutely cannot do this right now: interruptions are highly annoying in every voice interaction experience, and if it were a simple matter of better post-training, SOMEONE would have done this already, e.g. ElevenLabs.

But I disagree with your idea that this is too expensive or too hard to replicate. For me, yes. But there’s an existence proof: a small team at a new company just did this without a real roadmap, certainly for less than $1B and probably in less than two years. They are almost certainly less skilled at the items on your list of what’s needed to replicate this than teams at the frontier labs, who have been given a roadmap. So I don’t think it’s as difficult as you propose, from an organizational skills perspective.
