I have a lot of experience in this area (and some patent applications). For Alexa, the device established a connection back to the server and kept it open, sending basically HTTP/2/SPDY/something like it over the wire after it detected the wake word. This allowed the STT to start processing before you finished talking, so there was only a small delay in processing the last few chunks of your utterance.
The answer came back over the same connection.
In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP/2 from the phone, and both iOS and Android will pretty much take care of that connection magically.
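For flavor, here's a minimal sketch of that stream-as-you-talk pattern in Python. The only real API here is httpx with its HTTP/2 support (needs the httpx[http2] extra); the endpoint URL, chunk size, and mic capture are made-up placeholders, not Alexa's or OpenAI's actual interface.

    # Sketch: stream audio chunks to the server over one persistent
    # HTTP/2 connection so STT can start before the utterance ends.
    # Hypothetical endpoint and audio source; httpx itself is real.
    import httpx

    def capture_audio():
        # Stand-in for a mic feed from wake word to end of speech:
        # yields ~100ms chunks of 16kHz 16-bit mono PCM (silence here).
        for _ in range(10):
            yield b"\x00" * 3200

    def stream_utterance():
        with httpx.Client(http2=True) as client:
            # The request body is a generator, so each chunk is sent as
            # soon as it's captured and the server can decode incrementally.
            resp = client.post(
                "https://voice.example.com/utterance",  # hypothetical
                content=capture_audio(),
                headers={"content-type": "application/octet-stream"},
            )
            # The answer comes back over the same connection.
            return resp.text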
The author is absolutely right: a real-time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms. Especially in the age of mobile phones, where most people are used to their real-time human-to-human communications having a delay.
(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)
"The author is absolutely right, a real time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms"
Not my experience, running around 6,000 voice conversations per day on a WebRTC + cascading (STT/LLM/TTS) architecture.
Maybe I misunderstood your comment, but that 500ms is basically the floor of a state-of-the-art voice implementation these days - if you are lucky, don't skimp, and do various expensive things like speculative decoding and reasoning. It's 450ms on the LLM pass alone. Every ms counts in commercial applications of voice AI. If you add 200ms or 300ms to that, it really degrades the conversation.
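To make the budget concrete, here's roughly how a cascading turn adds up. Only the 450ms LLM figure comes from above; every other number is an illustrative guess, not a measurement:

    # Rough turn-to-turn latency budget for a cascading voice stack.
    budget_ms = {
        "end_of_speech_detection": 150,  # assumed VAD hangover
        "stt_finalization": 80,          # assumed
        "llm_first_token": 450,          # the figure cited above
        "tts_first_audio": 120,          # assumed
        "network_and_glue": 60,          # assumed
    }
    print(sum(budget_ms.values()), "ms")  # ~860ms before any speculation tricks

That's why shaving 100-200ms via speculation is what gets you down toward the ~700ms range.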
We do a lot of voice stuff to support our business, largely with unsophisticated, non-technical users. Last year's attempts, with measured turn-to-turn latencies of around 1200-1500ms, led to a lot of user confusion, interruptions, abandoned conversations, and generally very unpleasant experiences. We are at around 700ms turn-to-turn now, depending on the tool usage needed, and it's approaching an OK experience, rivalling an interaction with an actual human. We are spending quite a lot to shave another 100ms off that. We do expensive, wasteful things such as speculative LLM passes and speculative tool executions (run a few LLM inferences as the user speaks, but don't actually execute non-idempotent tool calls until you know that LLM pass is usable and the user didn't say anything important at the tail end of their sentence) just to shave 100-200ms. When someone says 500ms is irrelevant, I am sure they are describing some other use case, not human-to-AI voice interactions.
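A sketch of that speculative pattern, with every name invented for illustration - this is the idea, not our actual code:

    # Sketch: run LLM passes on partial transcripts while the user is
    # still talking, but gate non-idempotent tool calls on the final
    # transcript. All names here are hypothetical.
    import asyncio

    async def llm_pass(transcript: str) -> dict:
        # Hypothetical LLM call; the sleep stands in for ~450ms of inference.
        await asyncio.sleep(0.45)
        return {"reply": f"ok: {transcript}", "tool_calls": []}

    async def handle_turn(partial_transcripts, final_transcript):
        # partial_transcripts: async iterator of stable partial STT results.
        # final_transcript: awaitable resolving to the finalized transcript.
        speculative, spec_input = None, None
        async for partial in partial_transcripts:
            if speculative:
                speculative.cancel()  # a newer partial supersedes the old pass
            spec_input = partial
            speculative = asyncio.create_task(llm_pass(partial))
        final = await final_transcript
        if speculative and spec_input == final:
            result = await speculative  # speculation held: latency saved
        else:
            if speculative:
                speculative.cancel()  # the tail of the sentence mattered
            result = await llm_pass(final)
        # Only after the transcript is confirmed do side effects run.
        for call in result["tool_calls"]:
            ...  # execute non-idempotent tool calls here
        return result["reply"]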
In my experience with voice AI, the problem is not the occasional dropped WebRTC packet. The really hard problems are strong background noise, echo, and of course accents. WebRTC, with its polished AEC (acoustic echo cancellation) implementations, helps quite a lot, at least with echoes. I get that the protocol is a major PITA to implement at OpenAI scale, but for anything short of hyperscale applications there are lots of good, viable solutions and commercial providers (say, Daily, for instance) that make it a non-problem. The real problems to solve are still elsewhere. But boy, add 500ms to my latency budget and you've killed my application.