Hacker News

OpenAI’s WebRTC problem

342 points by atgctg · last Thursday at 5:11 PM · 86 comments

Comments

Sean-Der · today at 1:33 AM

Responding to some technical points first; after that, I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc. are going, though.

> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

This is the opposite of the feedback I get. Users want instant responses. Any delay in generating responses or handling interruptions kills the magic. You also don't want to send faster than real time: if the user interrupts the model, you've just wasted a bunch of bandwidth sending 3 minutes of audio (of which only 10 seconds played).

> TTS is faster than real-time

The latest/aspirational voice AI is moving away from what the author describes (see https://research.nvidia.com/labs/adlr/personaplex/). Audio is trickled in/out in 20 ms frames.
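For readers unfamiliar with the pattern, a minimal sketch of that real-time pacing: the server dribbles audio out in 20 ms frames instead of dumping the whole utterance at once, so a barge-in wastes almost nothing. `sendFrame` and `interrupted` are hypothetical stand-ins for the transport and the barge-in signal.

```ts
const FRAME_MS = 20;
const SAMPLE_RATE = 48_000;
const SAMPLES_PER_FRAME = (SAMPLE_RATE * FRAME_MS) / 1000; // 960 samples

// Trickle a synthesized utterance out in real time instead of blasting it all
// at once. An interruption then wastes at most one frame of bandwidth.
async function paceOut(
  pcm: Int16Array,                        // full synthesized utterance
  sendFrame: (frame: Int16Array) => void, // hypothetical transport hook
  interrupted: () => boolean,             // hypothetical barge-in flag
): Promise<void> {
  for (let off = 0; off < pcm.length; off += SAMPLES_PER_FRAME) {
    if (interrupted()) return; // stop immediately on barge-in
    sendFrame(pcm.subarray(off, off + SAMPLES_PER_FRAME));
    await new Promise((r) => setTimeout(r, FRAME_MS)); // hold real-time pace
  }
}
```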

> We really hope the user’s source IP/port never changes, because we broke that functionality.

That is supported: when packets arrive from a new IP for a known ufrag, the connection follows them.

> It takes a minimum of 8* round trips (RTT)

That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/

> I’d just stream audio over WebSockets

You lose stuff like AEC. You also push complexity onto clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with the Realtime API over WebSockets (lots of code, and having to do everything by hand).
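For context, the onboarding simplicity being pointed at looks roughly like this in the browser; the `/realtime` signalling endpoint below is a placeholder, not OpenAI's actual API.

```ts
// Browser side of a WebRTC voice session: one offer/answer exchange, and the
// media stack (capture, AEC, Opus, pacing, jitter buffering) is handled for you.
async function connectVoice(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Send the microphone upstream.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((t) => pc.addTrack(t, mic));

  // Play whatever audio comes back.
  pc.ontrack = (e) => {
    const audio = new Audio();
    audio.srcObject = e.streams[0];
    void audio.play();
  };

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // A single HTTP round trip carries the SDP exchange ("/realtime" is a placeholder).
  const resp = await fetch("/realtime", { method: "POST", body: offer.sdp });
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return pc;
}
```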

----

I think if I had my choice I would pick the Offer/Answer model and then do QUIC instead of DTLS+SCTP. Maybe RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.
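As a rough illustration of that QUIC direction: the WebTransport API already exposes QUIC datagrams to the browser. This is a sketch against a placeholder URL, not a claim about how any vendor would ship it.

```ts
// QUIC datagrams from the browser via WebTransport: UDP-like, unordered,
// drop-tolerant delivery, i.e. roughly the transport role RTP plays in WebRTC,
// minus the media stack. The URL is a placeholder.
async function quicAudio(url: string): Promise<void> {
  const transport = new WebTransport(url);
  await transport.ready;

  const writer = transport.datagrams.writable.getWriter();
  await writer.write(new Uint8Array(160)); // stand-in for one encoded audio frame

  const reader = transport.datagrams.readable.getReader();
  const { value } = await reader.read(); // first incoming frame, if any
  console.log("received", value?.byteLength ?? 0, "bytes");
}
```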

jedberg · today at 2:34 AM

I have a lot of experience in this area (and some patent applications). For Alexa, the device established a connection back to the server and kept it open, sending basically HTTP/2/SPDY/something like it over the wire after it detected the wake word. This allowed the STT to start processing before you finished talking, so there was only a small delay in processing the last few chunks of your utterance.

The answer came back over the same connection.

In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP/2 from the phone, and both iOS and Android will pretty much take care of that connection magically.
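A minimal sketch of that pattern, assuming a hypothetical `/utterance` endpoint and using fetch's streaming request body (currently Chromium-only), so server-side STT can begin before the utterance ends:

```ts
// Start the upload the moment the wake word fires; the server can run STT on
// each chunk while the user is still talking.
function streamUtterance(chunks: AsyncIterable<Uint8Array>): Promise<Response> {
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of chunks) controller.enqueue(chunk); // ~20-100 ms of audio each
      controller.close(); // end of utterance; only the tail is left to process
    },
  });
  return fetch("/utterance", {
    method: "POST",
    body,
    duplex: "half", // required for streaming request bodies; not yet in all TS lib typings
  } as RequestInit & { duplex: "half" });
}
```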

The author is absolutely right: a real-time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500 ms, especially in the age of mobile phones, where most people are used to their real-time human-to-human communications having a delay.

(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)

awkii · today at 12:56 AM

This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means you need to quickly acclimate to SDP, TURN/STUN, ICE candidates, offers, peer-to-peer protocols, and the complex handshake that gets implemented from scratch each time. I can't imagine rewriting the whole trenchcoat of protocols and unintended "best practices".

r2vcap · today at 12:41 AM

This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.

Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
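As a toy illustration of that point, here is the skeleton of a jitter buffer you would end up rebuilding over a bare WebSocket; the depth and API are invented for the example.

```ts
// Frames arrive with variable delay; playback pulls at a fixed 20 ms cadence.
// The buffer trades added latency (depth) for smoothness: exactly the
// machinery WebRTC already contains.
class JitterBuffer {
  private frames = new Map<number, Uint8Array>(); // seq -> frame
  private nextSeq = 0;

  constructor(private readonly depth = 3) {} // frames of latency we tolerate

  push(seq: number, frame: Uint8Array): void {
    this.frames.set(seq, frame);
  }

  // Called every 20 ms by the playout clock.
  pull(): Uint8Array | null {
    if (!this.frames.has(this.nextSeq) && this.frames.size < this.depth) {
      return null; // underrun: stretch silence rather than glitch
    }
    const frame = this.frames.get(this.nextSeq) ?? null; // null == frame never made it
    this.frames.delete(this.nextSeq);
    this.nextSeq++;
    return frame;
  }
}
```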

fidotron · today at 1:10 AM

> WebRTC is designed to degrade and drop my prompt during poor network conditions

If you want real time, that's what you're going to deal with. If you don't want real time, and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't be sending audio on the wire at all.

Aeroi · today at 2:43 AM

I run the Gemini Live API over a mesh-hosted managed WebRTC cloud. It works fantastically, and I've been running it for two years. You can try WebSockets, handle ephemeral keys, etc., but when you speak with people running voice agents at scale in this space, many of the issues are solved with WebRTC and Pipecat and the many resources allocated to solved problems in this space. It certainly feels like overkill, and it probably is, but once the connection is established, it's pretty magical. Startup time and buffering have been solved for quicker voice connections too: https://github.com/pipecat-ai/pipecat-examples/tree/main/ins... (video is harder)

solatic · today at 7:57 AM

Why does the voice need to be sent to the server? Why not perform speech-to-text on-device? Is the p10 phone/laptop not capable of this yet, despite every "dictation" feature I see in every modern OS?
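For what it's worth, browsers do expose a dictation-style API; a sketch follows. Caveat: whether recognition actually runs on-device varies by browser and platform (Chrome has historically proxied audio to a server), which may be part of the answer to this question.

```ts
// Dictation in the browser via the Web Speech API (prefixed in Chrome).
const SR =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognizer = new SR();
recognizer.continuous = true;
recognizer.interimResults = true;
recognizer.onresult = (e: any) => {
  const last = e.results[e.results.length - 1];
  // Text, not audio, is what would then go up to the model.
  console.log(last[0].transcript, last.isFinal ? "(final)" : "(interim)");
};
recognizer.start();
```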

vachina · today at 10:23 AM

Why worry for OpenAI? Their product will fail if it doesn't work, and then they will figure it all out later.

mohsen1 · today at 5:54 AM

I've been using LiveKit, which is also WebRTC-based, and it's super annoying when audio slows down or speeds up at times when the connection is not robust. We were using OpenAI's WebSocket-based Realtime audio, which was way too slow. So I don't know which one is better. Generally our users like the LiveKit implementation better, so maybe WebRTC with enough clever hacks is the answer.

This blog was super insightful for understanding the root problems in the current implementation, though.

yalok · today at 5:49 AM

There are tons of ways to fine-tune WebRTC so it won't corrupt audio on a poor network; it has all the controls to smoothly trade off latency vs. quality. Not just NACKs: FEC, disabling PLC/acceleration/deceleration, a larger jitter buffer (tons of parameters), etc.
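For the subset of those knobs reachable from a web client, a hedged sketch: `jitterBufferTarget` is newer Chromium-era surface and may not exist everywhere, and the SDP munge is deliberately naive.

```ts
// Trade ~250 ms of extra latency for smoother audio under jitter and loss.
function widenJitterBuffer(pc: RTCPeerConnection): void {
  for (const receiver of pc.getReceivers()) {
    if (receiver.track.kind === "audio") {
      (receiver as any).jitterBufferTarget = 250; // ms; ignored where unsupported
    }
  }
}

// Ask Opus for in-band FEC by munging the SDP before applying it. A robust
// version would locate the Opus payload type from its rtpmap line first.
function enableOpusFec(sdp: string): string {
  return sdp.replace(/useinbandfec=\d/g, "useinbandfec=1");
}
```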

Most of the glitches I heard with OpenAI's Voice were not WebRTC-related. To my ear, they sounded more like real-time issues with their inference, which is a very different component to optimize.

splittydev · today at 8:14 AM

Amazing read. Blog posts rarely keep my attention like this one.

Aeroi · today at 2:36 AM

There are a lot of extremely smart people who have come back to WebRTC time and time again because it continues to solve problems other methods and protocols can't. That said, QUIC is certainly interesting going forward. But I primarily stream voice + vision at 1 fps, so WebRTC just makes sense, and WebSockets fail and are insecure at scale for this use case (see https://www.daily.co/videosaurus/websockets-and-webrtc/). Also, just listen to Sean in this thread; dude knows what's up.

lpln3452 · today at 1:27 AM

I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating one: simply backgrounding the app (the web version too) and resuming causes the response, or the conversation with an assigned ID, to disappear. Haha.

jongjong · today at 11:11 AM

I've long had the feeling that WebRTC was intentionally over-engineered. Over-engineered and poorly documented.

IMO, tech standards should be simple and minimal and people should be able to implement whatever they want on top. I tend to stay away from complex web standards.

nutanc · today at 5:14 AM

Most of the problems happen because we want to simulate human conversations. While that's a good goal to have, another approach is to let users know clearly that they are talking to a bot. You'd be surprised at how accommodating users can be when they know they are talking to a bot and want their queries resolved.

schappim · today at 3:47 AM

"WebRTC is the problem" is bait; his real claim is "WebRTC has annoying transport-layer characteristics that hurt cloud Voice AI scaling"...

Having just had to tackle this again for my own startup, I'm reminded about what you would lose by ditching WebRTC - the audio DSP pipeline, transmit side VAD, echo cancellation, noise suppression, NAT traversal maturity, codec integration, browser ubiquity etc.
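To make the first item concrete, here's roughly what the capture side of that DSP pipeline costs you today: one constraints object. Replicating it without the browser stack means shipping your own AEC/NS/AGC per platform.

```ts
// Platform AEC, noise suppression, and auto gain in one constraints object.
async function captureMic(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true, // AEC: keep the bot from hearing its own voice
      noiseSuppression: true,
      autoGainControl: true,
    },
  });
}
```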

elephantum · today at 5:32 AM

My biggest frustration with WebRTC was precisely captured in the article: even if you don't need P2P and your video source is a process on the same host as your browser, you have to dance around connection setup like you're on the other side of the planet.

fy20 · today at 2:43 AM

Nice fun article. Gives me Why The Lucky Stiff vibes.

gozzoo · today at 7:40 AM

I didn't understand: why is WebRTC good for Google Meet but not good for all the other conferencing apps?

hnav · today at 4:02 AM

Exactly what I thought when I read the original article, though to be fair, WebTransport is only now entering the mainstream, with Safari shipping support this year.

sam1r · today at 2:02 AM

>> ... I say hi to <strike> Scarlett Johansson <strike>

Had a nice chuckle.

molszanski · today at 4:14 AM

I remember using webrtc data channel for p2p video. Browser to browser UDP is neat :) fun memories. Thank you for the read

spongebobstoes · today at 1:31 AM

this misses a few key things but hits on many others

webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result
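An example of what that reinvention looks like: over a raw WebSocket you must design your own frame header just to get timestamps for a playout clock and to detect frames delayed by TCP head-of-line blocking, things RTP gives you for free. The header layout here is made up for illustration.

```ts
// Hand-rolled framing over a WebSocket: sequence number + capture timestamp
// ahead of each encoded audio frame.
function packFrame(seq: number, tsMs: number, opus: Uint8Array): ArrayBuffer {
  const buf = new ArrayBuffer(8 + opus.byteLength);
  const view = new DataView(buf);
  view.setUint32(0, seq);  // RTP-ish sequence number
  view.setUint32(4, tsMs); // capture timestamp, drives the playout clock
  new Uint8Array(buf, 8).set(opus);
  return buf;
}

const ws = new WebSocket("wss://example.com/audio"); // placeholder URL
ws.binaryType = "arraybuffer";
let seq = 0;
function sendAudio(opus: Uint8Array): void {
  ws.send(packFrame(seq++, performance.now() | 0, opus));
}
```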

I like the idea of MoQ, but it's not widely used. Probably worth experimenting with, especially as video enters the chat.

> and then a GPU pretends to talk to you via text-to-speech

OpenAI is speech-to-speech; there is no TTS in voice mode.

> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection

Signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I've also seen some new WebRTC extensions that should reduce setup time further.

ultimately though, it comes down to

> It’s not like LLMs are particularly responsive anyway

I expect to see a shift in how S2S models work to be lower latency, like the new voice API models that OpenAI announced.

to be fair, the new models were released the day after this MoQ blog was published

perryizgr8 · today at 9:25 AM

How is OpenAI Voice mode any different from a WhatsApp call, ignoring the part where there's a GPU on the other side instead of a human? What is the technical challenge in the voice-call portion? It seems like that has been a solved problem for a long time now.

keizo · today at 2:11 AM

Interesting read, albeit over my head. I spent half of yesterday comparing Gemini Live (WebSockets) vs. gpt-realtime-2, and while GPT is super good and seemingly more robust, Gemini connects faster.

giancarlostoro · today at 12:37 AM

Probably because WebTransport is the lesser-known alternative to WebRTC.

brcmthrowaway · today at 3:05 AM

This is interesting. Does niche knowledge in this area command a $1mn salary?

Giefo6ah · today at 1:06 AM

Yet another victim of IPv4, and you still find countless detractors of IPv6 on every thread where it's mentioned.
