Responding to some technical points first, but then after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc is going though.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
This is the opposite of the feedback I get. Users want instant responses. If you have delay in generating responses/interruptions it kills the magic. You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)
> TTS is faster than real-time
https://research.nvidia.com/labs/adlr/personaplex/ Voice AI for the latest/aspirational is moving away from what the author describes. It is trickled in/out at 20ms
> We really hope the user’s source IP/port never changes, because we broke that functionality.
That is supported. When new IP for ufrag comes in its supported
> It takes a minimum of 8* round trips (RTT)
That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/
> I’d just stream audio over WebSockets
You lose stuff like AEC. You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with Realtime API + web sockets (lots of code and having to do stuff by hand)
----
I think if I had my choice I would pick Offer/Answer model and then doing QUIC instead of DTLS+SCTP. Maybe do RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers clients) with a much large code footprint.
> > …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
> This is the opposite of the feedback I get. Users want instant responses.
I am skeptical that you are getting feedback that users prefer instant wrong results to 200ms-lag correct results.
Deeply skeptical!
> . You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily.
WebRTC is complex, even if it's a library (even if it's a library built into the browser they're already using). For a client/server voice interaction, I don't see why you would willingly use it. Ship voice samples over something else; maybe borrow some jitter buffer logic for playback.
My job currently involves voice and video conferencing and 1:1 calls, and WebRTC is so much complexity... it got our product going quickly, but when it does unreasonable things, it's a challenge to fix it; even though we fork it for our clients.
I could write an enormous rant about TURN [1]. But all of the webrtc protocol suite is designed for an internet that doesn't exist.
[1] Turn should allocate a rendesvous id rather than an ephemeral port when the turn client requests an allocation. Then their peer would connect to the turn server on the service port and request a connection to the rendesvous id, without needing the client to know the peer address and add a permission. It would require less communication to get to an end to end relayed connection. Advanced clusters could encode stuff in the id so the client and peer could each contact a turn server local to them and the servers could hook things up; less advanced clusters would need to share the turn server ip and service port(s) with the id.
> You also don't want to send faster than real-time. If the user interrupts the model you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds)
You only need to send ~1 second at a time. There's no reason to send 20ms or 10 min at a time. Both are stupid.
Delivery of first phoneme and delivery of the important information don't have to be coupled. Politicians on TV get very good at this particular trick, they've got a set of stock phrases which basically fill time while their brain gets in gear. We just need something to fill the gap so our System 1 doesn't lose confidence in the interaction.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
I disagree with this SO strongly. I find the conversational voice mode to be a game changer because you can actually have an almost normal conversation with it. I'd be thrilled if they could shave off another 50-100ms of latency, and I might stop using it if they added 200ms. If I want deep research I'll use text and carefully compose my prompt; when I'm out and about I want to have a conversation with the Star Trek computer.
Interestingly I'm involved with a related effort at a different tech company and when I voiced this opinion it was clear that there was plenty of disagreement. This still surprises me since it seems so obvious to me that conversational fluidity is the number one most important feature.
> This is the opposite of the feedback I get. Users want instant responses.
Did they really say they prefer fast response over accurate repsonse?
HELLO MR SEAN,
1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency vs quality, but WebRTC makes that knob difficult to turn.
2. I'm obviously not an TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the time number increments?
3. Yeah, sometimes the client is aware when their IP changes and can do an ICE renegotiation. But often they aren't aware, and normally would rely on the server detecting the change, but that's not possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.
4. Okay, that draft means 7 RTTs instead of 8 RTTs? Again some can be pipelined so the real number is a bit lower. But like the real issue is the mandatory signaling server which causes a double TLS handshake just in case P2P is being used.
5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.
I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.
The main shortcoming with RoQ/MoQ is that we can't implement GCC because QUIC is congestion controlled (including datagrams). We're stuck with cubic/BBR when sending from the browser for now.