regularfry • yesterday at 12:39 PM

There's a minimum possible latency just given the structure of language and how humans process phonemes. Spoken language isn't quite unambiguously causal, so there's a limit to how far you can go for a given accuracy. I don't know where the efficiency curve is, though. It wouldn't surprise me if 100 ms was pushing it.

Replies

moffkalast • yesterday at 12:58 PM

Yeah, the metric would be the total processing latency after that. I've found that VAD (voice activity detection) is honestly harder to get right than STT (speech-to-text), and if it fails, STT only gets garbage to process. Even humans sometimes have trouble figuring out exactly when someone is done talking.
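
To make the end-of-speech problem concrete, here is a minimal sketch of an energy-based endpointer. It is not taken from any particular VAD library: the function names, the RMS threshold of 0.02, and the "hangover" of consecutive quiet frames are all assumptions chosen for illustration. The speaker is declared done only after `hangover_frames` quiet frames in a row, so that parameter directly trades added latency against the risk of cutting someone off during a short pause.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_end_of_speech(frames, threshold=0.02, hangover_frames=10):
    """Return the index of the frame where the utterance is declared over.

    Hypothetical helper: a frame counts as speech if its RMS energy is at
    least `threshold`. Once speech has started, the endpoint is declared
    after `hangover_frames` consecutive quiet frames. Returns None if no
    endpoint is found.
    """
    in_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            in_speech = True
            quiet_run = 0  # any speech frame resets the silence counter
        elif in_speech:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return i
    return None


# Synthetic input: assume 20 ms frames at 16 kHz (320 samples per frame).
loud = [0.5] * 320
quiet = [0.0] * 320
# Silence, speech, a short 3-frame pause, more speech, then trailing silence.
frames = [quiet] * 5 + [loud] * 20 + [quiet] * 3 + [loud] * 10 + [quiet] * 30

patient = detect_end_of_speech(frames, hangover_frames=10)
hasty = detect_end_of_speech(frames, hangover_frames=2)
```

With 20 ms frames, a hangover of 10 frames adds 200 ms of waiting before anything can be handed to STT, while a hangover of 2 frames ends the utterance inside the short mid-sentence pause and drops the rest of the speech. A plain energy threshold also breaks down with background noise, which is part of why real VADs are harder than this sketch suggests.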
