
raw_anon_1111, today at 12:42 AM

The way voice assistants work, even in the age of LLMs, is:

Voice -> speech to text -> LLM to determine intent -> JSON -> API call -> response -> LLM -> text to speech.

TTFT (time to first token) is irrelevant here: you have to process everything through the pipeline before you can generate a response, so total latency is what the caller feels. A fast model is more important than a good model.
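To make the point concrete, here is a rough sketch of one conversational turn. Every function name, the intent schema, and the canned balance are made up for illustration; a real system would call an actual speech-to-text engine, LLM, backend API, and text-to-speech engine at each step. What it tries to show is that the intent JSON is useless until the model has finished emitting it, so a fast first token buys you nothing at that stage.

```python
import json

# All stage functions below are hypothetical stubs, not a real library's API.

def speech_to_text(audio: bytes) -> str:
    # Stub: pretend the "audio" is already a UTF-8 transcript.
    return audio.decode("utf-8")

def llm_extract_intent(transcript: str) -> str:
    # Stub for the first LLM call. It returns JSON as text, like a model would.
    if "balance" in transcript.lower():
        return json.dumps({"intent": "check_balance", "account": "checking"})
    return json.dumps({"intent": "unknown"})

def call_api(intent: dict) -> dict:
    # Stub for the external system (CRM, billing, etc.).
    if intent["intent"] == "check_balance":
        return {"balance": 42.50}
    return {"error": "unsupported intent"}

def llm_phrase_reply(result: dict) -> str:
    # Stub for the second LLM call that turns the API result into a sentence.
    if "balance" in result:
        return f"Your balance is ${result['balance']:.2f}."
    return "Sorry, I can't help with that."

def text_to_speech(text: str) -> bytes:
    # Stub: pretend the UTF-8 bytes are synthesized audio.
    return text.encode("utf-8")

def is_parseable(json_text: str) -> bool:
    # A partially generated JSON document cannot be acted on.
    try:
        json.loads(json_text)
        return True
    except json.JSONDecodeError:
        return False

def handle_turn(audio: bytes) -> bytes:
    # Each stage needs the complete output of the one before it.
    transcript = speech_to_text(audio)
    intent = json.loads(llm_extract_intent(transcript))  # needs the whole JSON
    result = call_api(intent)
    reply = llm_phrase_reply(result)
    return text_to_speech(reply)
```

Calling `handle_turn(b"What is my balance?")` walks all five stages in order, and `is_parseable('{"intent": "check_')` is false because a half-generated intent cannot be parsed, let alone sent to an API.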

Source: I do this kind of stuff for call centers. Yes, I know modern LLMs don’t have to go through voice -> text -> LLM -> text -> voice anymore, but that only works when you don’t have to call external sources.