Local inference is definitely the way to go here. Latency is extremely noticeable when you're talking to an embodied robot, though, and pauses feel far worse in voice chat than in text chat.
It’s something I’m exploring - stay tuned :)