Serenacula • yesterday at 8:34 AM • 1 reply

This is really cool. What I really wanna see, though, is a full multimodal text-and-speech model that can dynamically handle tasks like looking up facts or using text-based tools while keeping the conversation going with you.

Replies

sigmoid10 • yesterday at 8:43 AM

OpenAI has been offering this for a while now, with text and raw audio input and output, and even function calling. Google and xAI also offer similar models by now; only Anthropic still relies on intermediate TTS/STT engines. Unfortunately, the open-weight side is still lagging behind on this kind of model.
