This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...
Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?
404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).
Thank you for the link! Mistral's own playground does not have a microphone; it only accepts file uploads, which doesn't demonstrate the speed and accuracy, but the link you shared does.
I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.
Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".
It is quite impressive.
I have seen the same impressive performance about 7 months ago here: https://kyutai.org/stt
If I look at the architecture of Voxtral 2, it seems to take a page from Kyutai’s delayed stream modeling.
The reason the delay is configurable is that you can delay the stream by a variable number of audio tokens. Each audio token is 80 ms of audio: it is converted to a spectrogram, fed to a convnet, and passed through a transformer audio encoder. The resulting audio embedding is then fed, with a history of one audio embedding per 80 ms, into a text transformer, which outputs a text embedding that is converted to a text token (also worth 80 ms; a special [STREAMING_PAD] token is emitted when no word is ready to be produced).
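As I understand that description, the per-frame decoding loop could be sketched roughly like this. Everything here is hypothetical and illustrative (the function names, the pad string, the callback shape are mine, not Voxtral's or Kyutai's actual API); it just shows how a configurable delay trades latency for lookahead context:

```python
# Rough sketch of delayed-stream decoding, assuming one text token is
# emitted per 80 ms audio frame after an initial delay of N frames.
# All names are hypothetical, not the real Voxtral/Kyutai API.

FRAME_MS = 80            # each audio token covers 80 ms of audio
STREAMING_PAD = "<pad>"  # stand-in for the special no-word token

def transcribe_stream(audio_frames, delay_frames, decode_step):
    """Decode one text token per frame, delayed by `delay_frames`
    frames so the model sees delay_frames * 80 ms of future context."""
    history = []  # running history: one audio embedding per 80 ms frame
    out = []
    for i, frame in enumerate(audio_frames):
        history.append(frame)        # stand-in for the encoder output
        if i < delay_frames:
            continue                 # still filling the lookahead buffer
        token = decode_step(history) # one text-transformer step
        if token != STREAMING_PAD:   # pad token means "no word yet"
            out.append(token)
    return out
```

The key point, if this reading is right, is that the delay is just how many frames the text stream lags the audio stream, so a larger delay buys accuracy at the cost of delay_frames * 80 ms of latency.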
There is no cross-attention in either Kyutai's STT or Voxtral 2, unlike Whisper's encoder-decoder design!
It can transcribe Eminem's Rap God fast sequence, really, really impressive.
This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.
Wow, that’s weird. I tried Bengali, but the text was transcribed into Hindi! I know the two languages share some similar words, but I used pure Bengali that is not similar to Hindi.
I can't get that demo to work. Tried with both Firefox and Chrome.
I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything right: jargon, capitalization, all of it. Now I’m looking forward to doing that with 100% local inference!
Doesn’t seem to work in Safari on iOS 26.2, iPhone 17 Pro, just about anything extra disabled.
It's really nice, although I got a sentence in French when I was speaking Italian, but I had corrected myself in the middle of a word.
But I'm definitely going to keep an eye on this for local-only STT for Home Assistant.
Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.
This is where European multilingual intelligence truly shines!
Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.
And open weight too! So grateful for this.