This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...
Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?
404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).
Thank you for the link! Mistral's own playground does not have a microphone; it only accepts file uploads, which doesn't demonstrate the speed and accuracy, but the link you shared does.
I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.
Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".
It is quite impressive.
I have seen the same impressive performance about 7 months ago here: https://kyutai.org/stt
If I look at the architecture of Voxtral 2, it seems to take a page from Kyutai’s delayed stream modeling.
The reason the delay is configurable is that you can delay the stream by a variable number of audio tokens. Each audio token is 80 ms of audio: it is converted to a spectrogram, fed to a convnet, and passed through a transformer audio encoder. The resulting audio embedding is then fed, with a history of one audio embedding per 80 ms, into a text transformer, which outputs a text embedding that is converted to a text token (also worth 80 ms; a special [STREAMING_PAD] token is emitted when no word is ready to be produced).
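As I understand that description, the per-frame decoding loop could be sketched roughly like this. Everything here is hypothetical and illustrative (the function names, the pad string, the callback shape are mine, not Voxtral's or Kyutai's actual API); it just shows how a configurable delay trades latency for lookahead context:

```python
# Rough sketch of delayed-stream decoding, assuming one text token is
# emitted per 80 ms audio frame after an initial delay of N frames.
# All names are hypothetical, not the real Voxtral/Kyutai API.

FRAME_MS = 80            # each audio token covers 80 ms of audio
STREAMING_PAD = "<pad>"  # stand-in for the special no-word token

def transcribe_stream(audio_frames, delay_frames, decode_step):
    """Decode one text token per frame, delayed by `delay_frames`
    frames so the model sees delay_frames * 80 ms of future context."""
    history = []  # running history: one audio embedding per 80 ms frame
    out = []
    for i, frame in enumerate(audio_frames):
        history.append(frame)        # stand-in for the encoder output
        if i < delay_frames:
            continue                 # still filling the lookahead buffer
        token = decode_step(history) # one text-transformer step
        if token != STREAMING_PAD:   # pad token means "no word yet"
            out.append(token)
    return out
```

The key point, if this reading is right, is that the delay is just how many frames the text stream lags the audio stream, so a larger delay buys accuracy at the cost of delay_frames * 80 ms of latency.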
There is no cross-attention in either Kyutai's STT or Voxtral 2, unlike Whisper's encoder-decoder design!
It can transcribe Eminem's Rap God fast sequence, really, really impressive.
This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.
Wow, that’s weird. I tried Bengali, but the text was transcribed into Hindi! I know the two languages share some similar words, but I used pure Bengali that is not similar to Hindi.
I can't get that demo to work. Tried with both Firefox and Chrome.
I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything right: jargon, capitalization, all of it. Now I’m looking forward to doing that with 100% local inference!
Doesn’t seem to work in Safari on iOS 26.2, iPhone 17 Pro, just about anything extra disabled.
It's really nice, although I got a sentence in French when I was speaking Italian, but I had corrected myself in the middle of a word.
But I'm definitely going to keep an eye on this for local-only STT for Home Assistant.
Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.
This is where European multilingual intelligence truly shines!
Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.
And open weight too! So grateful for this.