> At approximately 4% word error rate on FLEURS and $0.003/min
Amazon's transcription service is $0.024 per minute, an 8x difference: https://aws.amazon.com/transcribe/pricing/
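A quick sanity check on the gap, using the two per-minute prices quoted above:

```python
# Comparing the two quoted per-minute transcription prices.
voxtral_per_min = 0.003         # Voxtral, as quoted above
aws_transcribe_per_min = 0.024  # Amazon Transcribe, from the pricing page

ratio = aws_transcribe_per_min / voxtral_per_min
hourly_voxtral = voxtral_per_min * 60
hourly_aws = aws_transcribe_per_min * 60

print(round(ratio))  # 8
print(round(hourly_voxtral, 2), round(hourly_aws, 2))  # 0.18 1.44
```

So an hour of audio runs about $0.18 versus $1.44.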
In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukrainian? Belarusian? I would understand if an American company launched this, but for a company so proud of its European roots, I think it should have better support for major European languages.
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.
Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally and it's hard to imagine there's something even better.
The other demos didn't work for me, so I made https://github.com/owenbrown/transcribe It's just a Python script to test the streaming.
Wow, Voxtral is amazing. It will be great when someone stitches this up so an LLM starts thinking, researching for you, before you actually finish talking.
Like, create a conversation partner with sub-0.5-second latency. For example, you ask it a multi-part question and, as soon as you finish talking, it gives you the answer to the first part while it looks up the rest of the answer, then stitches it together so that there's no break.
The 2-3 second latency of existing voice chatbots is a non-starter for most humans.
I noticed that this model is multilingual and understands 14 languages. For many use cases we probably only need a single language, and the extra 13 simply add latency. I believe there will be a trend in the coming years of trimming the fat off these jack-of-all-trades models.
Native diarization, this looks exciting. edit: or not, no diarization in real-time.
https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...
~9GB model.
There's no comparison to Whisper Large v3 or other Whisper models. Is it better? Worse? Why do they only compare to gpt-4o-mini-transcribe?
Incredible! Competitive with (if not better than) Deepgram Nova-3, and much better than AssemblyAI and ElevenLabs in basically all cases on our internal streaming benchmark.
The dataset is ~100 8kHz call recordings with gnarly UK accents (which I consider the final boss of English-language ASR). It seems like it's SOTA.
Where it does fall down is the latency distribution, but I'm testing against the API. Running it locally will no doubt improve that?
Very happy with all the Mistral work. I feel like I'm always one release behind theirs. Last time, when they released Mistral 3, I commented saying how excited I was to try it out [1]
Well, I'm happy to report I integrated the new Mistral 3 and have been truly astounded by the results. I'm still not a big fan of the model wrt factual information - it seems to be especially confident and especially wrong if left to its own devices - but with http://phrasing.app I do most of the data aggregation myself and just use an LLM to format it. Mistral 3 was a drop-in replacement for 3x the quality (it was already very, very good), with a 0% error rate for my use case (an issue where it occasionally went off the rails was entirely solved), and it sticks to my formatting guidelines perfectly (which even gpt-5-pro failed on). Plus it was somehow even cheaper.
I'm using Scribe v2 at the moment for STT, but I'm very excited now to try integrating Voxtral Transcribe. The language support is a little lacking for my use cases, but I can always fall back to Scribe and amortize the cost across languages. I was actually due to work on transcription in Phrasing very soon, so I guess look forward to my (hopefully) glowing review on their next HN launch! XD
Played with the demo a bit. It's really good at English, and detects language change on the fly. Impressive.
But whatever I tried, it could not recognise my Ukrainian and would default to Russian with absolutely ridiculous transcriptions. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in the training material, and zero Ukrainian. Made me really sad.
It’s nice, but the previous version wasn’t actually that great compared to Parakeet for example.
We need better independent comparison to see how it performs against the latest Qwen3-ASR, and so on.
I can no longer take at face value the cherry-picked comparisons of companies showing off their new models.
For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.
things I hate:
"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"
So, you don't mean 'try this out', you mean 'buy this product'.
Let's not act like it's a free sampler.
I can't comment on the model: I'm not giving them money.
Is there an open source Android keyboard that would support it? Everything I find is based on Whisper, which is from 2022. Ages ago given how fast AI is evolving.
Looks like this model doesn't do realtime diarization. What model should I use if I want that? So far I've only seen paid models do diarization well. I've heard about NVIDIA NeMo but haven't tried it, and I don't even know where to try it out.
The Apache 2.0 license on Realtime is the buried lede. 4B params at sub-200ms latency means you can run private transcription on-device without sending audio to anyone's servers. That's not an API improvement, it's a categorically different thing.
Is it just me, or is a 3% error rate really high?
If you transcribe a minute of conversation, you'll have around 4-5 words transcribed wrongly. In an hour-long podcast, that's close to 300 wrongly transcribed words.
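Rough math behind that estimate, assuming a typical speaking rate of about 150 words per minute:

```python
# Impact of a 3% word error rate (WER), assuming ~150 spoken words per minute.
wer = 0.03
words_per_minute = 150

errors_per_minute = wer * words_per_minute  # wrongly transcribed words per minute
errors_per_hour = errors_per_minute * 60    # over an hour-long podcast

print(errors_per_minute)  # 4.5
print(errors_per_hour)    # 270.0
```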
I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields of endeavor. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, and law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.
What's the cheapest device specs that this could realistically run on?
Italian represents, I believe, the most phonetically advanced human language. It strikes the right compromise between information density, understandability, and the ability to be spoken much faster to compensate for the redundancy. It's as if it had error correction built in. Note that it's not just that it has the lowest error rate here; it's also underrepresented in most datasets.
It performs well on Mandarin audio transcription, considering it's a European company. It's weird, though, that it keeps adding spaces between single Chinese characters and mixing traditional and simplified characters.
Ok, I guess this is the regular time for me to look for a local realtime transcription solution on Linux, and not finding anything good.
Maybe this'll get wrapped into a nice tool later.
Does anyone have any recommendations?
You know what I'd love to have? This running on my Android smartphone. Google's speech services are garbage and they LOVE to cut me off mid-sentence for no reason, well over half the time. It's maddening.
Wondering whether most AI voice agents use real-time APIs or transcription APIs. Can anyone with experience building voice agents comment?
Very nice! The thing I'm missing is turn detection: in real-time audio we need turn detection to know when the AI should speak. Unfortunately this makes it not a complete Deepgram replacement yet!
This looks great, but it's not clear to me how to use it for a practical task. I need to transcribe about 10 years worth of monthly meetings. These are government hearings with a variety of speakers. All the videos are on YouTube. What's the most practical and cost-effective way to get reasonably accurate transcripts?
What hardware resources are required for what quality/latency? Multiple high-end NVIDIA GPUs, or can you run it on your phone or an ESP32 offline? Or...
Seems like fundamental info for any model announcement. Did I just miss it? Does everyone just know except me?
Wired advertises this as "Ultra-Fast Translation"[^1]. A bit weird coming from a tech magazine. I hope it's just a "typo".
[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...
What's the best way to train this further on a specific dialect or accent or even terminology?
I can't wait for models to get small enough that they can run on commodity devices.
Hope we can build an app like Wispr Flow using this, with the model running completely on-device.
Is there some well established independent benchmark where I can easily (looking at a couple of graphs) compare all popular (especially self-hosted) transcription models?
https://www.tavus.io/post/sparrow-1-human-level-conversation...
how does it compare to sparrow-1?
3 hours for a single request sounds nice to me. Although the graph suggests it's not going to perform as well as the OpenAI model I've been using, it is open source and I will surely give it a try.
One week ago I was on the hunt for an open-source model that can do diarization, and I had to literally give up because I could not find any easy-to-use setup.
This is exciting, especially after ElevenLabs' very expensive model.
I'm guessing I won't be able to finetune this until they come out with a HF Transformers model, right?
Impressive results, tested on crappy audio files (in French and English)...
Has anyone compared to Deepgram Flux yet for realtime?
Does anyone know if there are any desktop tools I can use this transcription model with? E.g. something like Wispr Flow/WillowVoice but with custom model selection.
My struggle with VTT is always the accent: it doesn't understand my English very well because of my non-native accent.
Any chance Voxtral Mini Transcribe 2 will ever be an open model?
As a rule of thumb for software that I use regularly, I find it very useful to consider the cost over a 10-year period in order to compare it with software I purchase once to install at home. That works out to $1,798.80 for the Pro version.
What estimates do others use?
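For anyone checking the arithmetic: the 10-year figure above is consistent with a monthly price of $14.99 for the Pro tier (that monthly rate is inferred from the quoted total, not an official price).

```python
# Back-of-the-envelope: a subscription's 10-year cost from its monthly price.
# $14.99/month is inferred from the $1,798.80 figure above, not a quoted price.
monthly_price = 14.99
ten_year_cost = monthly_price * 12 * 10

print(round(ten_year_cost, 2))  # 1798.8
```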
I added it to my bot agent, let's see how it performs.
Cannot wait to try it on Spokenly
Nice. Can this be run on a mobile device?
Smells Like Teen Spirit survives another challenge!
Voxtral Transcribe 2:
Light up our guns, bring your friends, it's fun to lose and to pretend. She's all the more selfish, sure to know how the dirty world. I wasn't what I'd be best before this gift I think best A little girl is always been Always will until again Well, the lights out, it's a stage And we are now entertainers. I'm just stupid and contagious. And we are now entertainers. I'm a lot of, I'm a final. I'm a skater, I'm a freak. Yeah! Hey! Yeah. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind I know, I know, I know, I know, I know Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd.
Google/Musixmatch:
Load up on guns, bring your friends It's fun to lose and to pretend She's over-bored, and self-assured Oh no, I know a dirty word Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido, yeah Hey, yey I'm worse at what I do best And for this gift, I feel blessed Our little group has always been And always will until the end Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido, yeah Hey, yey And I forget just why I taste Oh yeah, I guess it makes me smile I found it hard, it's hard to find Oh well, whatever, never mind Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello, how low? Hello, hello, hello With the lights out, it's less dangerous Here we are now, entertain us I feel stupid and contagious Here we are now, entertain us A mulatto, an albino A mosquito, my libido A denial, a denial A denial, a denial A denial, a denial A denial, a denial A denial
Really cool.
This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...
Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.
I spoke fast and dropped in some jargon, and it transcribed it exactly right, WebAssembly spelling included:
> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?