I tried the demo and it looks like you have to click Mic, then record your audio, then click "Stop and transcribe" in order to see the result.
Is it possible to rig this up so it really is realtime, displaying the transcription within a second or two of the user saying something out loud?
The Hugging Face server-side demo at https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim... manages that, but it's using a much larger (~8.5GB) server-side model running on GPUs.
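For anyone wanting to experiment: one way to approximate realtime without true streaming support in the model is to keep a rolling audio buffer and re-transcribe it on a short timer, replacing the displayed text each pass. A minimal TypeScript sketch of that loop, where the worker script and its "transcribe" message are hypothetical stand-ins for whatever the demo actually exposes:

    // Pseudo-realtime loop: record mic audio in 1 s chunks, then
    // periodically re-transcribe everything captured so far.
    // The worker and its message protocol are assumptions, not
    // this demo's real API.
    const worker = new Worker("transcriber.js"); // hypothetical script

    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);
    const chunks: Blob[] = [];
    recorder.ondataavailable = (e) => chunks.push(e.data);
    recorder.start(1000); // emit a chunk every second

    setInterval(async () => {
      // Concatenated MediaRecorder chunks form a valid container; the
      // worker is assumed to decode it to PCM. A real implementation
      // would cap the window (e.g. last 30 s) to bound latency.
      const blob = new Blob(chunks, { type: recorder.mimeType });
      const bytes = await blob.arrayBuffer();
      worker.postMessage({ type: "transcribe", bytes }, [bytes]);
    }, 1500);

    worker.onmessage = (e) => {
      document.querySelector("#output")!.textContent = e.data.text;
    };

The tradeoff is ever-growing rework as the buffer lengthens, which is why capped sliding windows plus stitching at chunk boundaries are the usual refinement.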
Kudos, this is where it's at: open models running on-premise. Preferred by users and businesses. Glad Mistral's got that figured out.
Naive, semi-related question: what is the state of stuff like Mistral when compared to OpenAI, Anthropic, etc?
Could I reasonably use this to get LLM-capability privately on a machine (and get decent output), or is it still in the "yeah it does the thing, but not as well as the commercial stuff" category?
Awesome work! Would be good to have it work with handy.computer. Also, are there plans to support streaming?
It's cool but do I really want a single browser tab downloading 2.5 GB of data and then just leaving it to be ephemerally deleted? I know the internet is fast now and disk space is cheap but I have trouble bringing myself around to this way of doing things. It feels so inefficient. I do like the idea of client-side compute, but I feel like a model (or anything) this big belongs on the server.
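For what it's worth, the download doesn't have to be ephemeral: the page could persist the weights via the Cache Storage API (or OPFS) so a second visit reads from disk. A rough sketch, assuming stable shard URLs (the MODEL_SHARDS list below is made up):

    // Fetch model shards through the Cache Storage API so repeat
    // visits hit disk instead of the network. URLs are hypothetical.
    const MODEL_SHARDS = [
      "/models/voxtral/shard-0.bin",
      "/models/voxtral/shard-1.bin",
    ];

    async function fetchShardCached(url: string): Promise<ArrayBuffer> {
      const cache = await caches.open("voxtral-weights-v1");
      let res = await cache.match(url);
      if (!res) {
        res = await fetch(url);
        // Clone before caching: a Response body can only be read once.
        await cache.put(url, res.clone());
      }
      return res.arrayBuffer();
    }

    const shards = await Promise.all(MODEL_SHARDS.map(fetchShardCached));

The browser can still evict this under storage pressure, though navigator.storage.persist() can be used to request durability.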
I don't know anything about these models, but I've been trying Nvidia's Parakeet and it works great. For a model like this, where the full version is 9 GB, do you have to keep it loaded into GPU memory at all times for it to really work in realtime? Or what's the delay like to load all the weights each time you want to use it?
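The usual answer, at least for the browser case, is to pay the load cost once and keep the weights resident in a long-lived worker, so only the first request is slow. A sketch of that pattern; the Model interface, loadModel, and the message shape are stand-ins, not any real library's API:

    // worker.ts: load weights on the first request, reuse thereafter.
    interface Model {
      transcribe(pcm: Float32Array): Promise<string>;
    }

    // Stand-in loader; a real one would parse weights and init the GPU.
    async function loadModel(url: string): Promise<Model> {
      const bytes = await fetch(url).then((r) => r.arrayBuffer());
      return {
        transcribe: async (_pcm) => `stub model (${bytes.byteLength} bytes)`,
      };
    }

    let modelPromise: Promise<Model> | null = null;

    self.onmessage = async (e: MessageEvent<{ pcm: Float32Array }>) => {
      // The first message triggers the slow load; later ones reuse it.
      modelPromise ??= loadModel("/models/voxtral-q4.bin");
      const model = await modelPromise;
      self.postMessage({ text: await model.transcribe(e.data.pcm) });
    };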
Look, I think it's great that it runs in the browser and all, but I don't want to live in a world where it's normal for a website to download 2.5 GB in the background to run something.
hm, seems broken on my machine (Firefox, Asahi Linux, M1 Pro). I said hello into the mic, and it churned for a minute or so before giving me:
panorama panorama panorama panorama panorama panorama panorama panorama� molest rist moundothe exh� Invothe molest Yan artist��������� Yan Yan Yan Yan Yanothe Yan Yan Yan Yan Yan Yan Yan
Neat, and neat to see the burn framework getting used. I tried this on latest Chromium, but my system froze until my OS killed Chromium. My VPN connection died right after downloading the model too. (it doesn't have a bandwidth cap either, so I'm not sure what's happening)
This stuff is cool. So is whisper. But I keep hoping for something that can run close to real time on a Raspberry Pi Zero 2 with a reasonable English vocabulary.
Right now everything is either archaic or requires too much RAM. CPU isn't as big of an issue as you'd think, because the Pi Zero 2 is comparable to a Pi 3.
Nice!
I'm interested in your cubecl-wgpu patches. I've been struggling to get lower-than-FP32 safetensors models working on burn. Did you write the patches to cubecl-wgpu to get around this restriction, to add support for GGUF files, or both?
I've been working on something similar, but for whisper and as a library for other projects: https://github.com/Scronkfinkle/quiet-crab
I wonder if there's a metric or measure of how much jargon goes into a README or other document.
Reading the first three sentences of this README: of the 43 words, I would consider 15 terms to be jargon incomprehensible to the layman.
For those exploring browser STT, this sits in an interesting space between Whisper.wasm and the Deepgram KC client. The 2.5GB quantized footprint is notably smaller than most Whisper variants — any thoughts on accuracy tradeoffs compared to Whisper base/small?
Just curious, is there any smaller version of this model capable of running on edge devices? Even my M1 Mac with 8 GB of RAM couldn't run the C version.
Ugh. I had just started working on this. Congratulations to the author!
Man, I'd love to fine-tune this, but alas the Hugging Face implementation isn't out as far as I can tell.
(no speech detected)
Or, when not saying anything at all, it generates random German sentences.
I just tried it. I said "what's up buddy, hey hey stop" and it transcribed this for me: "وطبعا هاي هاي هاي ستوب" (roughly "and of course hey hey hey stop", transliterated into Arabic script). No, I'm not in any Arabic or Middle Eastern country. The second test was better; it detected English.
Impressive, but to state the obvious, this is not yet practical for browser use due to its (at least) 2.5 GB memory footprint.
Notably, this isn't even close to realtime, and that's on an M4 Max.
>init failed: Worker error: Uncaught RuntimeError: unreachable
Anything I can do to fix/try it on Brave?
If folks are interested, @antirez has open-sourced a C implementation of Voxtral Mini 4B here: https://github.com/antirez/voxtral.c
I have my own fork here: https://github.com/HorizonXP/voxtral.c where I’m working on a CUDA implementation, plus some other niceties. It’s working quite well so far, but I haven’t got it to match Mistral AI’s API endpoint speed just yet.