
Show HN: Moonshine Open-Weights STT models – higher accuracy than Whisper Large v3

308 points by petewarden yesterday at 9:54 PM | 71 comments

I wanted to share our new speech-to-text models, and the library to use them effectively. We're a small startup (six people, sub-$100k monthly GPU budget), so I'm proud of the work the team has done to create streaming STT models with lower word-error rates than OpenAI's largest Whisper model. Admittedly Large v3 is a couple of years old, but we're near the top of the HF OpenASR leaderboard, even up against Nvidia's Parakeet family. Anyway, I'd love to get feedback on the models and software, and hear about what people might build with it.
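To make "streaming" concrete, here's a minimal sketch of the chunked-feed loop a streaming STT model sits behind. `StubTranscriber` is a stand-in that fakes decoding, not the actual moonshine-voice API:

```python
# Minimal sketch of a chunked streaming-STT loop.
# StubTranscriber stands in for a real model; the real library's API differs.

class StubTranscriber:
    """Pretends to transcribe: accumulates chunks, emits a growing hypothesis."""
    def __init__(self):
        self.words = []

    def feed(self, chunk):
        # A real model would decode audio; we just append a fake word per chunk.
        self.words.append(f"word{len(self.words)}")
        return " ".join(self.words)  # partial hypothesis so far

def stream_transcribe(chunks, transcriber):
    """Feed audio chunks one at a time, yielding partial hypotheses."""
    for chunk in chunks:
        yield transcriber.feed(chunk)

if __name__ == "__main__":
    fake_audio = [b"\x00" * 3200] * 3  # three 100 ms chunks at 16 kHz / 16-bit
    for partial in stream_transcribe(fake_audio, StubTranscriber()):
        print(partial)
```

The point is the shape: partial hypotheses arrive per chunk instead of one transcript at the end, which is what makes the UX difference people describe below.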


Comments

Karrot_Kream yesterday at 11:57 PM

According to the OpenASR Leaderboard [1], looks like Parakeet V2/V3 and Canary-Qwen (a Qwen finetune) handily beat Moonshine. All 3 models are open, but Parakeet is the smallest of the 3. I use Parakeet V3 with Handy and it works great locally for me.

[1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

T0mSIlver today at 9:48 AM

Congrats on the results. The streaming aspect is what I find most exciting here.

I built a macOS dictation app (https://github.com/T0mSIlver/localvoxtral) on top of Voxtral Realtime, and the UX difference between streaming and offline STT is night and day. Words appearing while you're still talking completely changes the feedback loop. You catch errors in real time, you can adjust what you're saying mid-sentence, and the whole thing feels more natural. Going back to "record then wait" feels broken after that.

Curious how Moonshine's streaming latency compares in practice. Do you have numbers on time-to-first-token for the streaming mode? And on the serving side, do any of the integration options expose an OpenAI Realtime-compatible WebSocket endpoint?

francislavoie today at 1:40 AM

I've helped many Twitch streamers set up https://github.com/royshil/obs-localvocal to plug transcription & translation into their streams, mainly for German audio to English subtitles.

I'd love a faster and more accurate option than Whisper, but streamers need something off-the-shelf they can install in their pipeline, like an OBS plugin which can just grab the audio from their OBS audio sources.

I see a couple of obvious problems: this doesn't seem to support translation, which is unfortunate since that's pretty key for this use case. It also only supports one language at a time, which is problematic given how frequently streamers code-switch while talking to their chat in different languages, or on Discord with their gameplay partners. Maybe such a plugin could detect which language is being spoken and route to one model or the other as needed?
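The routing idea sketched above could look something like this. Both `detect_language` and the model callables are stubs (a real plugin would use an actual language-ID model and real STT backends):

```python
# Rough sketch of per-language model routing for code-switching audio.
# detect_language and the model callables are hypothetical stand-ins.

def detect_language(chunk):
    """Stub language ID: a real implementation would classify the audio."""
    return chunk.get("lang", "en")

def route(chunk, models):
    """Pick the model for the detected language, falling back to English."""
    lang = detect_language(chunk)
    model = models.get(lang, models["en"])
    return model(chunk["audio"])

models = {
    "en": lambda audio: f"en:{audio}",
    "de": lambda audio: f"de:{audio}",
}

print(route({"lang": "de", "audio": "hallo"}, models))  # de:hallo
```

Running language ID per chunk keeps the routing decision local, so a mid-stream switch to another language just picks up the other model on the next chunk.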

Ross00781 today at 6:46 PM

Open-weight STT models hitting production-grade accuracy is huge for privacy-sensitive deployments. Whisper was already impressive, but having competitive alternatives means we're not locked into a single model family. The real test will be multilingual performance and edge device efficiency—has anyone benchmarked this on M-series or Jetson?

heftykoo today at 3:02 AM

Claiming higher accuracy than Whisper Large v3 is a bold opening move. Does your evaluation account for Whisper's notorious hallucination loops during silences (the classic 'Thank you for watching!'), or is this purely based on WER on clean datasets? Also, what's the VRAM footprint for edge deployments? If it fits on a standard 8GB Mac without quantization tricks, this is huge.
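One common mitigation for the silence-hallucination problem, independent of which model you use: gate chunks on RMS energy so near-silence never reaches the decoder at all. A minimal sketch (the 0.01 threshold is arbitrary; real setups use a proper VAD):

```python
import math

def rms(samples):
    """Root-mean-square energy of a chunk of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def gate(chunks, threshold=0.01):
    """Drop near-silent chunks so the STT model never sees them."""
    return [c for c in chunks if rms(c) >= threshold]

loud = [0.5, -0.5, 0.4, -0.4]
quiet = [0.001, -0.001, 0.0, 0.0]
print(len(gate([loud, quiet])))  # only the loud chunk survives
```

An energy gate is crude (it drops quiet speech too), which is why evaluations on clean datasets can hide the hallucination behaviour the comment asks about.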

asqueella yesterday at 11:53 PM

For those wondering about language support: English, Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, and Vietnamese are currently available (most in the Base size, 58M params).

fareesh today at 12:43 AM

"Accuracy" is often presumed to mean English, which is fine, but "higher" is vague on its own: higher in English only? Higher in some subset of languages? Which ones?

The minimum useful data for this is a small table of language | WER | dataset.
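For anyone wanting to build such a table themselves, WER is just word-level edit distance over reference length. A self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words
```

Leaderboards typically also normalize text (casing, punctuation) before scoring, which this sketch skips; that normalization choice alone can move WER noticeably.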

ac29 yesterday at 11:28 PM

No idea why 'sudo pip install --break-system-packages moonshine-voice' is the recommended way to install on a Raspberry Pi.

The authors do acknowledge this, though, and give a slightly too complex way to do it with uv in an example project (FYI, you don't need to source anything if you use uv run).

nmstoker today at 12:39 AM

Any plans regarding JavaScript support in the browser?

There was a demo at one point, but it's missing now. I can't recall for sure, but I think I got it working locally myself too, then it broke unexpectedly and I never managed to find out why.

RobotToaster today at 5:47 AM

> Models for other languages are released under the Moonshine Community License, which is a non-commercial license.

Weird to only release English as open weights.

dagss today at 5:52 AM

Very exciting stuff!

    hear about what people might build with it
My startup is making software for firefighters to use during missions on tablets, and I'm excited to see (when I get the time) whether we can use this as a keyboard alternative on the device. It's a setting where avoiding "clunky" is important, and a perfect use case for speech-to-text.

Due to the sector being increasingly worried about "hybrid threats", we try to rely on the cloud as little as possible and run things either on device or with the possibility of being self-hosted/on-premise. I really like the direction your company is going in this respect.

We'd probably need custom training -- we need Norwegian, and there's some lingo, e.g., "bravo one two" should become "B-1.2". While that can perhaps also be done with simple post-processing rules, we would probably also want such examples in training for improved recognition. We have no VC funding, but I'm looking forward to getting some income so that we can send some of it in your direction :)
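The post-processing-rule route mentioned above is cheap to prototype. A sketch of the "bravo one two" -> "B-1.2" rewrite (the tables and the exact output convention just mirror the example in the comment; a real deployment would cover the full NATO alphabet and Norwegian number words):

```python
import re

# NATO letter -> initial; extend as needed.
LETTERS = {"alpha": "A", "bravo": "B", "charlie": "C", "delta": "D"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def collapse_callsigns(text):
    """Rewrite 'bravo one two' style phrases as 'B-1.2'."""
    letter = "|".join(LETTERS)
    digit = "|".join(DIGITS)
    pattern = re.compile(rf"\b({letter}) ({digit}) ({digit})\b", re.IGNORECASE)

    def repl(m):
        return (f"{LETTERS[m.group(1).lower()]}-"
                f"{DIGITS[m.group(2).lower()]}.{DIGITS[m.group(3).lower()]}")

    return pattern.sub(repl, text)

print(collapse_callsigns("send bravo one two to the north side"))
# send B-1.2 to the north side
```

Rules like this catch the clean cases; the comment's point stands that training examples would still help recognition of the spoken forms in the first place.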

guerython today at 3:15 AM

Nice work. One metric I’d really like to see for streaming use cases is partial stability, not just final WER.

For voice agents, the painful failure mode is partials getting rewritten every few hundred ms. If you can share it, metrics like median first-token latency, real-time factor, and "% partial tokens revised after 1s / 3s" on noisy far-field audio would make comparisons much more actionable.

If those numbers look good, this seems very promising for local assistant pipelines.
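One hypothetical way to operationalize the "% partial tokens revised" metric described above: compare each partial hypothesis against the previous one and count emitted word positions that later change.

```python
def revision_rate(partials):
    """Fraction of emitted tokens later revised across successive partial
    hypotheses. 0.0 means fully stable partials."""
    emitted = revised = 0
    prev = []
    for hyp in partials:
        words = hyp.split()
        # Count previously emitted positions whose word changed.
        for a, b in zip(prev, words):
            if a != b:
                revised += 1
        emitted += max(len(words) - len(prev), 0)
        prev = words
    return revised / emitted if emitted else 0.0

stable = ["the", "the cat", "the cat sat"]
flappy = ["a", "the cat", "the bat sat"]
print(revision_rate(stable))  # 0.0
print(revision_rate(flappy))
```

A time-bucketed variant (revised after 1s / 3s, as the comment suggests) would additionally track when each position was first emitted, but the core comparison is the same.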

sourcetms today at 8:06 AM

I've added support for this in Resonant; it's already set up and running this week.

It's incredible for a live transcription stream - the latency is WOW.

https://www.onresonant.com/

For the open source folks, that's also set up in Handy, I think.

fudged71 today at 6:00 PM

If it's using ONNX, can this be ported to Transformers.js?

armcat yesterday at 11:26 PM

This is awesome, well done guys. I'm gonna try it as the ASR component in the local voice assistant I've been building: https://github.com/acatovic/ova. The tiny streaming latencies you show look insane.

fittingopposite today at 1:10 PM

Which programs support it for streaming? I'm currently using Spokenly and Parakeet, but would like to transition to a model that streams instead of transcribing chunk-wise.

Ross00781 today at 6:44 AM

The streaming architecture looks really promising for edge deployments. One thing I'm curious about: how does the caching mechanism handle multiple concurrent audio streams? For example, in a meeting transcription scenario with 4-5 speakers, would each stream maintain its own cache, or is there shared state that could create bottlenecks?
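For what it's worth, the usual answer to this kind of question is per-stream state with no shared decode cache. A sketch of that isolation pattern (this illustrates the general design, not Moonshine's actual implementation):

```python
from collections import defaultdict, deque

class PerStreamCache:
    """Independent rolling context per stream id; no shared decode state."""
    def __init__(self, max_chunks=32):
        self.max_chunks = max_chunks
        self._caches = defaultdict(lambda: deque(maxlen=self.max_chunks))

    def feed(self, stream_id, chunk):
        """Append a chunk to this stream's cache and return its context."""
        self._caches[stream_id].append(chunk)
        return list(self._caches[stream_id])  # this stream's context only

caches = PerStreamCache()
caches.feed("speaker-1", "a1")
caches.feed("speaker-2", "b1")
print(caches.feed("speaker-1", "a2"))  # ['a1', 'a2'] -- speaker-2 unaffected
```

With state keyed by stream id, the only shared bottleneck left is compute, not cache coherence; the bounded deque also caps memory per speaker.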

999900000999 today at 12:41 AM

Very cool. Any way to run this in WebAssembly? I have a project in mind.

regularfry today at 11:39 AM

Oh this is fantastic. I'm most interested to see if this reaches down to the raspberry pi zero 2, because that's a whole new ballgame if it does.

dSebastien today at 11:01 AM

I've been using Moonshine since V1 and the results are really great. I'd say on par with Parakeet V3 while working really well with CPU only.

binome today at 7:29 AM

I vibe-trained moonshine-tiny on amateur radio morse code last weekend, and was surprised at the ~2% CER I was seeing in evals; over-the-air performance was pretty acceptable for a couple-hour run on a 4090.

pzo yesterday at 11:43 PM

Haven't tested it yet, but I'm wondering how it behaves with IT jargon and tech acronyms. For that reason I mostly had to run an LLM after STT, but that slowed down Parakeet inference. Otherwise it sometimes had problems detecting terms properly when talking about, e.g., CoreML, int8, fp16, half float, ARKit, AVFoundation, ONNX, etc.
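A much cheaper alternative to a full LLM pass is a fixup table for the specific mis-transcriptions you keep seeing. This sketch is hypothetical (the entries are invented examples of mistakes an STT model might make, not observed Moonshine output):

```python
import re

# Hypothetical post-STT fixup table for tech jargon the model tends to mangle.
JARGON = {
    "core ml": "CoreML",
    "a r kit": "ARKit",
    "av foundation": "AVFoundation",
    "onyx": "ONNX",
}

def fix_jargon(text):
    """Case-insensitive whole-phrase replacement of known mis-transcriptions."""
    for wrong, right in JARGON.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text,
                      flags=re.IGNORECASE)
    return text

print(fix_jargon("exporting the core ml model to onyx"))
# exporting the CoreML model to ONNX
```

The obvious trade-off: a static table can't disambiguate ("onyx" the stone would get rewritten too), which is exactly where the LLM pass earns its latency cost.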

saltwounds today at 1:29 AM

Streaming transcription is crazy fast on an M1. Would be great to use this as a local option versus Wispr Flow.

oezi today at 3:28 AM

Do you also support timestamps for the detected words, or even down to characters?

starkparker today at 2:46 AM

I implemented this to transcribe voice chat in a project, and the English streaming accuracy was unusable, even with the medium streaming model.

g-mork yesterday at 11:32 PM

How does this compare to Parakeet, which runs wonderfully on CPU?

sroussey yesterday at 11:49 PM

ONNX models for the browser possible?

lostmsu yesterday at 10:55 PM

How does it compare to Microsoft VibeVoice ASR https://news.ycombinator.com/item?id=46732776 ?

raybb today at 3:56 AM

FYI, the Typepad link in your bio is broken.

alexnewman today at 12:40 AM

If only it did Doric

cyanydeez yesterday at 10:30 PM

No LICENSE no go
