This is great, and I'm not knocking it, but every time I see these apps it reminds me of my phone.
My 2021 Google Pixel 6, when offline, can transcribe speech to text, and it also corrects things contextually. It can make a mistake, and as I continue to speak, it will go back and correct something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need 1 GB of transformers to do it on a more powerful platform?
Microsoft OneNote had this back in 2007 or so, though granted, the speech-to-text model wasn't nearly as advanced as the ones we have now.
I was actually on the OneNote team when they were transitioning to an online-only transcription model because there was no one left to maintain the legacy on-device system.
It wasn't any sort of planned technical direction, just a lack of anyone wanting to maintain the old system.
The accuracy is much lower though.
I've switched away from Gboard to FUTO on Android and exclusively use MacWhisper on macOS instead of the default Apple transcription model.
Interesting. My Pixel 7 transcription is barely usable for me. It makes way too many mistakes, which defeats the purpose of not having to type, but maybe that's just my experience.
The latest open-source local STT models people are running on devices are significantly more robust (e.g. Whisper models, Parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment don't trip up the SoTA models as much (though all of them still get tripped up sometimes).
I work in voice AI and use these models (both proprietary and local open-source) every day. It's a night-and-day difference for me.
macOS and iOS can do that too with the baked-in dictation. Globe key + D on Mac.
IMO, one of the best. It was surprisingly good. Yet they can't even replicate it on their own systems.
It's the same model used for the WebSpeech API, which can operate entirely offline.
Google mostly funded the training of this model around 10 years ago, and it's quite good.
There are many websites that are simple frontends for this model, which is built into WebKit- and Blink-based browsers. However, to my knowledge, the model is a blob packed into the apps and is not open source, hence the lack of Firefox support.
https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...
https://www.google.com/intl/en/chrome/demos/speech.html
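For anyone curious what one of those "simple frontends" looks like, here is a rough TypeScript sketch, not taken from any of the sites above. The `SpeechRecognition` / `webkitSpeechRecognition` constructor and the `continuous`, `interimResults`, `lang`, `onresult`, `isFinal` and `transcript` members come from the Web Speech API; the helper names `splitTranscript` and `startDictation` and the local `*Like` types are made up for illustration. It makes no assumption about whether recognition runs on the device or on a server, which depends on the browser.

```typescript
// Structural stand-ins for the Web Speech API result objects, declared
// locally (hypothetical names) so the sketch type-checks without DOM typings.
interface AlternativeLike { transcript: string }
interface ResultLike { isFinal: boolean; 0: AlternativeLike }
interface ResultEventLike { resultIndex: number; results: ArrayLike<ResultLike> }

interface Transcript { finalText: string; interimText: string }

// Split the session's results into text the engine has committed to
// (isFinal) and text it may still revise (interim).
function splitTranscript(event: ResultEventLike): Transcript {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < event.results.length; i++) {
    const result = event.results[i];
    const first = result[0].transcript; // first alternative of this result
    if (result.isFinal) finalText += first;
    else interimText += first;
  }
  return { finalText, interimText };
}

// Browser-only wiring (hypothetical helper). Throws where the recognition
// interface is missing, e.g. in Node or in a browser without support.
function startDictation(onUpdate: (t: Transcript) => void): void {
  const g = globalThis as any;
  const Ctor = g.SpeechRecognition ?? g.webkitSpeechRecognition;
  if (!Ctor) throw new Error("Speech recognition is not available here");
  const recognition = new Ctor();
  recognition.continuous = true;     // keep listening across pauses
  recognition.interimResults = true; // deliver revisable partial results
  recognition.lang = "en-US";        // assumption: English dictation
  recognition.onresult = (e: ResultEventLike) => onUpdate(splitTranscript(e));
  recognition.start();
}
```

In a page you would call something like `startDictation(t => { output.textContent = t.finalText + t.interimText; })`, where `output` is whatever element you render into. The interim part is what makes the text appear to go back and fix itself: each new event replaces the previous interim guess until the result is marked final.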