Your voxtral.c work was a big motivator for me. I built a macOS menu bar dictation app (https://github.com/T0mSIlver/localvoxtral) around Voxtral Realtime; it currently runs on a voxmlx fork with an OpenAI Realtime WebSocket server I added on top.
The thing that sold me on Voxtral Realtime over Whisper-based models for dictation is the causal encoder. Whisper's encoder attends over a whole audio window, so transcription only begins once a chunk is complete; a causal encoder can emit tokens while audio is still arriving. Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX. On M1 Pro with a 4-bit quant through voxmlx it feels responsive enough for natural dictation, though I haven't done proper latency benchmarks yet.
Integrating voxtral.c as a backend is on my roadmap; a single native binary is much easier to bundle into a macOS app than a Python-based backend.
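For anyone curious what the streaming side looks like: a Realtime-style server sends transcript deltas as JSON events, and the app appends text as each one arrives. Here's a minimal sketch of the accumulation logic. The event names follow the OpenAI Realtime transcription events; the exact schema my server emits may differ, so treat this as illustrative:

```python
import json

def apply_event(transcript: str, raw: str) -> str:
    """Fold one Realtime-style JSON event into the running transcript."""
    event = json.loads(raw)
    # Delta events carry a small text fragment; append it immediately
    # so the UI updates while the user is still speaking.
    if event["type"] == "conversation.item.input_audio_transcription.delta":
        return transcript + event["delta"]
    # The completed event carries the final text for the item; prefer it
    # over the accumulated deltas in case the server revised anything.
    if event["type"] == "conversation.item.input_audio_transcription.completed":
        return event["transcript"]
    return transcript  # ignore unrelated event types

# Simulated stream: text shows up delta by delta, not all at the end.
events = [
    '{"type": "conversation.item.input_audio_transcription.delta", "delta": "hello "}',
    '{"type": "conversation.item.input_audio_transcription.delta", "delta": "world"}',
    '{"type": "conversation.item.input_audio_transcription.completed", "transcript": "hello world"}',
]
transcript = ""
for raw in events:
    transcript = apply_event(transcript, raw)
    print(transcript)
```

The point is that the UI never waits for the `completed` event to show something; each delta is paint-worthy on its own.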
> Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX
100%. I don’t understand how people are able to compromise on this.