espadrine, yesterday at 5:50 PM

It is quite impressive.

I saw similarly impressive performance about 7 months ago here: https://kyutai.org/stt

If I look at the architecture of Voxtral 2, it seems to take a page from Kyutai’s delayed stream modeling.

The delay is configurable because you can delay the stream by a variable number of audio tokens. Each audio token covers 80 ms of audio: that slice is converted to a spectrogram, fed to a convnet, and passed through a transformer audio encoder. The encoded audio embeddings, with a history of one embedding per 80 ms, are passed into a text transformer, which outputs a text embedding that is then converted to a text token. Each text token is thus also worth 80 ms, but there is a special [STREAMING_PAD] token for steps where no word should be produced.
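To make the timing concrete, here is a toy sketch of the alignment only, not of either model. It assumes one output step per 80 ms audio token, and that a word spoken at audio frame i is emitted at step i + delay. The function names, the example words, and the delay values are all made up for illustration.

```python
FRAME_MS = 80            # per the description above: one audio token covers 80 ms
PAD = "[STREAMING_PAD]"  # emitted on steps where no word is produced


def delay_ms(delay_tokens: int) -> int:
    """Latency implied by delaying the text stream by `delay_tokens` audio tokens."""
    return delay_tokens * FRAME_MS


def align_delayed(words_by_frame: dict[int, str], n_frames: int, delay_tokens: int) -> list[str]:
    """Toy alignment (hypothetical helper): one text token per 80 ms step.

    A word spoken at audio frame i is placed at step i + delay_tokens;
    every other step holds the padding token.
    """
    n_steps = n_frames + delay_tokens
    out = [PAD] * n_steps
    for frame, word in words_by_frame.items():
        out[frame + delay_tokens] = word
    return out


if __name__ == "__main__":
    # Illustrative values only: 5 audio frames (400 ms), text delayed by 2 tokens.
    print(delay_ms(2), "ms of delay")
    print(align_delayed({0: "hello", 3: "world"}, n_frames=5, delay_tokens=2))
```

A larger delay gives the text side more audio context before it has to commit to a word, at the cost of latency growing in 80 ms steps.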

There is no cross-attention in either Kyutai's STT or Voxtral 2, unlike Whisper's encoder-decoder design!