Given that audio and text tokens are one-to-one, you'd get mid-sentence pauses if you just stopped feeding it.