An eventual goal is likely to allow interacting with the LLM directly via audio tokens as both input and output, skipping TTS and STT completely.