This is great! I thought about doing something like this for karaoke, but I was wondering about the copyright implications of doing it server-side.
We already do this when ingesting podcasts: we cut clips from them and highlight the text as people speak. AssemblyAI also supports speaker diarization.
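For anyone curious, the highlighting step itself is small once you have word-level timestamps. Here is a minimal sketch, not our actual code: it assumes each word is a dict with "text", "start" and "end" keys in seconds, sorted by start time. That shape and the function name `active_word_index` are made up for illustration, so adapt them to whatever your transcription service returns.

```python
from bisect import bisect_right

def active_word_index(words, t):
    """Return the index of the word being spoken at time t, or None.

    `words` is an assumed shape: a list of dicts with "start" and "end"
    in seconds, sorted by "start" and non-overlapping.
    """
    starts = [w["start"] for w in words]
    # Last word whose start time is <= t
    i = bisect_right(starts, t) - 1
    # Only highlight if t falls before that word's end (not in a pause)
    if i >= 0 and t < words[i]["end"]:
        return i
    return None

# Hypothetical sample data for illustration
words = [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
]
```

You would call this on every playback time update and highlight the returned word. For long transcripts, build the `starts` list once instead of on every call.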
For videos recorded in our own livestreaming studio, we can bypass all of this by using the Web STT and TTS APIs, which gives us perfect timing and diarization without the need for server-side models.
It's problematic even client-side, since you don't have a sync license to show words timed to the song. You'd also need a bunch of other licenses, both for the lyrics themselves and for processing the original file into the instrumental.