I would try Qwen-ASR: https://qwen.ai/blog?id=qwen3asr
See the very bottom of that page for a transcription with timestamps.