logoalt Hacker News

corlinptoday at 6:49 PM1 replyview on HN

This is exactly the case today. Multimodal LLMs like gpt-4o-transcribe are way better than traditional ASR, not only because of deeper understanding but because of the ability to actually prompt it with your company's specific terminology, org chart, etc.

For example, if the prompt includes that Caitlin is an accountant and Kaitlyn is an engineer, if you transcribe "Tell Kaitlyn to review my PR" it will know who you're referring to. That's something WER doesn't really capture.

BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe


Replies

Bolwintoday at 9:24 PM

Many ASR models already support prompts/adding your own terminology. This one doesn't, but full LLMs especially such expensive ones aren't needed for that.