Agreed that “unstructured arbitrary phone calls + arbitrary actions” is where things go to die.
What does work in production (at least for SMB/customer-support-style calls) is making the problem less magical:
1) narrow domain + explicit capabilities (book/reschedule/cancel, take a message, basic FAQs)
2) strict tool whitelist + typed schemas + confirmations for side effects
3) robust out-of-scope detection + graceful handoff (“I can’t do that, but I can X/Y/Z”)
4) real logs + eval/test harnesses so regressions get caught
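To make points 2 and 3 concrete, here is a rough Python sketch of a whitelisted dispatcher with typed argument checks, a confirmation gate for side effects, and an out-of-scope fallback. It is illustrative only: every name in it (Tool, Result, TOOLS, dispatch, book, faq) is hypothetical, not a real library API and not a description of any particular product's internals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    handler: Callable[..., str]
    required: dict[str, type]  # typed schema: argument name -> expected type
    side_effect: bool          # True means the caller must confirm first


@dataclass(frozen=True)
class Result:
    kind: str     # "done", "confirm", "invalid", or "out_of_scope"
    message: str


# Hypothetical handlers; a real system would call a calendar or CRM here.
def book(name: str, slot: str) -> str:
    return f"Booked {name} for {slot}"


def faq(topic: str) -> str:
    return f"Here is what I know about {topic}"


# The whitelist: a tool that is not listed here cannot be called at all.
TOOLS = {
    "book": Tool(book, {"name": str, "slot": str}, side_effect=True),
    "faq": Tool(faq, {"topic": str}, side_effect=False),
}


def dispatch(tool_name: str, args: dict, confirmed: bool = False) -> Result:
    tool = TOOLS.get(tool_name)
    if tool is None:
        # Out of scope: decline, then say what is actually on offer.
        offered = ", ".join(sorted(TOOLS))
        return Result("out_of_scope", f"I can't do that, but I can help with: {offered}")

    # Schema check: exactly the expected argument names, each of the right type.
    if set(args) != set(tool.required):
        return Result("invalid", f"Expected arguments: {sorted(tool.required)}")
    for key, expected in tool.required.items():
        if not isinstance(args[key], expected):
            return Result("invalid", f"Argument {key!r} must be {expected.__name__}")

    # Side effects never run until the caller has explicitly confirmed.
    if tool.side_effect and not confirmed:
        return Result("confirm", f"Just to confirm: {tool_name} with {args}?")

    return Result("done", tool.handler(**args))
```

A few calls such as dispatch("wire_money", {}) (out of scope), dispatch("book", {"name": "Ana"}) (invalid, missing slot), and dispatch("book", {"name": "Ana", "slot": "Tue 10:00"}) with and without confirmed=True can double as the seed of the regression suite from point 4, since each one pins down a behavior that should not silently change.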
Once you do that, you can get genuinely useful outcomes without the role-play traps you’re describing.
We’ve been building this at eboo.ai (voice agents for businesses). If you’re curious, happy to share the guardrails/eval setup we’ve found most effective.