In my experience it's very doable to get reliable tool calling with a generic response format across models. You just need to disable native tool calling completely and define a clear response/tool format that conforms well to pretraining data across a variety of models (e.g. XML-like syntaxes).
For example:

```
<think>Let me take a look at that</think>
<read path="foo.txt"/>
```
The hard part is building a streaming XML parser that handles these responses robustly: it has to tolerate edge cases and normalize predictable mishaps in the conversation history so the model keeps adhering to the response format.
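A minimal sketch of what I mean, assuming the tag set from the example above (the event shapes and class names here are my own invention, not any particular library's API). The core idea is to buffer streamed chunks and only emit a tag once it has arrived in full; a real implementation also needs timeouts/recovery for tags that never close:

```python
import re

# Self-closing tool tag like <read path="foo.txt"/>
TOOL_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*/>')
# Paired tag like <think>...</think>
PAIR_RE = re.compile(r'<(\w+)>(.*?)</\1>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

class StreamParser:
    """Buffers streamed chunks; emits an event per complete tag or text run."""

    def __init__(self):
        self.buf = ""
        self.events = []  # ("think", body) | ("tool", name, attrs) | ("text", s)

    def feed(self, chunk: str) -> None:
        self.buf += chunk
        self._drain()

    def _drain(self) -> None:
        while self.buf:
            lt = self.buf.find("<")
            if lt == -1:                      # pure text, no tag in sight
                self.events.append(("text", self.buf))
                self.buf = ""
                return
            if lt > 0:                        # text before the next tag
                self.events.append(("text", self.buf[:lt]))
                self.buf = self.buf[lt:]
                continue
            # Buffer starts with '<': try to match one complete tag.
            m = PAIR_RE.match(self.buf) or TOOL_RE.match(self.buf)
            if not m:
                return                        # tag still incomplete; wait for more chunks
            if m.re is PAIR_RE:
                self.events.append((m.group(1), m.group(2)))
            else:
                attrs = dict(ATTR_RE.findall(m.group(2)))
                self.events.append(("tool", m.group(1), attrs))
            self.buf = self.buf[m.end():]

p = StreamParser()
for chunk in ['<think>Let me take a lo', 'ok at that</think> <read pa', 'th="foo.txt"/>']:
    p.feed(chunk)
# p.events → [('think', 'Let me take a look at that'), ('text', ' '),
#             ('tool', 'read', {'path': 'foo.txt'})]
```

Note the parser happily sits on a half-received tag until the next chunk completes it; the "normalizing predictable mishaps" part (unclosed tags, stray text inside tool calls, hallucinated attributes) is where most of the real work goes.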