I'm running a service in production using Gemma 4 models, to get structured JSON output back from web search tool calls using Unsloth Studio and its API, but it did require a rather large and detailed system prompt and tool call healing if the format wasn't JSON for example (retries, reprompting with feeding the error back into the model, etc, this is also what Unsloth Studio does for its self-healing tool call feature). But once I did that, it's been working quite well and on benchmarks I've made, it's about 97% accurate after the first time and basically 100% accurate after retries.
This is running on a server though, not sure how well it'd work on a phone, I should try that. I used AI Edge Gallery on Android and it doesn't seem too good at the web search tool but maybe the web search tool itself, being a community made tool, is pretty bad, because tool calling via Unsloth Studio seems to work just fine with the exact same Gemma models on desktop/server vs the phone.
I agree that the web search tool probably is pretty bad. However a smart model would never hallucinate impossible weather data if the search tool failed.
I'm sure you can get some out of it if you babysit it with an optimized prompt, harness, etc and you can tolerate some failures. But when I try to run the ChatGPT prompts from my history, even if I pick the easier ones, it's hopeless.
I'd like to have a local agent on the phone with wikipedia level knowledge. But you probably need more like 30B params.