In our benchmarks we exclusively use a custom harness for measuring tool capability. It has the common tools that any harness would have, like a thin wrapper around shell commands, basic file editors, etc., but an important part of agentic intelligence is adapting to new tools. Frontier models are already quite adaptable, especially Anthropic models, and are improving with each release. I think a standardized format will become less and less important over time.
Benchmarks at https://gertlabs.com
This is backwards. If you think the models are capable of adapting to any format, then they will have an easier time adapting to the more popular, more common formats, and those formats will eventually become de facto standards.
The only case where a standard wouldn't win is one where models can only support their baked-in format, but even that could be solved by adopting a standard format.