I’m working on a DOM agent and I think MCP is overkill. You have a few “layers” you can imply by just executing some simple JS (eg: visible text, clickable surfaces, forms, etc). 90% of the time, the agent can imply the full functionality, except for the obvious edge cases (which trip up even humans): infinite scrolling, hijacking navigation, etc.
In what world is this simpler than just giving the agent a list of functions it can call?
Do expose the accessibility tree of a website to llms? What do you do with websites that lack that? Some agents I saw use screenshots, but that seems also kind of wasteful. Something in-between would be interesting.
Question: Are you writing this under the assumption that the proposed WebMCP is for navigating websites? If so: It is not. From what I've gathered, this is an alternative to providing an MCP server.
Instead of letting the agent call a server (MCP), the agent downloads javascript and executes it itself (WebMCP).