logoalt Hacker News

general_revealyesterday at 7:18 PM1 replyview on HN

I’ve been thinking about something like this. If it’s just a one line script import, how the heck are you trusting natural language to translate to commands for an arbitrary ui?

The only thing I can think of is you had the AI rewrite and embed selectors on the entire build file and work with that?


Replies

simon_luv_phoyesterday at 7:54 PM

Everything happens at runtime, on the HTML level.

It uses a similiar process as `browser-use` but all in the web page. A script parses the live HTML, strips it down to its semantic essentials (HTML dehydration), and indexes every interactive element. That snapshot goes to the LLM, which returns actions referencing elements by index. The agent then simulates mouse/keyboard events on those elements via JS.

This works best on pages with proper semantic HTML and accessibility markup. You can test it right now on any page using the bookmarklet on the homepage (unless that page CSP blocks script injection of course).

show 1 reply