> We use agents to navigate the app, making real-time decisions based on its state.
This still leads me back to my original question of how, though. If you're not using locators, are you just passing page contents to the LLM? Or using a multimodal model and, say, screenshotting? My experience with that has been pretty poor, worse than proper e2e scripts, and fairly expensive to boot.
Sorry for the insistence haha, just interested because it could be pretty groundbreaking if done well.