I recently tried using Qwen VL or Moondream to see if off-the-shelf they would be able to accurately detect most of the interesting UI elements on the screen, either in the browser or your average desktop app.
It was a somewhat naive attempt, but it didn't look like they performed well without perhaps much additional work. I wonder if there are models that do much better, maybe whatever OpenAI uses internally for operator, but I'm not clear how bulletproof that one is either.
These models weren't trained specifically for UI object detection and grounding, so, it's plausible that if they were trained on just UI long enough, they would actually be quite good. Curious if others have insight into this.