I built a browser-based tool for detecting objects in satellite imagery using vision-language models (VLMs). You draw a polygon on the map and enter a text prompt such as "swimming pools", "oil tanks", or "buses". The system scans the selected area tile-by-tile and returns detections projected back onto the map as GeoJSON.
Pipeline: select an area and zoom level, split the region into Web Mercator tiles with mercantile, run each tile with the prompt through a VLM, convert the predicted bounding boxes from tile pixels to geographic coordinates (WGS84), and render the results back on the map.
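For anyone curious about the tiling and coordinate steps, here is a minimal sketch of the math involved. It is not the tool's actual code: it uses only the Python standard library instead of mercantile, and it assumes 256-pixel XYZ tiles and VLM boxes given as (x0, y0, x1, y1) in tile pixels with the origin at the top-left. All function names are hypothetical.

```python
import math

TILE_SIZE = 256  # assumed tile edge length in pixels


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the XYZ tile containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the east/south edge stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west, south, east, north, zoom):
    """List (x, y, zoom) for every tile touching a lon/lat bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return [(x, y, zoom)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]


def pixel_to_lonlat(x, y, zoom, px, py, tile_size=TILE_SIZE):
    """Convert a pixel inside tile (x, y, zoom) to (lon, lat) in WGS84."""
    n = 2 ** zoom
    fx = (x + px / tile_size) / n  # fractional position across the world, 0..1
    fy = (y + py / tile_size) / n
    lon = fx * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * fy))))
    return lon, lat


def bbox_to_geojson(x, y, zoom, box, label):
    """Turn a pixel-space box on one tile into a GeoJSON polygon feature."""
    x0, y0, x1, y1 = box
    west, north = pixel_to_lonlat(x, y, zoom, x0, y0)
    east, south = pixel_to_lonlat(x, y, zoom, x1, y1)
    # Counterclockwise exterior ring, closed by repeating the first point.
    ring = [[west, south], [east, south], [east, north],
            [west, north], [west, south]]
    return {
        "type": "Feature",
        "properties": {"label": label},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }
```

In this sketch, each tile from tiles_for_bbox would be fetched and sent to the VLM with the prompt, and every returned box would go through bbox_to_geojson before being collected into a FeatureCollection. Clipping tiles to the drawn polygon and merging duplicate detections across tile borders are left out.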
It works reasonably well for distinct structures in a zero-shot setting. Occluded objects are still better handled by specialized detectors such as YOLO models.
There is a public demo, and no login is required. I am mainly interested in feedback on detection quality, performance tradeoffs between VLMs and specialized detectors, and potential real-world use cases.