Gave it a try for structured data extraction: I tested having it return a JSON object from images.
The output was correct and seemed deterministic, although I only ran it 2-3 times on the same image.
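For anyone who wants to check determinism over more runs than I did, here is a minimal sketch. It assumes you have already collected the raw JSON strings from repeated calls on the same image; `canonical` and `all_runs_agree` are hypothetical helper names, not part of any model SDK.

```python
import json


def canonical(raw: str) -> str:
    # Parse and re-serialize with sorted keys and fixed separators, so that
    # differences in key order or whitespace do not count as a mismatch.
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))


def all_runs_agree(raw_outputs: list[str]) -> bool:
    # True when every run yields the same JSON value after canonicalization.
    return len({canonical(raw) for raw in raw_outputs}) <= 1
```

Agreement across a handful of runs is still only weak evidence, so treat a `True` here as "no variation observed", not as a guarantee.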
The main problem is response time: it took about 20-25 seconds for a simple structure of 5 fields. That makes it unusable at scale, let alone for "real time" processing.
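To put the latency in perspective, here is a rough back-of-the-envelope sketch. The 20-25 second figure is what I measured; the batch size of 10,000 documents and the `workers` parameter are assumptions for illustration only, and parallel requests may not scale linearly in practice.

```python
def docs_per_hour(seconds_per_doc: float, workers: int = 1) -> float:
    # Throughput if each worker processes documents one after another.
    return workers * 3600 / seconds_per_doc


def hours_for_batch(n_docs: int, seconds_per_doc: float, workers: int = 1) -> float:
    # Wall-clock hours needed to get through a batch at that throughput.
    return n_docs / docs_per_hour(seconds_per_doc, workers)


if __name__ == "__main__":
    batch = 10_000  # hypothetical batch size, not from my test
    for latency in (20, 25):
        print(
            f"{latency}s/doc: {docs_per_hour(latency):.0f} docs/hour, "
            f"{hours_for_batch(batch, latency):.1f} hours for {batch} docs"
        )
```

With a single sequential worker that works out to roughly 144-180 documents per hour, so a batch of that size would take days rather than hours.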
The other problem is cost: for the same document, it is considerably more expensive than more established models like flash-light.
Shame, the architecture is very interesting.