That's interesting because my experience has been almost the opposite. A few months ago I tested Gemini on converting screenshots of tables from PDF files into CSV. I tried it on several different tables and it got every one right. It consistently outperformed ChatGPT.
Tangentially related question: has anyone analyzed whether the content being converted could break the model?
Say you have a super dull PDF (or even a scan) that repeats the same line over and over. Could that push the model into one of those loops where it just keeps spewing nonsense?
Taking that further, could someone prompt-inject a model with a handwritten note that only gets "activated" once it's in the context?
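On the repetition question, one cheap guard is to check the converted output for a line repeating back to back far more often than the source does. A rough sketch, not something I've run against a real model: the function names, the `source_run` parameter, and the `slack` threshold are all made up for illustration, and it does nothing about the prompt injection case.

```python
def longest_repeated_run(text: str) -> int:
    """Length of the longest run of identical consecutive non-empty lines."""
    longest = 0
    current = 0
    previous = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines neither extend nor break a run
        if line == previous:
            current += 1
        else:
            current = 1
            previous = line
        longest = max(longest, current)
    return longest


def looks_like_loop(output: str, source_run: int = 1, slack: int = 3) -> bool:
    """Flag output whose longest repeated run exceeds what the source justifies.

    source_run: longest repeated run you expect from the source document
                (hypothetical input; you would have to estimate it yourself).
    slack: arbitrary tolerance before flagging.
    """
    return longest_repeated_run(output) > source_run + slack
```

If the source really does repeat a line ten times, pass `source_run=10` so legitimate repetition isn't flagged.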
Anyone who has used both knows this is inaccurate or dishonestly stated (i.e., you were using GPT nano or some nonsense).
The key here is that you used screenshots. This forces Gemini into "OCR mode" (i.e. actually looking at vision tokens) rather than trying to be clever with its tool calls.
The latter strategy almost entirely depends on the quality of the skills and tool calls exposed to a given agent.