I'm surprised at the low rate every model manages, considering the (apparent) ease of the benchmarked document. Can your pipeline produce ground truth as a byproduct? How do you think open-weight OCR models compare to the one showcased? I've had good results with glm-ocr on complex documents (complex because of their handwriting; the layouts were pretty easy).
What I like about your solution is the traceability of the information. A scruffy pipeline I used was gemini-flash 3.0 to pdf to NotebookLM (really amateurish work, I know), but it yielded tremendous time gains when extracting info from documents (some of which were borderline impossible for me to read). However, tracing the info back to its source was obviously very tedious. From my experience, though, NotebookLM can now manage OCR/HTR without a third party. I wonder how competitive your solution might be compared to messy workflows that work, albeit with effort, but let the researcher be "in contact" with the material.
What I really want is obviously an easy-to-set-up local RAG system, with the (very) light model that goes with it ... sweet dream.
We were also surprised at first. The reason the models don't do so well is that they need to find information across 90k pages. When they are pointed to the right location they tend to do much better. And with these treasury documents grepping / keyword searching is almost impossible because everything appears thousands of times.
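A toy sketch of why plain keyword search struggles on this kind of material. The page texts below are made up for illustration (not taken from the real corpus), and `matching_pages` and `idf` are hypothetical helper names. It only shows that a term appearing on nearly every page does nothing to narrow down the location:

```python
import math

# Hypothetical stand-ins for pages; the real corpus is far larger.
pages = [
    "Treasury statement: total receipts and outlays for the month",
    "Treasury statement: total receipts by source",
    "Treasury statement: total outlays by agency",
    "Treasury statement: public debt outstanding, total",
]

def matching_pages(term, pages):
    """Return indices of pages containing the term (case-insensitive), like a grep."""
    term = term.lower()
    return [i for i, text in enumerate(pages) if term in text.lower()]

def idf(term, pages):
    """Inverse document frequency: 0 when the term appears on every page."""
    hits = len(matching_pages(term, pages))
    return math.log(len(pages) / hits) if hits else float("inf")

for term in ("total", "receipts", "debt"):
    print(term, matching_pages(term, pages), round(idf(term, pages), 2))
```

In this toy set, "total" matches every page and so carries no signal, while "debt" happens to isolate a single page. At the scale of tens of thousands of near-identical pages, most useful terms behave like "total".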
And thank you, we also love the traceability; it's one of the aspects we have prioritized. Models will never be perfect, so rather than building the best model harness we went for the best human harness haha.
Tbh it's been a while since I've looked at NotebookLM, so I expect it has gotten better over time. One thing I found lacking in the past was the structure we could get out (which is what gives the traceability) - for example, a deep dive on some of the underlying data for this corpus: https://demo.parsewise.ai/projects/39bee9d8-d722-4b23-8894-e...
And yes, we're really excited whenever new open-weight models come out that push quality, price, and latency. We're finding that throughput is a big obstacle, so I'm looking forward to more of this running locally, but it will be a while.