Hacker News

wongarsu, yesterday at 5:10 PM

I haven't checked any texts from the 500s, but I did some work with texts from the 1700s. Most of them had terrible transcriptions on archive.org, made using old Tesseract versions. You could probably improve them a lot with newer Tesseract versions. I went for the nuclear option and just passed the image of each page (along with some context on how the previous page ended) to Qwen2.5vl:32b and got near-perfect transcriptions. As you can tell from the old model, that was months ago, and vision models have only gotten better since.
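A rough sketch of that page-by-page loop, in case it helps. The model tag looks like an Ollama tag, so this assumes a local Ollama server and its `/api/generate` endpoint; the prompt wording, `CONTEXT_CHARS`, and all function names are made up for illustration and are not the setup actually used.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: local Ollama server
MODEL = "qwen2.5vl:32b"  # assumption: the model tag written the way Ollama expects it
CONTEXT_CHARS = 300  # hypothetical: how much of the previous page to carry over


def build_payload(image_bytes, previous_tail, model=MODEL):
    """Build the request body for one page: the image plus the end of the previous page."""
    prompt = "Transcribe the text on this page exactly as printed."
    if previous_tail:
        prompt += "\nThe previous page ended with:\n" + previous_tail
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def post_json(payload, url=OLLAMA_URL):
    """Send the payload to the server and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def transcribe_pages(pages, send=post_json):
    """Transcribe pages in order, giving each request the tail of the previous result."""
    results = []
    previous_tail = ""
    for image_bytes in pages:
        text = send(build_payload(image_bytes, previous_tail))
        results.append(text)
        previous_tail = text[-CONTEXT_CHARS:]
    return results
```

`pages` is a list of raw image bytes, one per scanned page, in reading order. The `send` parameter is only there so the loop can be tried with a stub before pointing it at a real model.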

Of course, in some cases vision models are a liability for OCR, because the errors they do make come out as plausible-sounding text instead of alphabet soup. But if you only use the transcription as input for an LLM, that doesn't matter. It only becomes a question of how much compute you are willing to throw at it.


Replies

vessenes, yesterday at 7:01 PM

Yes, exactly. What could be durable is not the specific transcription as of today (at least until it's perfect or 'good enough') but the website, the comments, and a process that can be rerun to produce improved results. That part seems likely to be valuable to me.