Hacker News

vessenes | yesterday at 11:34 AM | 2 replies

A few things come to mind about AI-led projects like this. First, it’s cool to see all this pulled together. I’m sure the design will soon read as “Claude 2026”, but that’s fine: it’s clean and generally has reasonable UX.

There are some real rough spots. For instance, the Latin texts are generated via OCR directly from scanned documents; they’re not from some other scholarly corpus that’s been checked. I only looked at a few, but they all have significant transcription errors. Sources are linked, and those sources seem to be archive.org scans. Of course, getting a fluid-sounding translation out of a somewhat shitty transcription is something AI will do for you happily, but it’s harder to get it to tell you where it’s gone off the rails.

That’s not the main thing that comes to mind, though. What comes to mind is that projects like this are super useful scaffolding, and I hope it’s built as such. Transcription will get better. Actually, I’m pretty sure it could be better now, given the output quality. Translations of better transcriptions will be better. Plus, we will likely have higher-quality translation tech available.

So, I’d like to see a project like this lean into the iterative side of this kind of scholarship/hobby/historical work and make versioning and logging of updates part of the interface. Starting in the late 1990s, many academic projects did this with large corpora of documents (I’m familiar at least with the Yale Jonathan Edwards project) and used crowdsourced support. There’s no reason not to include facilities that interleave the AI and interested Latin/Roman scholars here.
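To make the versioning idea concrete, here is a minimal sketch of what a per-passage revision log could look like. Everything in it is an assumption for illustration: the names `Revision` and `PassageHistory`, the `source` labels, and the log format are hypothetical and not taken from the project being discussed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    """One immutable version of a passage's transcription (hypothetical schema)."""
    text: str
    source: str        # who produced it, e.g. "ocr", "vision-model", "human"
    note: str          # free-text reason for the change
    created_at: datetime


@dataclass
class PassageHistory:
    """Append-only history for a single passage, oldest revision first."""
    passage_id: str
    revisions: list[Revision] = field(default_factory=list)

    def add(self, text: str, source: str, note: str = "") -> Revision:
        # Never overwrite: every update is appended so earlier versions stay auditable.
        rev = Revision(text, source, note, datetime.now(timezone.utc))
        self.revisions.append(rev)
        return rev

    @property
    def current(self) -> Revision:
        # Raises IndexError if no revision has been added yet.
        return self.revisions[-1]

    def log(self) -> list[str]:
        # Human-readable change log that an interface could display next to the text.
        return [f"v{i}: {r.source}: {r.note}" for i, r in enumerate(self.revisions, 1)]


history = PassageHistory("example-passage")
history.add("arma uirumque cano", "ocr", "initial scan")
history.add("arma virumque cano", "human", "normalised u/v")
print(history.current.text)
print("\n".join(history.log()))
```

The point of the append-only design is that an AI re-transcription and a scholar's correction end up in the same history, so a reader can see which one produced the text currently shown.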

In my mind, with that done, this could turn into a genuinely useful tool. Which would be cool!


Replies

wongarsu | yesterday at 5:10 PM

I haven't checked any texts from the 500s, but I did some work with texts from the 1700s. Most of them had terrible transcriptions on archive.org, made using old Tesseract versions. You could probably improve a lot with newer Tesseract versions. I went for the nuclear option and just passed the image of each page (along with some context on how the previous page ended) to Qwen2.5vl:32b and got near-perfect transcriptions. As you can tell from the old model, that was months ago, and vision models have only gotten better since.
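The page-by-page approach can be sketched roughly as follows. This is not the actual script: the model call is left as a caller-supplied function (`transcribe`, a hypothetical signature taking image bytes and a prompt), and the prompt wording and the `tail_chars` parameter are assumptions. Only the loop structure, one page at a time with the end of the previous page carried forward as context, comes from the description above.

```python
from typing import Callable, Iterable

# Hypothetical signature for whatever vision model is used:
# (page_image_bytes, prompt) -> transcribed text.
TranscribeFn = Callable[[bytes, str], str]


def build_prompt(previous_tail: str) -> str:
    """Build the per-page prompt, including how the previous page ended, if known."""
    prompt = "Transcribe this page exactly as printed. Do not correct or modernise the text."
    if previous_tail:
        prompt += f"\nThe previous page ended with: {previous_tail!r}"
    return prompt


def transcribe_book(
    pages: Iterable[bytes],
    transcribe: TranscribeFn,
    tail_chars: int = 200,
) -> list[str]:
    """Transcribe pages in order, passing each page the tail of the previous result."""
    results: list[str] = []
    previous_tail = ""
    for image in pages:
        text = transcribe(image, build_prompt(previous_tail))
        results.append(text)
        # Keep only the last few characters so the context stays small.
        previous_tail = text[-tail_chars:] if tail_chars > 0 else ""
    return results
```

In practice `transcribe` would wrap a call to a locally hosted or remote vision model. Keeping it as a parameter means the loop can be tested with a stub and the model can be swapped for a newer one later.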

Of course, in some cases vision models are a liability for OCR, because the errors they do make show up as plausible-sounding substitutions instead of alphabet soup. But if you only use the transcription as input for an LLM, that doesn't matter. It only becomes an issue of how much compute you are willing to throw at it.
