Rolling your own serverless OCR in 40 lines of code

103 points • by mpcsb • last Thursday at 1:23 PM • 51 comments

Comments

Not sure what “your own” in the title is supposed to mean if you are running a model that you didn’t train using a framework that you didn’t write on a server that you don’t own.

voidUpdate • today at 12:44 PM

Wouldn't "Serverless OCR" mean something like running Tesseract locally on your computer, rather than creating an AI framework and running it on a server?

kbyatnal • today at 1:30 PM

DeepSeek-OCR is no longer state of the art. There are much better open source OCR models available now.

ocrarena.ai maintains a leaderboard, and a number of other open source options like dots [1] or olmOCR [2] rank higher.

[1] https://www.ocrarena.ai/compare/dots-ocr/deepseek-ocr

[2] https://www.ocrarena.ai/compare/olmocr-2/deepseek-ocr

brainless • today at 2:14 PM

I am working on a client project, originally built using Google Vision APIs, and then I realized Tesseract is so good. Like really good. Also, if PDF text is available, then pdftotext tools are awesome.
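
A minimal sketch of the "use the PDF's text layer if there is one, OCR only otherwise" approach, assuming poppler-utils (`pdftotext`, `pdftoppm`) and tesseract-ocr are installed. `pdf_to_text` is a hypothetical helper name, not the commenter's code, and 300 dpi is an arbitrary choice:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefer the PDF's embedded text layer and only fall
# back to OCR when pdftotext returns nothing but whitespace (a scanned PDF).
# Requires poppler-utils (pdftotext, pdftoppm) and tesseract-ocr.
pdf_to_text() {
  local pdf=$1 text
  text=$(pdftotext -layout "$pdf" - 2>/dev/null || true)

  # strip all whitespace; anything left means a usable text layer exists
  if [ -n "${text//[[:space:]]/}" ]; then
    printf '%s\n' "$text"
    return 0
  fi

  local tmp img
  tmp=$(mktemp -d)
  # rasterize every page to PNG at 300 dpi, then OCR each image to stdout
  pdftoppm -r 300 -png "$pdf" "$tmp/page"
  for img in "$tmp"/page*.png; do
    [ -e "$img" ] || continue   # no pages rendered: the glob stays literal
    tesseract "$img" stdout 2>/dev/null
  done
  rm -rf "$tmp"
}
```

Usage would be `pdf_to_text report.pdf > report.txt`; the OCR path only runs for PDFs without extractable text.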

My client's use case was specifically scanning medical reports, but since there are thousands of labs in India, each with a slightly different format, I built an LLM agent that runs only after the PDF/image-to-text step, to double-check the medical terminology. Even then, it runs only when our code cannot already process a text line through simple string/regex matches.
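
A rough sketch of that routing step, under loud assumptions: the OCR output is a plain text file, and "lines our code can process" is approximated by one hypothetical regex (`KNOWN_LINE`) over made-up test names. Only what `needs_llm` prints would be sent to the LLM; the LLM call itself is left out:

```shell
#!/usr/bin/env bash
# Hypothetical pattern for lines the plain rules already understand:
# "<known test name> <number> <unit>", e.g. "Hemoglobin 13.5 g/dL".
# Matching is case-sensitive; the names and units here are illustrative only.
KNOWN_LINE='^(Hemoglobin|Glucose|Creatinine)[[:space:]]+[0-9]+(\.[0-9]+)?[[:space:]]+[A-Za-z/%]+$'

# matched_lines FILE: lines handled by simple regex rules
matched_lines() {
  grep -E "$KNOWN_LINE" "$1" || true
}

# needs_llm FILE: leftover lines that would go to the LLM for review
needs_llm() {
  grep -Ev "$KNOWN_LINE" "$1" || true
}
```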

There are perhaps extremely efficient tools for much of the work we currently throw at LLMs.

grimgrin • today at 3:02 PM

Hi. I run "ocr" from dmenu on Linux, which triggers maim so I can make a visual selection. A push notification shows the recognized text (a nice indicator of a whiff), and the text is also on my clipboard.

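  #!/usr/bin/env bash

  # requires: tesseract-ocr imagemagick maim xsel libnotify (notify-send)
  set -euo pipefail

  IMG=$(mktemp)
  # on exit, delete the temp file and the .png/.txt files derived from it
  trap 'rm -f "$IMG"*' EXIT

  # --nodrag: select the region with two clicks instead of click-and-drag;
  # if the selection is cancelled, maim exits non-zero and set -e stops here
  maim -s --nodrag --quality=10 "$IMG.png"

  # grayscale (saturation 0) and 4x upscale, which should increase the detection rate
  mogrify -modulate 100,0 -resize 400% "$IMG.png"

  # tesseract appends .txt to the output base name, producing $IMG.txt
  tesseract "$IMG.png" "$IMG" &>/dev/null
  xsel -bi < "$IMG.txt"
  notify-send "Text copied" "$(cat "$IMG.txt")"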

jbs789 • today at 9:02 PM

Why "rolling"? Is this a reference to baking or what's the origin?

Bishonen88 • today at 2:45 PM

Tried adding a receipt itemization feature to an app using OpenAI. It gets 95% right, but the remaining 5% is a mess. Mostly it mixes up prices between items (olive oil at 0.99 while bananas are 7.99). Is there some lightweight open source lib that can do this better?

lkm0 • today at 3:06 PM

So I'm trying to OCR thousands of pages of old French dictionaries from the 1700s. Has anything popped up that doesn't cost an arm and a leg and works pretty decently?
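
A minimal sketch of a resumable batch run with Tesseract, assuming the pages are already exported as PNG images and Tesseract's French model (`fra`) is installed. `ocr_dir` is a hypothetical helper, and how well `fra` handles 18th-century typography (long s, ligatures) is not something this checks:

```shell
#!/usr/bin/env bash
# Hypothetical helper: OCR every PNG page in SRC_DIR into OUT_DIR/<page>.txt,
# skipping pages that already have non-empty output so a crashed run can resume.
ocr_dir() {
  local src_dir=$1 out_dir=$2 lang=${3:-fra}
  local page base
  mkdir -p "$out_dir"
  for page in "$src_dir"/*.png; do
    [ -e "$page" ] || continue            # no PNGs: the glob stays literal
    base=$(basename "$page" .png)
    if [ -s "$out_dir/$base.txt" ]; then
      continue                            # already done in an earlier run
    fi
    # tesseract writes <outputbase>.txt; -l selects the language model
    tesseract "$page" "$out_dir/$base" -l "$lang" >/dev/null 2>&1 \
      || echo "failed: $page" >&2
  done
}
```

Usage would be `ocr_dir pages/ text/`; since each page is independent, the same loop can be split across machines by directory.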

coolness • today at 1:06 PM

Slight tangent: I was wondering why DeepSeek would develop something like this. In the linked paper it says

> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).

That... doesn't sound legal
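
For a sense of scale, assuming the quoted 200k pages means a full 24 hours on that one GPU, the figure works out to a bit over two pages per second. `throughput` is just a hypothetical helper for the arithmetic:

```shell
#!/usr/bin/env bash
# throughput PAGES_PER_DAY: convert a daily page count into pages/s and s/page
throughput() {
  awk -v ppd="$1" 'BEGIN {
    rate = ppd / (24 * 60 * 60)          # 86,400 seconds in a day
    printf "%.2f pages/s, %.2f s/page\n", rate, 1 / rate
  }'
}

throughput 200000   # prints: 2.31 pages/s, 0.43 s/page
```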

apwheele • today at 1:18 PM

Question for the crowd: with autoscaling, when a new pod is created, does it still download the model directly from Hugging Face?

I like to push as much as I can into the image. So in the Modal image definition, I would run a command that downloads the model, and then point the app at the locally downloaded copy. The image is bigger, but the model does not need to be re-downloaded on startup.
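
A minimal sketch of that build-time step in shell, meant to run while the image is built rather than when a pod starts. The repo id `deepseek-ai/DeepSeek-OCR`, the path `/models/deepseek-ocr`, and the use of `config.json` as a "download finished" marker are all assumptions, and `bake_model` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Hypothetical build step: download the weights once into a fixed directory
# inside the image, so new pods load from local disk instead of the Hub.
MODEL_REPO=${MODEL_REPO:-deepseek-ai/DeepSeek-OCR}   # assumed repo id
MODEL_DIR=${MODEL_DIR:-/models/deepseek-ocr}          # hypothetical path

bake_model() {
  # assumption: config.json present means an earlier download completed
  if [ -f "$MODEL_DIR/config.json" ]; then
    echo "model already present in $MODEL_DIR"
    return 0
  fi
  huggingface-cli download "$MODEL_REPO" --local-dir "$MODEL_DIR"
}
```

At runtime the app would load from `$MODEL_DIR`, and setting `HF_HUB_OFFLINE=1` keeps the Hugging Face libraries from contacting the Hub on startup. Newer huggingface_hub releases also ship the same download as `hf download`.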

bovinejoni • today at 2:50 PM

That book is already freely available from its author in PDF format… but I guess it's about the journey?

sails • today at 2:26 PM

Always wondered how auth validation works on these. Could I use your serverless OCR?

ddtaylor • today at 1:03 PM

How does this compare to Tesseract?

fzysingularity • today at 4:38 PM

The cold-boot time on this model can hardly be called “serverless”

PlatoIsADisease • today at 5:37 PM

Uh... So I've been telling AI to write a single-page HTML/JS OCR app, and I include the PDF I want as an attachment.

I have 4 of these now, some are better than others. But all worked great.

zeroq • today at 2:12 PM

tl;dr version:

  step 1 draw a circle
  step 2 import the rest of the owl
