The latest rounds of open weights vision language models are incredibly good. Like, massively good. Open weights vision capabilities trade blows with frontier models. Over the last few months I'd roughly rank capabilities as Gemini -> {chatgpt and SoTa open weights models} -> Claude.
qwen3.5-2b and qwen3.5-4b are great at document parsing. They can run on CPU
qwen3.6-27b and gemma4-31b are borderline better than the human eye in some cases. Their OCR isn't perfect, but they're seriously good. They can still run on the CPU but you'll be waiting minutes per document.
You can demand JSON, YAML, MD, or freeform text just by varying the prompt. Even if you have a custom template, you can just put that in the prompt and they'll do an OK-ish job.
There's also models that aren't in the r/locallama zeitgeist. IBM released a new 4b parameter model for structured text extraction last week, and there's a sea of recent chinese OCR models too.
IMO the open wights models are so good that in a lot of cases it's not worth paying frontier labs for OCR purposes. The only barrier to entry is the effort to set up a pipeline, and havin the spare CPU/GPU capacity.