Hacker News

Interfaze: A new model architecture built for high accuracy at scale

152 points by yoeven yesterday at 4:22 PM | 36 comments

Comments

nickserv today at 11:32 AM

Gave it a try for structured data extraction. Tested returning a JSON object from images.

The output was correct, and seemed deterministic, although I ran it only 2-3 times on the same image.
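A repeat-run check like this can be automated. The following is only a minimal sketch: the field names and the three sample outputs are invented for illustration and do not come from the Interfaze API. It compares the parsed JSON values rather than the raw strings, so differences in key order or whitespace do not count as nondeterminism.

```python
import json


def canonical(raw: str) -> str:
    """Parse a JSON string and re-serialize it with sorted keys and fixed separators."""
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))


def is_deterministic(raw_outputs: list[str]) -> bool:
    """True if every run produced the same JSON value, ignoring key order and whitespace."""
    return len({canonical(raw) for raw in raw_outputs}) == 1


# Hypothetical outputs from three runs on the same image (invented for illustration).
runs = [
    '{"invoice_no": "A-17", "total": 42.5}',
    '{"total": 42.5, "invoice_no": "A-17"}',
    '{ "invoice_no": "A-17", "total": 42.5 }',
]
print(is_deterministic(runs))
```

Two or three identical runs are weak evidence either way; a larger sample on several images would say more.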

The main problem is response time: it took about 20-25 seconds for a simple structure of 5 fields. As such, it is unusable at scale, let alone for "real time" processing.

The other problem is cost: it is considerably more expensive than more established models, like flash-light, for the same document.

Shame, the architecture is very interesting.

schanz yesterday at 8:52 PM

Amazing!

I just tried the OCR capabilities with a photo of a DIN A4 page which was written with a typewriter. The image isn't the easiest to interpret. The text perspective is distorted because the page is part of a book and the page margin toward the spine of the book is very small. There are also many inline corrections due to typing errors made while the page was written (backspace couldn't erase characters back then, and arrow keys couldn't be used to add text in between existing words). Over the past months I've already tried several LLMs on this very same image (one of 200 pages awaiting digitization). This result is by far the most accurate so far. Only some very minor errors (which are also non-trivial for human translators) were made.

This page cost about 25 cents. I assume I could tweak the input image a little more to consume fewer input tokens. OCR-ing all 200 pages would otherwise cost a juicy $50, although there is a generous $20 of free credits.

Induced cost:
108.8k input tokens => 16.32 cents
24.5k output tokens => 8.58 cents
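The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers reported in this comment; the per-million-token rates it prints are implied by those numbers and are not taken from any official price list.

```python
# Figures reported in the comment (one page).
input_tokens, input_cost_cents = 108_800, 16.32
output_tokens, output_cost_cents = 24_500, 8.58
pages = 200
free_credit_usd = 20.0


def usd_per_million(tokens: int, cost_cents: float) -> float:
    """Implied price in USD per one million tokens."""
    return cost_cents / 100 / tokens * 1_000_000


input_rate = usd_per_million(input_tokens, input_cost_cents)
output_rate = usd_per_million(output_tokens, output_cost_cents)
per_page_cents = input_cost_cents + output_cost_cents
total_usd = per_page_cents / 100 * pages
after_credit_usd = max(total_usd - free_credit_usd, 0.0)

print(f"implied input rate:  ${input_rate:.2f} per 1M tokens")
print(f"implied output rate: ${output_rate:.2f} per 1M tokens")
print(f"cost per page:       {per_page_cents:.2f} cents")
print(f"cost for {pages} pages:  ${total_usd:.2f}")
print(f"after free credits:  ${after_credit_usd:.2f}")
```

The per-page total of 24.90 cents matches the "about 25 cents" estimate, and 200 pages come to $49.80 before the free credits are applied.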

// Edit: I just re-tried the same task using an API capability that runs only a specific part of the model (e.g. _only_ OCR). This cuts cost by 3x (to ~8c/page) but significantly worsens the result. The result is missing entire lines of the original document. There are also many errors in the text that was recognized.

gok today at 12:32 AM

Ok that's...just cheating. You can't take a benchmark like MMLU, designed to test the performance of a single general language model, and compare that to the performance of a small specialized model designed to do well on MMLU.

euroderf yesterday at 6:40 PM

Potentially stupid question: Does that mean we can chain them together like UNIX command-line programs? That would be so, so intuitive.

wood_spirit yesterday at 5:42 PM

> These are deep neural network architectures that are task-specific for things like OCR, translation, or GUI detection. The way they consume and see data is trained to be task specific, which makes them up to 100x more accurate at their specific task. They also produce useful metadata like bounding boxes and confidence scores, letting developers build predictable workflows they can rely on.
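To make the "predictable workflows" point in the quote concrete, here is a minimal sketch of routing OCR results by confidence score. The `Detection` type, its fields, the sample values, and the 0.9 threshold are all assumptions for illustration; the actual Interfaze response format is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical OCR result: recognized text, pixel bounding box, confidence."""
    text: str
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
    confidence: float  # 0.0 to 1.0


def route(detections: list[Detection], threshold: float = 0.9):
    """Split detections into auto-accepted ones and ones needing human review."""
    accepted = [d for d in detections if d.confidence >= threshold]
    review = [d for d in detections if d.confidence < threshold]
    return accepted, review


# Invented sample data, not real model output.
sample = [
    Detection("Invoice", (10, 10, 120, 40), 0.99),
    Detection("T0tal", (10, 60, 90, 90), 0.62),
]
accepted, review = route(sample)
print([d.text for d in accepted], [d.text for d in review])
```

The threshold is the design choice here: a higher value sends more items to review but lets fewer errors through unchecked.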

Does code extraction and manipulation fit in that? Would Interfaze be the agent that a coding agent uses?

bazzmt today at 6:13 AM

Interesting approach! One question though: can the model do column detection?

The first OCR example returns output that does not detect the article columns - the bounding box is the entire first line.

andai yesterday at 7:44 PM

This is very cool, though I don't understand exactly what they've done here. Is it some kind of LLM with convolutional layers added?

The graph doesn't exactly make it clear but it describes a pipeline that goes beyond the LLM, so the CNN could be a separate model there.

pss314 yesterday at 11:24 PM

Interfaze.ai at YC Launch Live - May 8th, 2026 https://youtu.be/S9Lgp2hWBsE?t=4185

fraywing yesterday at 8:05 PM

So is this basically a task-specific MoA transformer arch with a DNN that helps make routing decisions? Trying to understand this.

sareiodata yesterday at 5:32 PM

Smaller models really aren't great at structured output. If this works, it would be great for a local model: it might not be as good, but as long as it respects structured output it will be vastly more useful.

jadbox yesterday at 10:43 PM

Can this run locally or is this a service?

sweaterkokuro yesterday at 5:30 PM

This is cool, I'd love to be able to fine-tune on this architecture. Is this something on the roadmap?

florians yesterday at 8:12 PM

What I want are precise and tight bounding boxes. Why is this so difficult?

vivzkestrel today at 3:19 AM

Does it handle source code extraction from images?

How do I run it locally?

icemaze yesterday at 8:07 PM

Great in the benchmarks but not as good in the real world, sorry to say. I just gave it a try in my STT bot and it's worse than Whisper.

redwood yesterday at 7:55 PM

Similar to a large action model?
