We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.
LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:
- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters — the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over (sketched below).
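To make that concrete, here's roughly the stack we kept rebuilding. A minimal sketch, assuming Turndown for the markdown step and Zod for validation; callLLM stands in for whichever client you use:

    import TurndownService from "turndown";
    import { z } from "zod";

    // Illustrative target schema
    const Product = z.object({
      name: z.string(),
      price: z.number().optional(),
      url: z.string(),
    });
    const ProductList = z.array(Product);

    // Placeholder for whatever LLM client you use (OpenAI SDK, LangChain, ...)
    declare function callLLM(prompt: string): Promise<string>;

    async function extractProducts(html: string): Promise<z.infer<typeof ProductList>> {
      // 1. HTML cleanup + markdown conversion (nav/footer noise still gets through)
      const markdown = new TurndownService().turndown(html);

      // 2. LLM call
      const raw = await callLLM(
        `Extract every product as a JSON array of {name, price, url}:\n\n${markdown}`
      );

      // 3. JSON parsing + error recovery (the step that usually breaks at 2am)
      let parsed: unknown;
      try {
        parsed = JSON.parse(raw);
      } catch {
        return []; // one bad bracket and the whole batch is lost
      }

      // 4. Schema validation
      const checked = ProductList.safeParse(parsed);
      return checked.success ? checked.data : [];
    }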
We got tired of rebuilding this stack for every project, so we extracted it into a library.
Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:
- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely — if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
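A minimal call ends up looking roughly like this (simplified sketch; names and options are abridged here, the README has the exact exports and signatures):

    import { z } from "zod";
    // Simplified sketch; check the README for the exact exports and options.
    import { extract } from "@lightfeed/extractor";

    const schema = z.object({
      products: z.array(
        z.object({
          name: z.string(),
          price: z.number().optional(),
          url: z.string(),
        })
      ),
    });

    declare const rawHtml: string; // page HTML from fetch or Playwright

    const result = await extract({
      content: rawHtml, // raw HTML in; markdown conversion and URL cleaning happen internally
      schema,           // Zod schema drives the structured output and validates it
      // LLM provider/model/API key options omitted; any LangChain-compatible model works
    });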
We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.
GitHub: https://github.com/lightfeed/extractor
npm: npm install @lightfeed/extractor
Apache 2.0 licensed.
Happy to answer questions or hear feedback.
> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.
This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses Turndown to convert the HTML to Markdown first, then passes that to the LLM, so I'm assuming that's a huge reduction in tokens from the preprocessing. Do you have any data on how often this causes issues, i.e. tables or other information being lost?
Then LangChain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open-source models work best, or do you just use Gemini in production?
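For anyone unfamiliar, the generic LangChain pattern for that looks something like this (the usual shape, not necessarily their exact internals):

    import { ChatOpenAI } from "@langchain/openai";
    import { z } from "zod";

    const Products = z.object({
      items: z.array(z.object({ name: z.string(), price: z.number().nullable() })),
    });

    declare const markdown: string; // cleaned page content from the Turndown step

    const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
    const extractor = model.withStructuredOutput(Products);

    const result = await extractor.invoke([
      ["system", "Extract every product. Return only the fields defined in the schema."],
      ["human", markdown],
    ]);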
Also, looking at the docs, Gemini 2.5 Flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so you might want to update the examples to Gemini 3 Flash.
Would this work for my use case?
I need to extract article content, determine its sentiment towards a keyword, and output a simple JSON object with the article name, URL, sentiment, and some text around the found keyword.
Currently I'm having problems with the JSON output; it's not reliable enough and produces a lot of invalid JSON.
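In Zod terms, the output I'm after is roughly this (field names just describe what I need):

    import { z } from "zod";

    const ArticleSentiment = z.object({
      articleName: z.string(),
      url: z.string(),
      sentiment: z.enum(["positive", "neutral", "negative"]),
      keywordContext: z.string(), // text surrounding the matched keyword
    });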
The extraction prompt would need some hardening against prompt injection, as far as I can tell.
My instinct was also to use LLMs for this, but it was way too slow and still expensive if you want to scrape millions of pages.
My platform has 24M pages across 8 domains, and these NASTY crawlers insist on visiting every single one of them. For every one real visitor there are at least 300 requests from residential proxies. And that's after I blocked entire countries like Russia, China, Taiwan and Singapore.
Even Cloudflare's bot filter only blocks some of them.
I'm using honeypot URLs right now to block all crawlers that ignore rel="nofollow", but they appear to have many millions of devices. I wouldn't be surprised if there are a gazillion residential routers, webcams and phones that have been hacked to function as simple doorways.
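The honeypot itself is nothing fancy, roughly this (Express-style sketch; the path and in-memory ban list are simplified):

    import express from "express";

    const app = express();
    const banned = new Set<string>();

    // Drop any request from an IP that has hit the trap before.
    app.use((req, res, next) => {
      if (banned.has(req.ip ?? "")) return res.status(403).end();
      next();
    });

    // Linked from every page as a CSS-hidden <a href="/trap" rel="nofollow">,
    // so only crawlers that ignore nofollow ever request it.
    app.get("/trap", (req, res) => {
      banned.add(req.ip ?? "");
      res.status(403).end();
    });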
Things are really getting out of hand.
What's your experience with not getting blocked by anti-bot systems? I see you've got custom patches for that.
This feels like slop to me.
It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.
I also doubt the premise about malformed JSON. I have never encountered anything like what you are describing with structured outputs.
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
And it doesn't care about robots.txt.