My instinct was also to use LLMs for this, but it was way too slow and still expensive if you want to scrape millions of pages.
To put things in perspective: Gemini 2.5 Flash is $0.30 per 1M tokens. Assuming each page is 700 tokens and the output is small, 1M pages is 700M tokens, so you are looking at about $210 for 1M pages.
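If you want to plug in your own numbers, here is a rough sketch of that estimate in Python. The function name `estimate_input_cost` is made up for illustration, and it assumes the quoted price applies to input tokens and that output tokens are negligible, same as above.

```python
def estimate_input_cost(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Rough input-only cost estimate in USD (output tokens are ignored)."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Numbers from the comment above: 1M pages, 700 tokens each, $0.30 per 1M tokens.
cost = estimate_input_cost(pages=1_000_000, tokens_per_page=700, usd_per_million_tokens=0.30)
print(f"${cost:,.2f}")  # $210.00
```

Real pages are often much larger than 700 tokens if you send raw HTML, so it is worth measuring your own average before trusting the total.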