I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites tha...

jasongill • yesterday at 11:08 PM • 8 replies • view on HN

I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy - something like https://www.example.com/cdn-cgi/cached-contents.json They already have the website content in their cache, so why not just cut out the middle man of scraping services and API's like this and publish it?

Obviously there's good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.

Replies

cortesoft • today at 4:35 AM

Well, the conversion process into the JSON representation is going to take CPU, and then you have to store the result, in essence doubling your cache footprint.

Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

Cache footprint management is a huge factor in the cost and performance for a CDN, you want to get the most out of your storage and you want to serve as many pages from cache as possible.

I know in my experience working for a CDN, we were doing all sorts of things to try to maximize the hit rate for our cache.. in fact, one of the easiest and most effective techniques for increasing cache hit rate is to do the OPPOSITE of what you are suggesting; instead of pre-caching content, you do ‘second hit caching’, where you only store a copy in the cache if a piece of content is requested a second time. The idea is that a lot of content is requested only once by one user, and then never again, so it is a waste to store it in the cache. If you wait until it is requested a second time before you cache it, you avoid those single use pages going into your cache, and don’t hurt overall performance that much, because the content that is most useful to cache is requested a lot, and you only have to make one extra origin request.

➕ show 1 reply

selcuka • today at 12:24 AM

Not the same thing, but they have something close (it's not on-by-default, yet) [1]:

> Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.

[1] https://blog.cloudflare.com/markdown-for-agents/

➕ show 1 reply

michaelmior • yesterday at 11:50 PM

> I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy

It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.

➕ show 2 replies

hrmtst93837 • today at 8:54 AM

Offering wholesale cache dumps blows up every assumption about origin privacy and copyright. Suddenly you are one toggle away from someone else automatically harvesting and reselling your work with Cloudflare as the unwitting middle tier.

You could try to gate this behind access controls but at that point you have reinvented a clunky bespoke CDN API that no site owner asked for, plus a fresh legal mess. Static file caches work because they only ever respond to the original request, not because they claim to own or index your content.

It is a short path from "helpful pre-scraped JSON" to handing an entire site to an AI scraper-for-hire with zero friction. The incentives do not line up unless you think every domain on Cloudflare wants their content wholesale exported by default.

➕ show 1 reply

cmsparks • yesterday at 11:33 PM

That would prolly work for simple sites, but you still need the dedicated scraping service with a browser to render sites that are more complex (i.e. SPAs)

brookst • today at 11:42 AM

That was my first thought when I read the headline. It would make perfect sense, and would allow some websites to have best of both worlds: broadcasting content without being crushed by bots. (Not all sites want to broadcast, but many do).

csomar • today at 12:01 AM

It’s a bit more complicated than that. This is their product Browser Rendering, which runs a real browser that loads the page and executes JavaScript. It’s a bit more involved than a simple curl scraping.

➕ show 1 reply

Fokamul • today at 12:37 PM

But think about poor phishers and malware devs protected by Cloudflare.

alt Hacker News

Replies