You're perfectly free to scrape the web yourself and train your own model. You're not free...

arcfour • yesterday at 1:47 PM • 10 replies

You're perfectly free to scrape the web yourself and train your own model. You're not free to let Anthropic do that work for you, because they don't want you to, because it cost them a lot of time and money and secret sauce presumably filtering it for quality and other stuff.

Stole? Courts have ruled it's transformative, and it very obviously is.

AI doomerism is exhausting, and I don't even use AI that much, it's just annoying to see people who want to find any reason they can to moan.

Replies

petcat • yesterday at 1:58 PM

> Stole? Courts have ruled it's transformative, and it very obviously is.

The courts have ruled that AI outputs are not copyrightable. The courts have also ruled that scraping by itself is not illegal, only maybe against a Terms of Service. Therefore, Anthropic, OpenAI, Google, etc. have no legal claim to any proprietary protections of their model outputs.

So we have two things that are true:

1) Anthropic (certainly) violated numerous TOS by scraping all of the internet, not just public content.

2) Scraping Anthropic's model outputs is no different than what Anthropic already did. Only a TOS violation.

alpha_squared • yesterday at 3:02 PM

> You're perfectly free to scrape the web yourself and train your own model.

Actually, not anymore as a result of OpenAI and Anthropic's scraping. For example, Reddit came down hard on access to their APIs as a response to ChatGPT's release and the news that LLMs were built on top of scraping the open web. Most of the web today is not as open as before as a result of scraping for LLM data. So, no, no one is perfectly free to scrape the web anymore because open access is dying.

two_tasty • yesterday at 2:54 PM

"...free to scrape the web yourself and train your own model."

Yes, rich and poor are equally forbidden from sleeping under bridges.

jtbayly • yesterday at 1:55 PM

Wut? They did exactly the same thing!

Try this: If you want to train a model, you’re free to write your own books and websites to feed into it. You’re not free to let others do that work for you because they don’t want you to, because it cost them a lot of time and money and secret sauce presumably filtering it for quality and other stuff.

airstrike • yesterday at 2:13 PM

Guess who else spent a lot of time and money and secret sauce?

Do you hear the words coming out of your mouth?

nunez • yesterday at 2:57 PM

Lol; like heck we are. Try scraping the NYTimes at LLM scale. You can time how quickly you’ll get 420’ed or, at worst, hit with a C&D.

hax0ron3 • yesterday at 9:01 PM

It is transformative, but if I make a bunch of requests to their API and use the responses to distill my own model, that is also transformative.

andersonpico • yesterday at 3:45 PM

Your selective respect for work is a glaring double standard. The effort to produce the original content they scraped is orders of magnitude bigger than what it took to train the model, so if this wasn't enough to protect the authors from Anthropic, it shouldn't be enough to protect Anthropic from people distilling their models.

Your legal argument is all over the place as well. What is more relevant here: what the courts ruled or what you consider obvious? How is distillation less transformative than scraping? How does courts ruling that scraping to train models is legal relate to distillation?

Nobody is scoring you on neutrality points for not using AI much, and calling this doomerism is just a thought-terminating cliche that refuses to engage with the comment you're replying to.

In fact, your comment is not engaging with anything at all; you're vaguely gesturing towards potential arguments without making them. If you find discussing this exhausting, then don't, but also don't flood the comments with low-effort whining.

loremium • yesterday at 4:14 PM

Reminds me of `Don't Look Up` a bit. There is clearly an imbalance with regard to licenses with model providers, not even talking about knowledge extraction (yes, younger people don't learn properly now, and older generations forget), shortly before the rug-pull happens in the form of accessibility for people who aren't rich.

unethical_ban • yesterday at 2:16 PM

Let's talk ethics, not law. Why is it okay for these companies to pirate books and scrape the entire web and offer synthesized summaries of all of it, lowering traffic and revenue for countless websites and professions of experts, but it is not okay for others to try to do the same to an AI model?

Is the work of others less valid than the work of a model?
