logoalt Hacker News

nonethewisertoday at 1:05 AM4 repliesview on HN

[flagged]


Replies

paxystoday at 1:11 AM

The websites, music, movies, books, photos, art that they stole didn't appear out of thin air. The amount of time and effort people have collectively poured into creating these works throughout history far, far surpasses Anthropic's own effort of converting them into model weights.

bloppetoday at 1:14 AM

The equivocation is crawling website <-> crawling LLM responses.

Both Anthropic and Alibaba are trying to build bleeding edge LLMs. That part is the same. The way they source their data is slightly different, but they would both argue it constitutes fair use under Copyright law.

walrus01today at 1:06 AM

"Your extremely efficient multi petabyte internet content suction machine is ripping off my extremely efficient multi petabyte internet content suction machine"

Sucking down petabytes of peoples' copyrighted content that they never granted a specific license to you to use seems to be an unavoidable and default part of the process of building any huge LLM.

show 1 reply
epsteingpttoday at 1:09 AM

It's not really equivocation in this instance. This feels like a 'bad faith' comment. We can do better.

LLM's literally wouldn't work without the sum total of knowledge (in the forms of books and other copyrighted content) being used as 'training data' for these LLMs.

The 'bleeding edge' LLMs required many things, but: 1 Tech innovation ('attention') 2 Lots of compute 3 Data 4 Pre + post training

#4 doesn't happen without #3.

It's pretty obvious at this point that the major providers have stolen vast amounts of #3 - they have paid nearly 0 of the creators.

We can argue about the impact (I'd lean net good) vs. the cost. But arguing there isn't a cost is a bit silly.

show 2 replies