That's one aspect, which is a bit of a gray zone. But Anthropic trained on pirated books. That is explicitly illegal.
So did Meta for Llama.
The entire chat thread and email exchange were exposed in discovery; apparently Zuck signed off on it. In one of the IM exchanges, one of them says ‘everyone is doing it’.
As I understand it, what was "explicitly illegal" was copying the books, in the sense of merely copying them before feeding them to the model, and this is what the Anthropic copyright settlement is about.
Actually processing them through the model, though, was considered transformative and therefore fair use.
They didn't train on the books, and that court only found that the pirating was illegal anyway.
I'd love to see an open-source project that's basically a torrent client for downloading pirated material, but that trains an AI model "in the background" on the downloaded content. That way everyone can claim fair use for possessing copyrighted material. I mean, there's precedent, right?
That ship has sailed. I would wager all the AI labs are ingesting anything human-generated, whether that means Hollywood movies, Taylor Swift’s discography, YouTube videos or private GitHub source repos.
The reward for having a competitive edge is exponentially higher than the risk of a lawsuit. Politicians are still old bureaucrats who don’t understand technology.