Hacker News

jubilanti, last Saturday at 11:56 PM (3 replies)

Olmo from AllenAI has been releasing their full pipelines, including data [1]. A lot of it is just repackaged and resampled copyrighted data that has long been publicly available as dumps: Common Crawl, arXiv, Wikipedia, StackExchange, Reddit, all of which are presumably copyrighted under different licenses. Go on Huggingface and you can find massive multi-TB data dumps used for pretraining.

It is just as legal as when Uber and AirBNB were running illegal taxis and hotels during their growth phase. I'm just waiting for some corporate IP law firm to learn about Huggingface.

[1] https://huggingface.co/datasets/allenai/dolma3_pool


Replies

__float, yesterday at 5:47 AM

It's rather off-topic at this point, but I've never understood how HF can afford to be a CDN for such huge files. It seems like enterprise customers must be subsidizing a lot, but...at that point, is there not a cheaper alternative that doesn't subsidize every hobbyist and startup around?

hnfong, yesterday at 6:54 AM

> I'm just waiting for some corporate IP law firm to learn about Huggingface.

Presumably they already know. The issue is that IP law firms are tiny compared to the trillions of capital pouring into "AI". And if you believe the USA is a capitalist country where the side with deeper pockets wins, you know you're not going to win against the trillionaires.

alchemist1e9, yesterday at 11:31 AM

Why is the text field in the dataset preview table populated with pornographic labels?