phainopepla2 • yesterday at 10:26 PM • 3 replies

Have any major open weight models been "open data"? Wouldn't that entail distributing vast amounts of copyrighted data?

Replies

Olmo from AllenAI has been releasing their full pipelines, including data [1]. A lot of it is just repackaged and resampled copyrighted data that has long been publicly available as dumps: Common Crawl, arXiv, Wikipedia, StackExchange, Reddit, all of which are presumably copyrighted under different licenses. Go on Hugging Face and you can find massive multi-TB data dumps used for pretraining.

It is just as legal as when Uber and Airbnb were running illegal taxis and hotels during their growth phase. I'm just waiting for some corporate IP law firm to learn about Hugging Face.

[1] https://huggingface.co/datasets/allenai/dolma3_pool

my123 • today at 12:02 AM

NVIDIA's recent Nemotrons tend to have open training data and code.

Probably as a base for people buying NVIDIA hardware to train their own models.

tuananh • today at 1:00 AM

IBM Granite has been open data from the beginning, IIRC.
