labs invest multiple billion dollars a year each in private data, and that number is growing...

ivanovm • yesterday at 2:59 PM • 11 replies

Labs each invest multiple billions of dollars a year in private data, and that number is growing. Internet training data is not where frontier capabilities come from; that view is outdated.

Replies

Salgat • yesterday at 3:23 PM

This is a misleading statement. The "private data" is still largely publicly produced data that has been curated through private agreements instead of scraping, such as Reddit posts/comments (these are the "third-party data agreements" that companies like OpenAI mention). And yes, there is still a lot of processing done on this data, which is the norm for preparing training data.

maplethorpe • yesterday at 4:35 PM

Why are the leading models capable of regurgitating full copyrighted works such as "Harry Potter" and "On the Road"? Did they hire someone to type those out for them?

https://arxiv.org/abs/2601.02671

calgoo • yesterday at 3:17 PM

When did they start doing so? We all know that they DID train on all the available public information, so at what point did they stop? Is the public information still in the training set? If so, they should STILL release ALL the data as public, as they are including training data that was acquired without permission.

no_multitudes • yesterday at 3:40 PM

> internet training data is not where frontier capabilities come from

In that case, it should be no problem for the labs to train their new models without using public data, right?

islandfox100 • yesterday at 3:09 PM

Then it should be simple for one of the frontier labs to produce a model trained only on private data. We haven't seen that.

disgruntledphd2 • yesterday at 3:50 PM

> internet training data is not where frontier capabilities come from

We 100% would not be at the current progress without it, though. And it's not like they only train on this once. They keep training on all the internet data PLUS the private data. Private data only (probably) wouldn't work, as learning the base regularities of language takes a lot of weights.

4bpp • yesterday at 4:56 PM

Define "come from". Could they have gotten those frontier capabilities, or any capabilities, without internet training data? It seems to me that without the private data, you might get a slightly less competitive model, but without the CommonCrawl-style data piles used in "pretraining", you get no model at all.

Even accepting the copying-as-theft framing, if I go to a village, steal some vegetables from everyone's gardens and ham from their sheds, and then add some prohibitively expensive spices I bought myself to make soup, do I get to claim it as mine and punish the villagers for trying to take it?

Guillaume86 • yesterday at 3:31 PM

Great way to launder illegally obtained data too.

pastel8739 • yesterday at 3:19 PM

Does this private data come from places like Reddit, Twitter, etc., where it’s contributed by users? I think it is unethical for these companies to accept payment for user-contributed data.

shimman • yesterday at 3:10 PM

Okay, that's fine, then make the law say they must provide publicly owned models built off of publicly obtained data. To think that such a baseline of critical information isn't the literal foundation of everything they will do, both now and in the future, is just exposing what their end game is: control.

There's no reason not to, other than the poor little billion-dollar corporations not wanting to provide a public utility they stole from the public.

Anything that removes control from American big tech is a good thing for American citizens and the world writ large.

bfjvibybd6cuvu6 • yesterday at 3:19 PM

No, you're talking about fine-tuning, and most of it is coming from your customers or someone else's. Get off ya high horse.

Companies can't be trusted with society's need for open progress.
