Hacker News

meander_water yesterday at 9:46 AM

I think most labs actively create synthetic data using an existing model as part of the mix for the pretraining stage of their next model.

Would love to know exactly what the latest process is to keep slop out of training data.


Replies

martinald yesterday at 11:17 AM

const isAiContent = (str) => str.includes('—');?

:)

madamelic yesterday at 12:50 PM

I think everyone overblows the whole "AI is poisoning AI!" thing. It could be a problem, but by my estimate the genuine value in Reddit or any other human social media is honestly pretty low. It's great for seeing how humans talk, but in terms of 'nutritional' value for truth or answers... I am not sold. If I were choosing what to 'feed' AI, I wouldn't even bother with textual social media (besides GitHub / GitLab / other source control).

If you're seeking answers, there's way more value in following the links to external sources, scraping books, and using other sources that aren't "unwashed masses saying whatever they want".
