Hacker News

ndriscoll yesterday at 2:23 PM

Are you referring to this?

https://osyuksel.github.io/blog/reconstructing-moby-dick-llm...

I see a test where one model managed to reproduce 85% of a paragraph, given three paragraphs as input, less than 50% of the time.

So it can't produce one paragraph given three as input, and it can't even get close half the time.

"Contains Moby Dick" would be something like: you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that, when given passages, can do an okay job of predicting a sentence or two but otherwise quickly diverges.


Replies

xyzzy_plugh yesterday at 2:40 PM

I'm no longer certain what point you're trying to make.

Getting close less than half the time given three paragraphs as input still sounds like red-handed copyright infringement to me.

If I sample a copyrighted song in my new track, clip it, slow it down, and decimate the bit rate, a court would not let me off the hook.

It doesn't matter how much context you push into these things. If I feed them 50% of Moby Dick and they produce the next word, and I can repeat that to produce the entire book (I'm pretty sure the number of attempts is wholly irrelevant: we're impossibly far from monkeys on typewriters), then we can prove the statistical model encodes the book. The further we are from that (and the more we can generate with less), the stronger the case is. It's a pretty strong case!
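
For concreteness, here is a minimal sketch of the procedure described above: start from a prefix of the text, ask a predictor for the next word, allow a fixed number of attempts, and count how much of the remainder is recovered. Everything named here is an assumption for illustration: `recovered_fraction`, `make_memorizing_predictor`, and the `Predictor` interface are hypothetical, and the toy predictor is a lookup table built from the text itself, not a real language model or the method used in the linked post.

```python
from typing import Callable, Sequence

# Hypothetical interface: given the preceding words, return one guessed next word.
Predictor = Callable[[Sequence[str]], str]


def recovered_fraction(words: Sequence[str], predict_next: Predictor,
                       prefix_len: int, attempts: int = 1) -> float:
    """Fraction of the words after the prefix that the predictor gets right
    within `attempts` tries, always conditioning on the true preceding text."""
    if not 0 < prefix_len < len(words):
        raise ValueError("prefix must be non-empty and shorter than the text")
    hits = 0
    for i in range(prefix_len, len(words)):
        context = words[:i]
        # A word counts as recovered if any attempt matches it exactly.
        if any(predict_next(context) == words[i] for _ in range(attempts)):
            hits += 1
    return hits / (len(words) - prefix_len)


def make_memorizing_predictor(words: Sequence[str], order: int = 2) -> Predictor:
    """Toy stand-in for a model (assumption): maps the last `order` words
    to the word that first followed them in the text."""
    table = {}
    for i in range(order, len(words)):
        table.setdefault(tuple(words[i - order:i]), words[i])

    def predict(context: Sequence[str]) -> str:
        return table.get(tuple(context[-order:]), "")

    return predict


text = "call me ishmael some years ago never mind how long precisely".split()
predictor = make_memorizing_predictor(text)
score = recovered_fraction(text, predictor, prefix_len=len(text) // 2)
print(f"recovered {score:.0%} of the words after the prefix")
```

A predictor that has memorized the text scores near 1.0 here, and one that knows nothing scores near 0.0. Where a real model falls between those, and at what prefix length and attempt count, is the quantity this thread is arguing about.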
