xyzzy_plugh • last Thursday at 2:40 PM

I'm no longer certain what point you're trying to make.

Getting close less than half the time given three paragraphs as input still sounds like red-handed copyright infringement to me.

If I sample a copyrighted song in my new track, clip it, slow it down, and decimate the bit rate, a court would not let me off the hook.

It doesn't matter how much context you push into these things. If I feed them 50% of Moby Dick and they produce the next word, and I can repeatedly do that to produce the entire book (I'm pretty sure the number of attempts is wholly irrelevant: we're impossibly far from monkeys on typewriters), then we can prove the statistical model encodes the book. The further we are from that, and the more we can generate with less, the stronger the case is. It's a pretty strong case!

Replies

ndriscoll • last Thursday at 2:53 PM

That's... not how this works.

> If I feed them 50% of Moby Dick and they produce the next word and I can repeatedly do that to produce the entire book... then we can prove the statistical model encodes the book.

It can't because it doesn't. That's what it means to say it diverges.

The "number of attempts" is you cheating. You're giving it the book when you let it try again word by word until it gets the correct answer, and then claiming it produced the book. That's exactly the residual that I said characterizes the extent to which it doesn't know the book. Trivially, no matter how bad the model is, if you give it the residual, it can losslessly compress anything at all.

If you had a simple model that just predicts the next word given the current word (trained on word-pair frequency across all English text, or even all text excluding Moby Dick), and then gave it retries until it got each word right, it would also quickly produce the book. That's because your retry policy encoded the book, not the model. Without that policy, it will get it wrong within a few words, just like these models do.
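To make that concrete, here's a rough Python sketch of the word-pair model with retries. Everything in it is my own illustration: the function names, the toy training text, and the assumption that the guesser's vocabulary already contains every word of the target (otherwise no number of retries would ever land on it). The `retries` list it returns is the residual: the information the checker supplies each time it says "wrong, try again".

```python
from collections import Counter, defaultdict


def train_bigram(corpus_words):
    """Count which word follows which in the training text."""
    table = defaultdict(Counter)
    for prev, nxt in zip(corpus_words, corpus_words[1:]):
        table[prev][nxt] += 1
    return table


def ranked_guesses(table, prev, vocab):
    """Guesses for the next word: bigram favourites first, then the rest of the vocabulary."""
    seen = [word for word, _ in table[prev].most_common()]
    rest = [word for word in vocab if word not in table[prev]]
    return seen + rest


def reproduce_with_retries(table, target_words, vocab):
    """Regenerate the target word by word, retrying until each guess is accepted.

    Returns (output, retries). retries[i] is how many wrong guesses were
    rejected before the checker, who holds the target, accepted word i + 1.
    Assumes every target word is in vocab (hypothetical setup).
    """
    output = [target_words[0]]
    retries = []
    for truth in target_words[1:]:
        guesses = ranked_guesses(table, output[-1], vocab)
        rank = guesses.index(truth)  # the checker reveals this rank one rejection at a time
        output.append(guesses[rank])
        retries.append(rank)
    return output, retries


def reproduce_without_retries(table, first_word, length):
    """Same model with no checker: always take the top guess, stop if there is none."""
    output = [first_word]
    while len(output) < length:
        top = table[output[-1]].most_common(1)
        if not top:
            break
        output.append(top[0][0])
    return output


if __name__ == "__main__":
    # Toy training text that deliberately does not contain the target.
    training = "the cat sat on the mat and the dog sat on the rug".split()
    target = "call me ishmael some years ago".split()
    vocab = sorted(set(training) | set(target))

    table = train_bigram(training)
    with_retries, retries = reproduce_with_retries(table, target, vocab)
    without_retries = reproduce_without_retries(table, target[0], len(target))

    print("with retries:   ", " ".join(with_retries))
    print("retries needed: ", retries)
    print("without retries:", " ".join(without_retries))
```

The model here has never seen the target text, yet with retries it reproduces it exactly, because the list of retry counts plus the first word is enough to reconstruct the text from any fixed guesser. Take the checker away and it diverges immediately.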
