Hacker News

OkayPhysicist yesterday at 6:27 PM

> "Literal CTRL+C, CTRL+V" is the only thing copyright has ever applied to

This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.

From 17 U.S.C. § 101: A "derivative work" is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications which, as a whole, represent an original work of authorship, is a "derivative work".

A training set is just an anthology, and the training process is condensation. That makes the weights a derivative work of every work in the training set.

Now, there's a separate discussion to be had about whether that derivative work meets the criteria for fair use, but that's its own tangent.


Replies

Gormo yesterday at 6:41 PM

> This is extremely false. Copyright additionally grants you exclusive control over the production and distribution of derivative works.

A derivative work is a work that itself includes copyrighted content from the original work.

That is to say that for something to be a derivative work, some measure of its content must be "CTRL-C, CTRL-V" from the originating work.

Something that's merely inspired by another work, or draws underlying themes or factual knowledge from it, is not a derivative work.

> A training set is just an anthology,

Which might make the training set itself a derivative work, but works created by using the model trained on that anthology are a different matter.

> and the training process is condensation.

No, it isn't. It's the creation of a new work that represents patterns extrapolated or interpolated from the data set, without the resulting model actually including any of the copyrighted elements of the work.

The underlying ideas and facts in the original work were never protected by copyright. Only the specific fixed form of expression is copyrightable.

Someone who looks at a dozen code examples in public repos to learn how to do, e.g., a quicksort, and then, having understood the algorithm's logic flow, writes his own quicksort implementation is not creating a derivative work of the code in the repos he examined. And the way LLMs work is much more similar to that process than to the "compressed anthology" concept you're describing.
