I know this is highly contested, but I'll try explaining it anyway, because I keep seeing this and it's ... wrong.
Your comment is wrong both theoretically and practically.
First, the theory. The idea that model weights are "binary driver blobs" is technically wrong. I don't know why this is so common on a technical site, but anyway. An LLM consists of three main parts: the architecture, the inference code, and some values. All of these, combined, make an LLM.
Another important aspect, one that is widely misunderstood and will become apparent later, is that a model is created by deciding the architecture and then initialising it with some values. Those values can be all 0s, all 1s, or random (in practice it's random, but that's irrelevant). Technically, once a model is initialised, that's it. That is a model. If released, that would be, even for the most pedantic absolutists, undoubtedly open source.
Then, that model is adapted. The most important thing to understand here is that this is the preferred way of modifying a model. Actually, the only way. You can't (yet) come along later and decide to change something in the architecture. You can only change the values. That process is called training (pre, mid, post, etc). The technical process itself is the same for the model creators as it is for you. The means, know-how, etc. are different.
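To make the architecture/values split concrete, here's a minimal sketch. Everything in it is made up for illustration: the "architecture" is a single linear unit, and `forward`, `init_params` and `train` are hypothetical names, not anything from a real framework. The point is only that training returns new values while the code stays the same.

```python
import random

# Hypothetical toy "architecture": one linear unit, y = w * x + b.
# The structure lives in code; only the values in `params` ever change.
def forward(params, x):
    return params["w"] * x + params["b"]

def init_params(scheme="random", seed=0):
    # The three initialisation options mentioned above: all 0s, all 1s, or random.
    if scheme == "zeros":
        return {"w": 0.0, "b": 0.0}
    if scheme == "ones":
        return {"w": 1.0, "b": 1.0}
    rng = random.Random(seed)
    return {"w": rng.uniform(-1.0, 1.0), "b": rng.uniform(-1.0, 1.0)}

def train(params, data, lr=0.05, steps=2000):
    # Plain gradient descent on mean squared error.
    # Returns new values; `forward` (the architecture) is untouched.
    w, b = params["w"], params["b"]
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}

# Toy data generated from y = 2x + 1 (an assumption for the example).
data = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

untrained = init_params("zeros")   # already "a model", just a useless one
trained = train(untrained, data)   # same architecture, different values
```

Both `untrained` and `trained` run through the exact same `forward`. Anyone holding the values can call `train` again on their own data, which is the sense in which the values are the thing you modify.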
Now, what licensing does, and the only thing that licensing can do is to give you rights to inspect, modify and release that model. That's it. A license will never give you (it cannot) the right to have the internal IP, knowledge, know-how or the "why's" on how the model was edited. That's on you. You have the right to modify, but you can't get the right to know how others have modified it, from a license file. Never had, never will.
(A simplified version of this is to think about an algorithm to control a drone. Usually that'd be a PID controller. Imagine someone releases an algorithm under an open source license. That algorithm consists of architecture, loop code, and some values. Even if those values are all set to 0.5 (in which case your drone might crash) or any other values, the values themselves do not change the status of the code. It's still open source, even if the values are fixed, or random, or dreamt up by the original coder, or received from the aliens themselves.)
I mentioned above that editing the values of a model is the preferred way of modifying the model, and that's exactly how Apache 2.0 defines "Source" form:
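A rough sketch of that drone analogy, assuming a textbook PID loop. The `Gains` and `PID` names are hypothetical, and the 0.5 defaults are just the example numbers from the paragraph above, not tuned for anything.

```python
from dataclasses import dataclass

@dataclass
class Gains:
    # The "values": illustrative gains, analogous to model weights.
    kp: float = 0.5
    ki: float = 0.5
    kd: float = 0.5

class PID:
    # The "architecture" plus loop code: identical no matter what the gains are.
    def __init__(self, gains):
        self.gains = gains
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        # No derivative term on the first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        g = self.gains
        return g.kp * error + g.ki * self.integral + g.kd * derivative

default_pid = PID(Gains())               # ships with 0.5 everywhere
retuned_pid = PID(Gains(2.0, 0.0, 0.0))  # someone else's values, same code
```

Swapping `Gains()` for `Gains(2.0, 0.0, 0.0)` changes how the drone flies, but not a single line of the controller.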
> "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
----
Now, the practice. In practice, we do have fully open (open data, open training code, open source models) models. Apertus, from Switzerland and Olmo from the US. Don't get me wrong, it's absolutely great that we have these models, they are very important for the community, and they do help inform everyone about what works, what doesn't, and so on. But ... no-one uses them. Because they are not at the top, compared to other models.
And, on a technical note, the idea that "dataset" + training code = bit-for-bit recreation is also not true. Anyone that has done any large scale training can tell you that. Between the randomness inherent in the process, the occasional training run re-starts and so on, you will never get the same model twice (at reasonable scales), even if you'd have the available compute. Which, let's be serious, no-one at home has. So... yeah. It's a pointless aspect to care for anyway.
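One small, well-known contributor to that randomness can be shown in a few lines: floating-point addition is not associative, so summing the same numbers in a different order can give a different result. This is only a sketch of the effect. The `naive_sum` helper and the three "gradient" values are made up for the example, and real training stacks have other sources of nondeterminism on top of this.

```python
def naive_sum(xs):
    # Left-to-right accumulation, standing in for one particular reduction order.
    total = 0.0
    for x in xs:
        total += x
    return total

# Pretend these are per-worker gradient contributions (illustrative values).
grads = [0.1, 0.2, 0.3]

forward_order = naive_sum(grads)        # ((0.0 + 0.1) + 0.2) + 0.3
reverse_order = naive_sum(grads[::-1])  # ((0.0 + 0.3) + 0.2) + 0.1

print(forward_order == reverse_order)   # False
print(forward_order, reverse_order)
```

The difference is in the last bit here, but when a parallel reduction can pick a different order on every step, differences like this get fed back into the values over the whole run.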
| Technically, once a model is initialised, that's it. That is a model. If released, that would be, even for the most pedantic absolutists, undoubtably open source.
That is true. But it is not the same model as the LLM created by combining the released weights with the released architecture. The thing that is the "binary blob" is the weights. It is pretty much exactly akin to a Linux driver that depends on linux-firmware. It is wonderful that it exists! But it is only partly open.
| Now, what licensing does, and the only thing that licensing can do is to give you rights to inspect, modify and release that model. That's it. A license will never give you (it cannot) the right to have the internal IP, knowledge, know-how or the "why's" on how the model was edited. That's on you. You have the right to modify, but you can't get the right to know how others have modified it, from a license file. Never had, never will.
| In practice, we do have fully open (open data, open training code, open source models) models. Apertus, from Switzerland and Olmo from the US. Don't get me wrong, it's absolutely great that we have these models, they are very important for the community, and they do help inform everyone about what works, what doesn't, and so on.
You seem to contradict yourself here. That said: I appreciate the correction of my perception that there aren't truly open large language models.
There are still things you can't do with an open-weight model without the training data, like modifying the architecture and training from scratch. That's different from true open-source code, where you can do anything the authors could do.
Good read thanks
The inference code is not part of an LLM, and there can be multiple different implementations of it. The model, the code to train the model, and the code to run the model are different things.
I don’t see how models can be licensed at all. There is no creative element in them.
As you say, you start with a random array and start mutating it until you get something that magically does interesting things.
Sure, you can hold copyright over all the software used to train the thing. And trade secrets or patents around your data selection, training methods, and infrastructure and such.
But unlike typical software compilation, the model isn’t a rote translation of something that has a creative element. Ordinary software has creative source code as input, mechanically processed into an output.
Models start with a bunch of inputs that are not the creative property of the model maker. Those non-creative inputs are not imbued with novel creativity, no matter how advanced the intermediate machinery may be.
By analogy, you may hold a copyright on the layout and creative elements of a phone book, but you have no rights over the actual data of phone numbers. Nor will any amount of ingenious layout engines or ad placement algorithms or complex printing press methods turn those numbers into something that can be licensed.
IANAL. This is truly baffling to me and it seems like everyone is going along with it because some corporate lawyer probably said “Iunno, let’s just say we are licensing this thing before release. Worst case, a court throws out the license”.