Hacker News

clhodapp, yesterday at 4:02 PM (2 replies)

I know it comes off as pedantic to point this out, but those are open-weight models, not open-source models.

Closed weight models are the equivalent of SaaS. Open weight models are the equivalent of binary driver blobs or Windows software. We don't really have actual open source LLMs, which would need to publicly release their training data and technique so you could train a similar model yourself, or use their work as a baseline for your own model.

This distinction matters because an actual open source LLM would be extremely important from an ecosystem point of view, if someone ever actually released one.


Replies

NitpickLawyer, yesterday at 4:52 PM

I know this is highly contested, but I'll try explaining it anyway, because I keep seeing this and it's ... wrong.

Your comment is wrong both theoretically and practically.

First, the theory. The idea that model weights are "binary driver blobs" is technically wrong. I don't know why this is so common on a technical site, but anyway. An LLM consists of three main parts: the architecture, the inference code, and some values (the weights). All of these, combined, make an LLM.

Another important aspect, one that is widely misunderstood and will become apparent later, is that a model is created by deciding the architecture and then initialising it with some values. Those values can be all 0s, all 1s, or random (in practice they're random, but that's irrelevant). Technically, once a model is initialised, that's it. That is a model. If released, that would be, even for the most pedantic absolutists, undoubtedly open source.

Then, that model is adapted. The most important thing to understand here is that this is the preferred way of modifying a model. Actually, the only way. You can't (yet) come along later and decide to change something in the architecture. You can only change the values. That process is called training (pre, mid, post, etc). The technical process itself is the same for the model creators as it is for you. What differs is the means, the know-how, etc.
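To make that concrete, here is a minimal sketch, assuming a toy one-feature linear model in place of an LLM. The names `forward`, `train`, and `init`, the data, and the hyperparameters are all hypothetical. The only point it illustrates is that training rewrites the values while the structure stays exactly as it was.

```python
def forward(params, x):
    # "Architecture" plus "inference code": the structure y = w*x + b is fixed.
    return params["w"] * x + params["b"]


def train(params, data, lr=0.1, epochs=500):
    # Plain gradient descent on mean squared error.
    # Only the numbers stored under "w" and "b" are updated.
    params = dict(params)  # copy, so the initial values stay untouched
    n = len(data)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            err = forward(params, x) - y
            grad_w += 2 * err * x
            grad_b += 2 * err
        params["w"] -= lr * grad_w / n
        params["b"] -= lr * grad_b / n
    return params


# An initialised model: same architecture, placeholder values.
init = {"w": 0.0, "b": 0.0}

# Toy data drawn from y = 2x + 1 (an assumption for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

trained = train(init, data)
```

Both `init` and `trained` run through the same `forward`; they differ only in the stored numbers, which is the sense in which editing the values is the way to modify the model.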

Now, what licensing does, and the only thing that licensing can do is to give you rights to inspect, modify and release that model. That's it. A license will never give you (it cannot) the right to have the internal IP, knowledge, know-how or the "why's" on how the model was edited. That's on you. You have the right to modify, but you can't get the right to know how others have modified it, from a license file. Never had, never will.

(A simplified version of this is to think about an algorithm to control a drone. Usually that'd be a PID controller. Imagine someone releases such an algorithm under an open source license. That algorithm consists of architecture, loop code, and some values. Even if those values are all set to 0.5 (in which case your drone might crash) or any other values, the values themselves do not change the status of the code. It's still open source, even if the values are fixed, or random, or dreamt up by the original coder, or received from the aliens themselves.)
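The drone analogy can be sketched roughly like this, assuming a textbook PID loop. The class name, field names, and default gains are hypothetical and not taken from any real flight stack.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    # The "values": tunable gains. 0.5 everywhere mirrors the example above
    # and is not a sensible tuning for any real drone.
    kp: float = 0.5
    ki: float = 0.5
    kd: float = 0.5
    # Internal loop state, not part of the released "values".
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        # The "loop code": identical no matter which gains are plugged in.
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Same code, different values: only the numbers change.
default_pid = PIDController()
retuned_pid = PIDController(kp=2.0, ki=0.0, kd=0.0)
```

Whoever receives this file can retune `kp`, `ki`, and `kd` freely. How the original author arrived at their numbers is not something the code (or its license) carries.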

I mentioned above that editing the values of a model is the preferred way of modifying the model, and that's exactly what Apache 2.0 defines as "source code".

> "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

----

Now, the practice. In practice, we do have fully open (open data, open training code, open source models) models. Apertus, from Switzerland and Olmo from the US. Don't get me wrong, it's absolutely great that we have these models, they are very important for the community, and they do help inform everyone about what works, what doesn't, and so on. But ... no-one uses them. Because they are not at the top, compared to other models.

And, on a technical note, the idea that "dataset" + training code = bit-for-bit recreation is also not true. Anyone that has done any large scale training can tell you that. Between the randomness inherent in the process, the occasional training run re-starts and so on, you will never get the same model twice (at reasonable scales), even if you'd have the available compute. Which, let's be serious, no-one at home has. So... yeah. It's a pointless aspect to care for anyway.
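One small, illustrative contributor to that (my assumption about what falls under "randomness inherent in the process", and far from the whole story) is that floating-point addition is not associative, so summing the same numbers in a different order can change the result. A minimal sketch in plain Python:

```python
# Three ordinary double-precision values.
a, b, c = 0.1, 0.2, 0.3

# Mathematically identical sums, grouped differently.
left = (a + b) + c
right = a + (b + c)

# The two groupings do not produce the same float.
same = left == right
difference = abs(left - right)
```

Here `same` ends up `False`. In a large training run, parallel reductions may group additions differently from one run to the next, and tiny discrepancies like this one compound over many steps.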

yogthos, yesterday at 4:44 PM

There are absolutely fully open source models. These are not frontier models, but they very much do exist. OLMo is one of the models explicitly mentioned as having passed the OSI's validation phase. Pythia was also validated by the OSI as meeting its requirements for an open-source AI system. Lucie-7B, a multilingual model, is one of the first LLMs compliant with the OSI AI definition. Its creators explicitly state that the training dataset, data preparation code, and model weights are all publicly available under open licenses.