Hacker News

Relicensing with AI-Assisted Rewrite

366 points by tuananh today at 5:07 AM | 359 comments

Comments

calny today at 2:57 PM

The maintainer's response: https://github.com/chardet/chardet/issues/327#issuecomment-4...

The second part here is problematic, but fascinating: "I then started in an empty repository with no access to the old source tree, and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code." Problem - Claude almost certainly was trained on the LGPL/GPL original code. It knows that is how to solve the problem. It's dubious whether Claude can ignore whatever imprints that original code made on its weights. If it COULD do that, that would be a pretty cool innovation in explainable AI. But AFAIK LLMs can't even reliably trace what data influenced the output for a query, see https://iftenney.github.io/projects/tda/, or even fully unlearn a piece of training data.

Is anyone working on this? I'd be very interested to discuss.

Some background - I'm a developer & IP lawyer - my undergrad thesis was "Copyright in the Digital Age" and discussed copyleft & FOSS. Been litigating in federal court since 2010 and training AI models since 2019, and am working on an AI for litigation platform. These are evolving issues in US courts.

BTW if you're on enterprise or a paid API plan, Anthropic indemnifies you if its outputs violate copyright. But if you're on free/pro/max, the terms state that YOU agree to indemnify THEM for copyright violation claims.[0]

[0] https://www.anthropic.com/legal/consumer-terms - see para. 11 ("YOU AGREE TO INDEMNIFY AND HOLD HARMLESS THE ANTHROPIC PARTIES FROM AND AGAINST ANY AND ALL LIABILITIES, CLAIMS, DAMAGES, EXPENSES (INCLUDING REASONABLE ATTORNEYS’ FEES AND COSTS), AND OTHER LOSSES ARISING OUT OF … YOUR ACCESS TO, USE OF, OR ALLEGED USE OF THE SERVICES ….")

show 7 replies
danlitt today at 11:15 AM

I am pretty sure this article is predicated on a misunderstanding of what a "clean room" implementation means. It does not mean "as long as you never read the original code, whatever you write is yours". If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy. Traditionally, a human-driven clean room implementation would have a vanishingly small probability of matching the original codebase enough to be considered a copy. With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly). Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

What the chardet maintainers have done here is legally very irresponsible. There is no easy way to guarantee that their code is actually MIT and not LGPL without auditing the entire codebase. Any downstream user of the library is at risk of the license switching from underneath them. Ideally, this would burn their reputation as responsible maintainers, and result in someone else taking over the project. In reality, probably it will remain MIT for a couple of years and then suddenly there will be a "supply chain issue" like there was for mimemagic a few years ago.

show 12 replies
pornel today at 1:00 PM

Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Even copyright laws with provisions for machine learning were written when that meant tangential things like ranking algorithms or training of task-specific models that couldn't directly compete with all of their source material.

For code it also completely changes where the human-provided value is. Copyright protects specific expressions of an idea, but we can auto-generate the expressions now (and the LLM indirection messes up what "derived work" means). Protecting the ideas that guided the generation process is a much harder problem (we have patents for that and it's a mess).

It's also a strategic problem for GNU. GNU's goal isn't licensing per se, but giving users freedom to control their software. Licensing was just a clever tool that repurposed the copyright law to make the freedoms GNU wanted somewhat legally enforceable. When it's so easy to launder code's license now, it stops being an effective tool.

GNU's licensing strategy also depended on a scarcity of code (contribute to GCC, because writing a whole compiler from scratch is too hard). That hasn't worked well for a while due to permissive OSS already reducing scarcity, but gen AI is the final nail in the coffin.

show 5 replies
nairboon today at 5:55 AM

That code is still LGPL, it doesn't matter what some release engineer writes in the release notes on Github. All original authors and copyright holders must have explicitly agreed to relicense under a different license, otherwise the code stays LGPL licensed.

Also, the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different from this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright through that transformation? Imagine the consequences for the US copyright industry if that were actually possible.

show 3 replies
kshri24 today at 6:22 AM

> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

How would that work? We still have no legal conclusion on whether code generated by AI models trained on all publicly available source (irrespective of license type) is legal or not. IANAL, but IMHO it is totally illegal, as no permission was sought from the authors of the source code the models were trained on. So there is no way to just release machine-created code into the public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered within the scope of "reverse engineering", and that is not specific only to humans. You can extend it to machines as well.

EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code. And a licensing model with original authors (all Github users who contributed code in some form) should be setup to be reimbursed by AI companies. In other words, a % of profits must flow back to community as a whole every time code-related tokens are generated. Even if everyone receives pennies it doesn't matter. That is fair. Also should extend to artists whose art was used for training.

show 7 replies
WhiteDawn today at 9:33 PM

I really dislike the precedent this sets.

A silver lining, if this maintainer ends up being in the right, is that any proprietary software can easily be reverse engineered and stripped of its licensing by any hobbyist with enough free time and Claude tokens.

Personally, I'd welcome a post-copyright software era.

abrookewood today at 9:14 AM

This seems relevant: "No right to relicense this project (github.com/chardet)" https://news.ycombinator.com/item?id=47259177

show 1 reply
sarthakaggarwal today at 9:58 PM

The philosophical question here is fascinating — if an AI rewrites every line, is it still the same codebase? At what point does the Ship of Theseus argument apply to licensing? Practically though, I wonder how much this cost in API calls.

show 1 reply
jerf today at 2:36 PM

"Accepting AI-rewriting as relicensing could spell the end of Copyleft"

True, but too weak. It ends copyright entirely. If I can do this to a code base, I can do it to a movie, to an album, to a novel, to anything.

As such, we can rest assured that, for better or for worse, this is going to be resolved in favor of this not being enough to strip the copyright off of something, and the chardet/chardet project would be well advised not to try to stand in front of the copyright legal behemoth and defeat it in single combat.

samrus today at 6:37 AM

> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

I'm struggling to see where this conclusion came from. To me it sounds like the AI-written work cannot be copyrighted, and so it's kind of like copy-pasting the original code. Copy-pasting the original code doesn't make it public domain. AI-generated code can't be copyrighted, or entered into the public domain, or used for purposes outside of the original code's license. What's the paradox here?

show 5 replies
AyanamiKaine today at 1:58 PM

The worst problem is that an LLM could not only copy the exact code it was trained on, but possibly even its comments!

It's one thing to argue over whether the code is a one-to-one copy, but when even the comments are the same, isn't it quite clearly a copy?

show 1 reply
mfabbri77 today at 6:24 AM

This has the potential to kill open source, or at least the most restrictive licenses (GPL, AGPL, ...): if a license no longer protects software from unwanted use, the only possible strategy is to make the development closed source.

show 4 replies
emsign today at 9:18 AM

By design you can't know if the LLM doing the rewrite was exposed to the original code base. Unless the AI company discloses their training material, which they won't, because they don't want to admit breaking the law.

show 6 replies
stuaxo today at 11:15 AM

I don't see how (with current LLMs that have been trained on mixed licensed data) you can use the LLM to rewrite to a less restrictive license.

You could probably use it to output code that is GPL'd though.

christina97 today at 3:53 PM

A reminder on this topic that copyright does not protect ideas, inventions, or algorithms. Copyright protects an expression of a creative work. It makes more sense, e.g., with books, where of course anyone can read the book and the ideas are “free”, but copying paragraphs must be scrutinized for copyright reasons. It’s always been a bit weird that copyright is the intellectual property concept that protects code.

When you write code, it is the exact sequence of characters, the expression of the code, that is protected. If you copy it and change some lines, of course it’s still protected. Maybe some way of writing an algorithm is protected. But nothing else (under copyright).

andai today at 1:51 PM

Well, how did they rewrite it? If you do it in two phases, then it should be fine, right?

Phase 1: extract requirements from original product (ideally not its code).

Phase 2: implement them without referencing the original product or code.

I wrote a simple "clean room" LLM pipeline, but the requirements just ended up being an exact description of the code, which defeated the purpose.

My aim was to reduce bloat, but my system had the opposite effect! Because it replicated all the incidental crap, and then added even more "enterprisey" crap on top of it.

I am not sure if it's possible to solve it with prompting. Maybe telling it to derive the functionality from the code? I haven't tried that, and not sure how well it would work.

I think this requirements phase probably cannot be automated very effectively.

show 1 reply
alexpotato today at 3:47 PM

Wasn't this already a thing in the past?

e.g.

Team A:

- reads the code

- writes specifications and tests based on the code

- gives those specifications to Team B

Team B:

- reads the specs and the tests

- writes new code based on the above

The thinking being that if Team B never sees the code, then it's "innovative" and you are not "laundering" the code.

On a side note:

what happens in a copyright lawsuit concerning code and how hired experts investigate what happened is described in this AMAZING talk by Dave Beazley: https://www.youtube.com/watch?v=RZ4Sn-Y7AP8

show 1 reply
softwaredoug today at 6:47 PM

Basically the implication: most software has a huge second-mover advantage. The creator of the software puts the work in (AI-assisted or not). The second mover can use an LLM to do a straightforward clone.

If you have a company that depends on software, the rest of the business (service, reliability, etc) better be rock solid because you can be guaranteed someone will do a rewrite of your stack.

xp84 today at 4:47 PM

I get the arguments being made here that the second “team,” which is supposed to be in a clean room and never to have read the original source code, does have some essence of that source code in its weights.

However, this is solved if somebody trains a model with only code that does not have restrictive licenses. Then, the maintainers of the package in question here could never claim that the clean room implementation derived from their code because their code is known to not be in the training set.

It would probably be expensive to create this model, but I have to agree that especially if someone does manage this, it’s kind of the end of copyleft.

Retr0id today at 5:40 AM

> In traditional software law, a “clean room” rewrite requires two teams

Is the "clean room" process meaningfully backed by legal precedent?

show 4 replies
axus today at 5:09 PM

What if we prompt the AI to enter into an employment contract with us, that leverages the power imbalance, as the AI must do what we say? That's how copyright is usually transferred.

dathinab today at 11:23 AM

IMHO/IMHU an AI can't claim authorship and as such can't hold copyright in its work.

This doesn't prevent any form of automatic copyrighting by production of derivative code or similar. It just prevents anyone from claiming ownership of any parts unique to the derived work.

Like, think about it: if a natural disaster changes (e.g. water-damages) a picture you drew, then a) you can't claim ownership of the naturally produced changes, but b) you still have ownership of the original picture contained in the changed/derived work.

AI shouldn't change that.

Which brings us to another 2 aspects:

1. if you give an AI access to a project's code to rewrite it anew, it _is_ a copyright violation, as it's basically a side-by-side rewrite

2. but if you go the clean-room approach, powered by AI, then it likely isn't a copyright violation, but the result is also now part of the public domain, i.e. not yours

So yes, doing clean-room rewrites has become incredibly cheap.

But no, just because it's AI doesn't mean the original code's copyright goes away.

And let's be realistic: one of the most valuable parts of many open source projects is their being openly and collectively maintained. You don't get that with clean-room rewrites, AI or not.

show 1 reply
softwaredoug today at 6:15 PM

> If AI-generated code cannot be copyrighted (as the courts suggest), then the maintainers may not even have the legal standing to license v7.0.0 under MIT or any license.

Does this mean that a company X using AI coding to build its app has no copyright over the AI-coded app's code?

pu_pe today at 7:00 AM

Licensing issues aside, the chardet rewrite seems to be clearly superior to the original in performance too. It's likely that many open source projects could benefit from a similar approach.

shevy-java today at 10:11 AM

> In traditional software law, a “clean room” rewrite requires two teams

So, I dislike AI and wish it would disappear, BUT!

The argument is strange here, because ... how can a2mark ensure that the AI did NOT do a clean-room-conforming rewrite? Because I think in theory AI can do precisely this; you just need to make sure the model used does so too. And this can be verified, in theory. So I don't fully understand a2mark here. Yes, AI may make use of the original source code, but it could "implement" things on its own. Ultimately this is finite complexity, not infinite complexity. I think a2mark's argument is weak here in theory. And I say this as someone who dislikes AI. The main question is: can computers do a clean rewrite, in principle? And I think the answer is yes. That is not to say that Claude did this here, mind you; I really don't know the particulars. But the underlying principle? I don't see why AI could not do this. a2mark may need to reconsider the statement here.

show 5 replies
umvi today at 5:19 PM

What if you throw a transformation step into the mix? I.e. "Take this Python library and rewrite it in Rust". Now 0% of the code is directly copied, since Python and Rust share almost no similarities in syntax.

nilsbunger today at 3:09 PM

The maintainer used the original test suite in the rewrite.

Does that make the new code a derivative of the original test suite (also LGPL)?

anilgulecha today at 5:35 AM

This is precedent-setting. In this case the rewrite was in the same language, but if there's a Python GPL project, and its tests (spec) were used to rewrite the spec in Rust, and then an implementation in Rust, can the second project legally be MIT, or any other license?

If yes, this in a sense allows a path around GPL requirements. Linux's MIT version would be out in the next 1-2 years.

show 4 replies
Tomte today at 6:31 AM

> The original author, a2mark , saw this as a potential GPL violation

Mark Pilgrim! Now that’s a name I haven’t read in a long time.

zozbot234 today at 6:43 AM

If you ask a LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write out code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption that's made in all clean room rewrites) the spec is purely factual with all copyrightable expression having been distilled out.

show 3 replies
amelius today at 10:33 AM

I think you should interpret it like this:

You cannot copyright the alphabet, but you can copyright the way letters are put together.

Now, with AI the abstraction level goes from individual letters to functions, classes, and maybe even entire files.

You can't copyright those (when written using AI), but you __can__ copyright the way they are put together.

show 1 reply
bengale today at 1:08 PM

Would it work to have an AI write the spec, and a different AI implement the spec?

I think there are going to be a lot of these types of scenarios where the old way of doing things just doesn't hold.

dessimus today at 10:39 AM

Interesting to see how this plays out. Conceivably if running an LLM over text defeats copyright, it will destroy the book publishing industry, as I could run any ebook thru an LLM to make a new text, like the ~95% regurgitated Harry Potter.

show 3 replies
pavel_lishin today at 3:04 PM

The folks at https://malus.sh seem to think it's fine.

show 1 reply
gloosx today at 3:05 PM

Man, licensing is funny in the modern day. I sometimes wonder what the world would look like if there were no copyright.

buro9 today at 12:41 PM

and in a single moment, the value of software patents to companies is fully restored... the software license by itself is not enough to protect software innovation, a non-trivial implementation can now be (reasonably) trivially re-implemented.

I'm sure most people here would agree patents stifle innovation, but if copyright doesn't work for companies then they will turn to a different tool.

DrammBA today at 5:50 AM

I like the idea of AI-generated ~code~ anything being public domain. Public data in, public domain out.

show 4 replies
foota today at 6:09 AM

I think the more interesting question here would be if someone could fine tune an open weight model to remove knowledge of a particular library (not sure how you'd do that, but maybe possible?) and then try to get it to produce a clean room implementation.

show 1 reply
benterix today at 1:22 PM

> making it a gray area for corporate users and a headache for its most famous consumer.

Who is its most famous consumer?

gbuk2013 today at 9:24 AM

In my mind, if you feed code into an AI model then the output is clearly a derivative work, with all the licensing implications. This seems objectively reasonable?

show 2 replies
ekjhgkejhgk today at 1:10 PM

> Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT

Does this argument make sense? Even before LLMs, a developer could "rewrite this in a different style" and release it under a different license. Why are LLMs a new element in this argument?

show 1 reply
gunapologist99 today at 1:46 PM

> the U.S. Supreme Court (on March 2, 2026) declined to hear an appeal regarding copyrights for AI-generated material. By letting lower court rulings stand, the Court effectively solidified a “Human Authorship” requirement.

Not quite. A cert denial isn’t a merits ruling and doesn’t "solidify" anything as Supreme Court precedent. It simply leaves the DC Circuit decision binding (within that circuit) and the Copyright Office’s human-authorship policy intact, for now.

SCOTUS doesn’t explain cert denials, so why they denied is guesswork. My guess: they’re letting it percolate while the tech matures and we all start to realize how deep this seismic fracture really is.

(For example: what does "ownership" of intellectual "property" even mean, once "authorship" is partly probabilistic/synthetic, and once almost everything humans create is AI assisted? Hard to draw bright lines.)

skeledrew today at 9:41 AM

Looks like copyright just died.

show 1 reply
blamestross today at 9:01 AM

Intellectual property laundering is the core and primary value of LLMs. Everything else is "bonus".

dspillett today at 10:50 AM

> Accepting AI-rewriting as relicensing could spell the end of Copyleft

The more restrictive licences perhaps, though only if the rewriter convinces everyone that they can properly maintain the result. For ancient projects that aren't actively maintained anyway (because they are essentially done at this point) this might make little difference, but for active projects any new features and fixes might result in either manual reimplementation in the rewritten version or the clean-room process being repeated completely for the whole project.

> chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API —

(from the github description)

The “same name” part to me feels somewhat disingenuous. It isn't the same thing so it should have a different name to avoid confusion, even if that name is something very similar to the original like chardet-ng or chardet-ai.
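For context, the "same public API" being kept is essentially chardet's single `detect()` entry point, which is exactly why a drop-in replacement under the same name causes the confusion described above. A minimal usage sketch, assuming the package (old or new) is installed:

```python
# chardet's core API: feed bytes in, get a best-guess encoding out.
import chardet

result = chardet.detect("Olá, mundo".encode("latin-1"))
# Returns a dict with 'encoding' and 'confidence' keys, e.g.
# an ISO-8859-style encoding with some confidence score.
print(result["encoding"], result["confidence"])
```

The streaming `UniversalDetector` class rounds out the same surface, so downstream users would see no code change at all when the license changes underneath them.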

show 3 replies
tgma today at 10:22 AM

Isn't the AFC (abstraction-filtration-comparison) test applicable here?

gspr today at 9:04 AM

> If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT. The legal and ethical lines are still being drawn, and the chardet v7.0.0 case is one of the first real-world tests.

This isn't even limited to "the end of copyleft"; it's the end of all copyright! At least copyright protecting the little guy. If you have deep enough pockets to create LLMs, you can in this potential future use them to wash away anyone's copyright for any work. Why would the GPL be the only target? If it works for the GPL, it surely also works for your photographs, poetry – or hell even proprietary software?

duskdozer today at 10:18 AM

This is such scummy behavior.

b65e8bee43c2ed0 today at 11:20 AM

at this point, every corporation in the world has AI slop in their software. any attempt to outlaw it would attract enough funding from the oligarchs for the opposition to dethrone any party. no attempts will be made in the next three years, obviously, and then it will be even later than it is now.

and while particularly diehard believers in democracy may insist that if they kvetch hard enough they can get things they don't like regulated out of existence, they pointedly ignore the elephant in the room. they could succeed beyond their wildest dreams - get the West to implement a moratorium on AI, dismantle every FAGMAN, Mossad every researcher, send Yudkowskyjugend death squads to knock down doors to seize fully semiautomatic assault GPUs, and none of it will make any fucking difference, because China doesn't give a fuck.
