
Python 'Chardet' Package Replaced With LLM-Generated Clone, Re-Licensed

(Friday March 06, 2026 @05:00PM (BeauHD) from the laundered-via-LLM dept.)


Ancient Slashdot reader [1]ewhac writes:

> The maintainers of the Python package [2]`chardet`, which attempts to automatically detect the character encoding of a string, announced the release of version 7 this week, claiming a speedup factor of 43x over version 6. In the release notes, the maintainers claim that version 7 is "a ground-up, MIT-licensed rewrite of chardet." Problem: The putative "ground-up rewrite" is actually the result of running the existing copyrighted codebase and test suite through the Claude LLM. In so doing, the maintainers claim that v7 now represents a unique work of authorship, and therefore may be offered under a new license. Versions 6 and earlier were licensed under the [3]GNU Lesser General Public License (LGPL). Version 7 claims to be available under the [4]MIT license.

>

> The maintainers appear to be claiming that, under the [5]Oracle v. Google decision, which found that cloning public APIs is fair use, their v7 is a fair use re-implementation of the `chardet` public API. However, there is no evidence to suggest their rewrite was conducted under "clean room" conditions, which have traditionally shielded cloners from infringement suits. Further, the copyrightability of LLM output has yet to be settled. Recent court decisions seem to favor the view that LLM output is [6]not copyrightable, as the output is not primarily the result of human creative expression -- the endeavor copyright is intended to protect. Spirited discussion has ensued in [7]issue #327 on `chardet`'s GitHub repo, raising the question: Can copyrighted source code be laundered through an LLM and come out the other end as a fresh work of authorship, eligible for a new copyright, copyright holder, and license terms? If this is found to be so, it would allow malicious interests to completely strip-mine the Open Source commons, and then sell it back to the users without the community seeing a single dime.



[1] https://slashdot.org/~ewhac

[2] https://github.com/chardet/chardet/

[3] https://en.wikipedia.org/wiki/GNU_Lesser_General_Public_License

[4] https://en.wikipedia.org/wiki/MIT_License

[5] https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.

[6] https://yro.slashdot.org/story/26/03/03/0545246/ai-generated-art-cant-be-copyrighted-after-supreme-court-declines-to-review-the-rule

[7] https://github.com/chardet/chardet/issues/327
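For readers unfamiliar with the package at issue: the summary says `chardet` guesses the character encoding of raw bytes. A minimal sketch of that *kind* of task, using only the standard library, might look like the following. This is emphatically not chardet's actual algorithm (which uses statistical language/encoding models); it just illustrates the problem space with BOM sniffing and trial decoding.

```python
# Illustrative sketch only -- NOT chardet's algorithm.
# Strategy: check for a Unicode byte-order mark first, then fall back
# to trial-decoding under a few candidate encodings in priority order.

def sniff_encoding(data: bytes) -> str:
    # Byte-order marks identify a Unicode encoding unambiguously.
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No BOM: accept the first candidate that decodes cleanly.
    for candidate in ("ascii", "utf-8", "cp1252"):
        try:
            data.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    return "latin-1"  # latin-1 never fails: every byte maps to a code point

print(sniff_encoding(b"hello"))                 # ascii
print(sniff_encoding("héllo".encode("utf-8")))  # utf-8
```

The real chardet instead scores byte sequences against per-encoding statistical models and returns a confidence value alongside the guess, which is what makes the 43x speedup claim for v7 notable.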



Why not apply this to code as well? (Score:5, Insightful)

by FictionPimp ( 712802 )

The U.S. Supreme Court has ruled that AI-generated artwork cannot be copyrighted because it lacks human authorship, reaffirming that copyright law requires works to be created by humans. This decision follows a case involving Stephen Thaler's AI-generated artwork, which was denied copyright protection by the U.S. Copyright Office.

Re:Why not apply this to code as well? (Score:4, Interesting)

by Uninvited Guest ( 237316 )

> The U.S. Supreme Court has ruled that AI-generated artwork cannot be copyrighted because it lacks human authorship, reaffirming that copyright law requires works to be created by humans. This decision follows a case involving Stephen Thaler's AI-generated artwork, which was denied copyright protection by the U.S. Copyright Office.

*effectively ruled. The SCOTUS declined to take the appeal, leaving in place the lower appeals court ruling. The ruling doesn't specifically include source code, but there's nothing in the ruling (or in copyright law) to suggest an exception for AI-generated source code. It sure sounds like chardet v7 is in the public domain from creation, and cannot be restricted by any license.

Re:Why not apply this to code as well? (Score:5, Informative)

by DarkOx ( 621550 )

This isn't really the same question though.

Now that ruling may imply the newly generated library can't be licensed at all because it can't be copyrighted.

However the question is can you tell an LLM to re-implement some other IP and then claim that does not infringe on the original / isn't subject to its license.

I honestly don't understand why it would not come back to the same rulings that have been made about clean room implementations in the past and the fact that you can't copyright an interface.

In the closed source case where we can assume the original could not have been in the training set, and you give claude nothing but the API doc and say make me a library that behaves exactly like this description - I don't see how that could infringe on the original.

On the other hand, if you provide the source to the original, I don't see how it couldn't. Just like if I renamed all the characters in Harry Potter, and used a thesaurus to replace every fifth word, J.K. Rowling would probably have little trouble suing me.

In the FOSS case, where it's downright probable the model was trained on the source, we are back to the unsettled questions of how much of the original content survives, how likely the model is to generate outputs that don't materially differ from the original, and the usual case-by-case disputes about when something is materially different...

Re: (Score:2)

by karmawarrior ( 311177 )

One complication: while the maintainers maintain that they didn't expose Claude to the original source code and specifically told Claude not to look at it, is that really meaningful, given the degree to which LLMs are being trained on GitHub? Claude has, in fact, seen the source code, and probably can't be relied upon to have "forgotten" what it looked like.

I don't think that's necessarily an insurmountable hurdle, but someone examining the code who finds sections that are substantially the

Re: (Score:2)

by karmawarrior ( 311177 )

> It sure sounds like chardet v7 is in the public domain from creation, and cannot be restricted by any license.

...but only if it can be proven it's not based on the GPL'd work. So basically the maintainers can't relicense it, they can either maintain it's genuinely clean-room in which case there is no copyright and they can't license it under MIT or anything else (and probably need to warn potential users of potential issues applying copyright protection to software that uses it), or they can admit its

Re: (Score:2)

by ceoyoyo ( 59147 )

Why would they need to warn anybody? If the code is public domain someone can use it freely.

Looking at who the top maintainers are, I suspect the goal here was to remove the restrictions imposed by the GPL so the software could be easily used in closed source programs from their employers. The MIT license allows that. Public domain even more so.

Re: (Score:2)

by EvilSS ( 557649 )

The problem with that is that no one in Congress reads the /. summary.

Re: (Score:2)

by Ksevio ( 865461 )

That's true, but I'm not sure how much that applies. There was undoubtedly human input and code is a bit different from artwork. If you take something that doesn't have copyright applied to it and then modify it, it seems like it would be able to be licensed

Re: (Score:2)

by Kisai ( 213879 )

This absolutely will apply to code, however I feel the question of "laundered" is what is really important here.

If one can merely launder a work through an LLM, to strip it of copyright, won't everyone do that to every creative work out there to effectively end copyright?

What's to stop people from making ML "covers" of music? How is this different from code? (Before you ask, yes, people have already been doing this for both the musical and lyric component of songs, making AI generated tracks of artists who

Feels wrong (Score:4, Interesting)

by liqu1d ( 4349325 )

I don't have a legal basis for this argument but it seems wrong to change the license on the project in such a manner. If the owner wanted to change it then it's fine but it appears the current maintainer isn't the owner. If they truly believe it's legally unique then it should be created under its own repo and stop providing updates to the GPL one.

Goose / Gander (Score:2)

by adrn01 ( 103810 )

Why not just take their V7 and run it through Claude again, declare it to be V8, and release under the original copyright?

Is there any reason to believe that the code would be *identical* after that second pass?

Re: Goose / Gander (Score:3)

by liqu1d ( 4349325 )

If I blend something twice is it not the same ingredients?

Re: Goose / Gander (Score:2)

by Tomahawk ( 1343 )

It likely wouldn't be. Whatever random elements are thrown into the mix as it passes through the LLM would guarantee the generated code would be different.

Not clean room (Score:4, Insightful)

by F.Ultra ( 1673484 )

This is clearly not a clean room since the LLM was trained on the copyrighted source code, and it also is not just a reimplementation of the API, so Oracle v Google does not apply.

Re: (Score:2)

by aaronb1138 ( 2035478 )

Not hard to make it to a clean room implementation though with AI agents. You follow the same steps:

1st AI gets the code and extrapolates a complete and fully nuanced functional spec.

2nd AI gets the clean functional spec and writes net new code without ever having seen the original code.

(largely unnecessary / could be 1st) 3rd AI performs parallel unit and functional testing against both versions of code and feeds back a list of exceptions and revisions for 2nd AI to make to net new.

If you want extra sanit

Re: Not clean room (Score:2)

by Baloroth ( 2370816 )

I don't think you understand the OP's point. LLMs are trained on everything the model creator can get their hands on. That means all Internet-available open source (and much non-open source) code, including chardet. An existing AI can't perform a clean-room implementation because even if you don't show it the code, *it's already seen it*. And since the training data is encoded in the weights in a very non-trivial manner, you can't specifically remove a set of training data after the fact. You'd have to train

Re: Not clean room (Score:2)

by topham ( 32406 )

Copyright isn't that infectious.

You'd never be allowed to write a book after university, as you would have been exposed to too much material, if that were the case.

Re: (Score:2)

by aaronb1138 ( 2035478 )

Take this training dataset and sanitize it of all references and code from that project and dataset over there. Gotcha, right. Then use the resulting dataset to train a new functionally equivalent model with 0.0000000001% of its training data missing.

We'll call that a job for the 0th AI Agent.

Re: Not clean room (Score:2)

by topham ( 32406 )

Clean room implementation is *not* required.

The idea of a clean room implementation is to remove the possibility of the resulting code being in violation, however, that's not actually required to avoid being in violation. That just makes it much easier to show good faith.

If you implement a test suite, and then have the AI generate a version that complies with the test suite, it's entirely possible you are not in violation.

Semantic code validation would be prudent, but not necessarily required.

(Replace all v
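The test-suite approach described in the comment above amounts to differential testing: treat the original implementation as an oracle and check that the regenerated version agrees on a corpus of inputs. A hedged sketch, where `old_detect` and `new_detect` are hypothetical stand-ins (not chardet's real internals):

```python
# Hypothetical differential-test harness. old_detect is the oracle
# (stand-in for the original LGPL code); new_detect stands in for the
# regenerated implementation. Neither is real chardet code.

def old_detect(data: bytes) -> str:
    return "utf-8" if data.startswith(b"\xef\xbb\xbf") else "ascii"

def new_detect(data: bytes) -> str:
    if data[:3] == b"\xef\xbb\xbf":
        return "utf-8"
    return "ascii"

def differential_test(corpus):
    """Return the inputs on which the two implementations disagree."""
    return [d for d in corpus if old_detect(d) != new_detect(d)]

corpus = [b"plain", b"\xef\xbb\xbfbom", b""]
assert differential_test(corpus) == []  # behaviorally equivalent on this corpus
```

Note that passing such a harness shows behavioral equivalence on the corpus, which is exactly why it supports the "complies with the test suite" argument without saying anything about whether the text of the code is a derivative work.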

Updated licensing terms required.. (Score:2)

by willy_me ( 212994 )

Looks like this demonstrates a need for updating licensing terms for open source code. Can this code be used as input to an LLM? If so, is the resulting code limited to a specific license? I can see things such as a "Claude-GNU LLM" being released where the LLM can output GNU-licensed code. This would be guaranteed by only using training material licensed to allow for it.

Re: Updated licensing terms required.. (Score:2)

by ByTor-2112 ( 313205 )

This is very insightful, and I hope smart people have already been working on it.

No. (Score:5, Informative)

by Local ID10T ( 790134 )

> Can copyrighted source code be laundered through an LLM and come out the other end as a fresh work of authorship, eligible for a new copyright, copyright holder, and license terms?

That is simply creating a derivative work. Derivative works generally are infringing (various exceptions exist: fair use, etc.).

> The maintainers appear to be claiming that, under the Oracle v. Google decision, which found that cloning public APIs is fair use, their v7 is a fair use re-implementation of the `chardet` public API.

This is a misrepresentation of the finding in Oracle v. Google. The finding was that APIs are not subject to copyright because they are statements of facts (e.g. function "blah" takes input integer, returns character) and are intentionally published for interoperability (like listing phone numbers in a phone book so that they can be called). How the underlying code is implemented is a separate issue.

Code may not be subject to copyright if the function can only be implemented in a particular way. If there are many ways to do a thing, then the particular way it is done may be copyrighted -and a different way of doing it would not be infringing. Either creating an different way of doing the thing if the original way is known, or by "clean-rooming" -creating a way of doing a thing knowing only the specifications would not be infringing on the copyright as similarity could be attributed to obviousness of the method and copyright protects creative expression.

Re: (Score:2)

by Sloppy ( 14984 )

> Code may not be subject to copyright if the function can only be implemented in a particular way.

Yeah, that's what patents are for!

Code translation == plagiarism (Score:2)

by Tomahawk ( 1343 )

Surely this pretty much falls under the same plagiarism rules and laws as translating code from one language to another, no?

Windows (Score:2)

by Tomahawk ( 1343 )

I took an illegally obtained copy of the source code to Windows. I put it through an LLM so that it would generate new source code. It does exactly what Windows does, but it's different code. I even had the LLM convert it to Pascal. I'm making this new source code available under an MIT licence. It's fine because it's all original work. /s

Re: (Score:2)

by jpatters ( 883 )

You may not even need the source code, it is likely that AI models are pretty close to being able to produce a specification from binaries.

And suddendly most python users... (Score:2)

by LordHighExecutioner ( 4245243 )

pip uninstall chardet

Higher accuracy (Score:2)

by spitzak ( 4019 )

The release notes claim the new version has higher accuracy, meaning it returns different (better) answers for some input strings. It seems to me that its training can't be limited to the old code in order to achieve this. Still, I agree that if they fed the code to a program and told it "write a better version" then the copyright of the original code still applies. They may also have just generated a lot of strings and in some (many) cases fed them to the old program and told the AI to make a program that pr

Take this just one small step further... (Score:1)

by djshaffer ( 595950 )

If this is legal, I can train an LLM on not the source, but instead the binary code, of an existing program and create a non-infringing clone. Open source clone of Microsoft Word, anyone?

Clean Room? (Score:2)

by jpatters ( 883 )

One can contemplate that it would be possible to do some sort of "clean room" implementation where you input some source code (or even an executable) to one AI system that then outputs a specification, and then feed the specification to a different AI system to produce a new source code output. However, the result shouldn't be copyrightable at all because it is not the result of human authorship.
