New 'Open Source AI Definition' Criticized for Not Opening Training Data (slashdot.org)
- Reference: 0175383787
- News link: https://news.slashdot.org/story/24/11/03/0257241/new-open-source-ai-definition-criticized-for-not-opening-training-data
- Source link: https://slashdot.org/submission/17329255/community-commitment-to-open-source-definition
This move follows some [7]discussion on the Debian mailing list:
> Allowing "Open Source AI" to hide their training data is nothing but setting up a "data barrier" protecting the monopoly, disabling anybody other than the first party to reproduce or replicate an AI. Once passed, OSI is making a historical mistake towards the FOSS ecosystem.
They're not the only ones worried about data. This week [8]TechCrunch noted an [9]August study which "found that many 'open source' models are basically open source in name only. The data required to train the models is kept secret, the compute power needed to run them is beyond the reach of many developers, and the techniques to fine-tune them are intimidatingly complex. Instead of democratizing AI, these 'open source' projects tend to entrench and expand centralized power, the study's authors concluded."
[10]samj shares the concern about training data, arguing that training data is the source code and that this new definition [11]has real-world consequences. (On a personal note, he says it "poses an existential threat to our pAI-OS project at the non-profit Kwaai Open Source Lab I volunteer at, so we've been very active in pushing back past few weeks.")
And he also came up with a detailed response [12]by asking ChatGPT: What would be the implications of Debian disavowing the OSI's Open Source AI definition? ChatGPT composed a 7-point, 14-paragraph response, concluding that this level of opposition would "create challenges for AI developers regarding licensing. It might also lead to a fragmentation of the open-source community into factions with differing views on how AI should be governed under open-source rules." But "Ultimately, it could spur the creation of alternative definitions or movements aimed at maintaining stricter adherence to the traditional tenets of software freedom in the AI age."
However, the [13]official FAQ for the new Open Source AI definition argues that training data "does not equate to a software source code."
> Training data is important to study modern machine learning systems. But it is not what AI researchers and practitioners necessarily use as part of the preferred form for making modifications to a trained model.... [F]orks could include removing non-public or non-open data from the training dataset, in order to train a new Open Source AI system on fully public or open data...
>
> [W]e want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information — like decisions about their health. Similarly, much of the world's Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.
[14]Read on for the rest of their response...
[1] https://www.slashdot.org/~samj
[2] https://nm.debian.org/person/samj/
[3] https://slashdot.org/submission/17329255/community-commitment-to-open-source-definition
[4] https://news.slashdot.org/story/24/10/28/1811209/we-finally-have-an-official-definition-for-open-source-ai
[5] https://en.wikipedia.org/wiki/Debian_Free_Software_Guidelines
[6] https://opensourcedeclaration.org/
[7] https://lists.debian.org/debian-ai/2024/10/msg00149.html
[8] https://techcrunch.com/2024/10/28/we-finally-have-an-official-definition-for-open-source-ai/
[9] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4543807
[10] https://www.slashdot.org/~samj
[11] https://samjohnston.org/2024/10/22/debian-general-resolution-gr-drafted-opposing-osis-open-source-ai-definition-osaid/
[12] https://chatgpt.com/share/67175283-5b64-800f-8ba9-a2d883e26bc3
[13] https://hackmd.io/@opensourceinitiative/osaid-faq
[14] https://news.slashdot.org/story/24/11/03/0257241/new-open-source-ai-definition-criticized-for-not-opening-training-data#From_OSAID_FAQ
::slow clap:: (Score:5, Interesting)
The Silicon Valley fanboys perched firmly on the tip of software giants' smelly knobs have been unknowingly reinforcing the bullshit adoption of the term "open-source" as a marketing term to silence critics long enough to build a moat and charge everyone to cross it. Easy! Buy off the nerds with cheap furry hentai and fan art generators, and the promise of on-demand video game creation, buy off the marketers with cheap garbage SEO content generators, and buy off executives with low-cost, low-quality labor replacement subsidized by investors, and they'll start defending this shit like they built it themselves. Just wait for the real price tags to show up when you have to start paying for the exponentially larger amounts of electricity and data wrangling for product improvement.
The dirty secret of LLMs is the training data (Score:5, Insightful)
Most of it was scraped illegally without consent. Parts are illegal to even possess or continue to store after training. Parts of it would cause _huge_ liability issues if the ones it was stolen from find out. Most of it is crap.
There is really no surprise nobody wants to make their training data public.
Re: The dirty secret of LLMs is the training data (Score:2)
It would also really ruin the mystique if people could see the specific handful of human-made items that their prompts munge to compose "their" creation. I am extremely pro-gen-AI. This technology is amazing and is changing the world for the better. It's used in tooling that I use (with small, purpose-built, self-trained models) to save countless hours performing tedious, menial tasks. We've already seen novel cancer research boosted by it. Don't conflate that with vacuuming up the entire cre
Re: (Score:2)
Scraped illegally? From a PUBLIC website?
Public means public.
If they bypassed security to access a private site, that would be illegal.
Re: (Score:2)
Public does _not_ mean "take it, sell it, do whatever you want with it". Are you a moron?
Re: (Score:2)
Before I ramble on, I am curious, gweihir: do you condemn archive.org, the Internet Archive Wayback Machine, in the same way?
It seems they more than anyone "take it, sell it, do whatever [they] want with it"; and I think they are great and do not want them to stop!
I have never been convinced that large language models violate copyright.
I would say that "public" is at best only restricted to large exact copies, and maybe only then an attribution is needed to be kosher.
I learned to write essentially fro
Re: (Score:2)
> Before I ramble on, I am curious, gweihir: do you condemn archive.org, the Internet Archive Wayback Machine, in the same way?
Stop putting up strawmen. The Internet Archive falls under the "search" exception, which is in place because consent to be found can be assumed when things are placed online. The archive is a bit of a borderline case, but still covered by the law. Yes, there is a law in place to allow search engines and archives, and they would be illegal otherwise. I do suspect that the Internet Archive would be illegal if they put up a paywall though. And I do suspect that if a search engine put up that paywall, they would not
Re: (Score:2)
I appreciate your response.
It is interesting to assert that archive.org is a strawman, and I am more convinced than ever it is not; at minimum it helps narrow my understanding of your initial post.
Perhaps it is because I mostly use ChatGPT in place of Google now, it is simply a search engine to me with a better interface.
In addition, the Internet Archive is the most extreme example: it serves up entire web sites in whole; Google no longer does that (I miss the cached results), only in part, and ChatGPT and th
Re: (Score:2)
Most content available on public websites is under an "all rights reserved" licence. Even images on Wikipedia sometimes have a non-free licence, so you are not supposed to reuse them for commercial purposes (most LLMs are commercial).
Re: (Score:2)
Does that mean that a professional graphic designer is not allowed to look at them?
Re: (Score:2)
The law understands the difference between a person and a machine, even if you do not.
Re: The dirty secret of LLMs is the training data (Score:3)
Actually, the law doesn't. Not with respect to training data.
Re:LOL wut? (Score:2)
> Scraped illegally? From a PUBLIC website? Public means public. If they bypassed security to access a private site, that would be illegal.
Just say you don't understand how property rights work. You know, every time you walk past your neighbor's house and see something lying there in the easement between the sidewalk and his lawn, and just take it, because "IT'S NOT ON HIS PROPERTY!" Okies.
Re: (Score:2)
> There is really no surprise nobody wants to make their training data public.
That's not true, see for example [1]sigma.ai [sigma.ai] or [2]kaggle [kaggle.com].
How useful these data are compared to closed data (that possibly were scraped illegally) is a different matter entirely.
[1] https://sigma.ai/open-datasets/
[2] https://www.kaggle.com/datasets
Re: (Score:2)
These are not training datasets for general LLMs. Obviously, there are public datasets. Also, if you have a look at the datasets sigma.ai links (!), you will find that there are various usage limitations.
Re: (Score:2)
> There is really no surprise nobody wants to make their training data public.
But the OSI should have made two levels of licences (or more) like Creative Commons made (CC0, BY, NC, SA). At least academics are interested in benchmarking their developments against a number of specific datasets. It could be "the images uploaded to Wikimedia Commons by Nov 1 2024" or "the proceedings of the EU Parliament in its 24 human-generated translations, between two dates". Or it could point at a specific folder with terabytes of data that other academics would back up. Even though these could be c
Re: (Score:2)
> Most of it was scraped illegally without consent.. Parts are illegal to even possess or continue to store after training. Parts of it would cause _huge_ liability issues if the ones it was stolen from find out. Most of it is crap.
> There is really no surprise nobody wants to make their training data public.
Add to that the massaging they do to match whichever direction the creators lean on various topics.
Re: (Score:2)
> Most of it was scraped illegally without consent.
I mean, isn't whether it is illegal or not determined in courts, and on a case by case basis? IF that is correct, and the necessary cases have not pulled through with some kind of outcome yet, how can you or I say whether it was either illegal, OR legal?
Re: (Score:2)
In this case, it is illegal until a law is made that makes it legal.
!Intelligent (Score:3)
If the data is what matters instead of the code, that's "informed", not "intelligent".
Open weights (Score:2)
From the reasoning quoted in TFS:
> [W]e want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI.
That's fine. Just share it under a different moniker not diluting "open source". For instance, "open weights" seems to already be in use by quite a few people, and feels fairly descriptive of the actual situation.
> Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests.
That's fine. Similarly, both op
Re: (Score:2)
Open source should probably refer to the actual source code. Absolutely make up other names for other situations.
If someone writes a program that does spline interpolation and distributes the source, but uses hardcoded spline coefficients, is that not open source? Or a marching cubes implementation with its lookup table? Those coefficients are the same as (actually the same, not just analogous to) the weights in a deep learning model.
Software that depends on difficult to understand and reproduce ancillary data
When "open" doesn't really mean "open" (Score:2)
This nonsense is a convoluted definition that's been contrived to allow AI/LLM companies to cash in on the collective work of hundreds of millions of people and claim that what they're doing is open while not actually making it open in any functional sense. There are two reasons for this.
First: for the most part, the engines aren't all that important. Anyone with a modest understanding of language models, neural networks, inference engines, etc. can write their own, and many people are doing just that
Levels of Open (Score:2)
I of course asked ChatGPT what key components make up the creation and use of a model:
Training Data, Preprocessing and Data Pipeline, Training Configuration, Training Script, Model Checkpoints, Base Model (if applicable), Fine-tuning / Specialized Training, Trained Model, Inference Code, Deployment Pipeline, Evaluation and Testing Metrics, Post-processing.
I would say Training Data has been the most controversial aspect of AI creation, followed by censoring that may take place in a handful of the steps from
Re: (Score:2)
Well, I finally took the time to skim the summary....
It really feels like the word "Open" is lost, at least for AI; given OpenAI is not fully Open from what I understand.
If the Open Source Initiative (OSI), "a long-running institution aiming to define and “steward” all things open source", does not "properly" define Open AI as having an open data set, then perhaps it is time to move beyond "Open" and cut the legs out from underneath a seemingly corrupted organization.
I like the word Libre, how a
Re: Levels of Open (Score:3)
OpenAI has been in a flat-out dash to become as closed as possible for a long while, and they're speeding up, not slowing down. "Open" is short for "open your wallet."
Re: (Score:2)
What if we separate the AI engine / kernel and "the AI"? This would solve it. The engine may be open source, but the AI based on the open engine and closed training data is by no means "open".