Boffins detail new method to make neural nets forget private and copyrighted info
(2025/09/05)
- Reference: 1757030369
- News link: https://www.theregister.co.uk/2025/09/04/boffins_detail_ai_mind_wipe/
- Source link:
Researchers have found promising new ways to make AI models forget copyrighted or private content, suggesting it may be possible to satisfy legal requirements without going through the lengthy and costly process of retraining models.
Training AI models requires huge quantities of data, which model-makers have acquired by scraping the internet without first asking for permission and by [1]allegedly knowingly downloading copyrighted books.
Those practices have seen model makers [2]sued in many copyright cases, and also raised eyebrows at regulators who wonder whether AI companies can comply with the General Data Protection Regulation right to erasure (often called the right to be forgotten) and the California Consumer Privacy Act right to delete.
The easiest way to address these issues is to retrain models without legally risky data, but that would require GPUs to work for tens of millions of hours at great expense, which isn’t going to happen.
Researchers have therefore [6]investigated more efficient methods to make models forget or unlearn information, ideally without [7]lobotomizing them in the process.
Many of these methods assume access to the original training data, something that Basak Guler, an assistant professor at the University of California, Riverside, warns isn't guaranteed.
"It might not always be possible to keep the original dataset right," she explained.
To address this problem, Guler, along with her colleagues Professor Amit Roy-Chowdhury, Ümit Yiğit Başaran, a doctoral student in electrical and computer engineering, and Sk Miraj Ahmed, a researcher at Brookhaven National Laboratory, developed a new, computationally efficient approach called source-free unlearning, which, crucially, doesn't require access to the original training data to statistically guarantee the removal of undesired information from a model.
The method, detailed in a recent [9]paper titled "A Certified Unlearning Approach without Access to Source Data," builds on prior unlearning techniques by using a surrogate dataset, a stand-in for the inaccessible original training data, to guide a single-step Newton update that modifies the model. That alone may not be enough to scrub the target data from the model, Guler notes, so a carefully calibrated amount of random noise is then added to ensure that the target information can't be reconstructed.
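For the curious, the broad recipe described above can be sketched in a few lines of Python: estimate the loss curvature (the Hessian) from a surrogate dataset, take one Newton step to cancel the forgotten points' contribution, then add noise. The ridge-regression setting, the function name, and the noise scale below are illustrative assumptions for a toy model, not the authors' implementation or their calibrated noise level; intuitively, the noise compensates for the Newton step being only approximate when the curvature comes from surrogate rather than original data.

import numpy as np

def newton_unlearn(theta, X_forget, y_forget, X_surrogate, lam=0.1, sigma=0.01, seed=0):
    """Approximately remove the influence of (X_forget, y_forget) from a
    ridge-regression parameter vector theta, using only a surrogate dataset
    to estimate the loss curvature. Illustrative sketch, not the paper's code."""
    rng = np.random.default_rng(seed)
    d = theta.shape[0]

    # Gradient of the squared-error loss contributed by the points to forget.
    grad_forget = X_forget.T @ (X_forget @ theta - y_forget)

    # Hessian estimated from the surrogate data plus L2 regularisation,
    # standing in for the inaccessible original training set.
    H = X_surrogate.T @ X_surrogate + lam * np.eye(d)

    # Single-step Newton update: nudge theta as if the forgotten points
    # had never been part of training.
    theta_unlearned = theta + np.linalg.solve(H, grad_forget)

    # Add Gaussian noise so the erased points cannot be reconstructed from
    # the updated parameters (sigma here is a placeholder, not the
    # calibrated value the paper derives).
    return theta_unlearned + rng.normal(0.0, sigma, size=d)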
In testing, the researchers showed the approach could achieve results comparable to full retraining while using a fraction of the computational power.
While any step towards efficiently removing private or copyrighted material from models is welcome, more work is needed to apply the researchers' findings to the kinds of large language models (LLMs) that power popular chatbot services and that have become the subject of numerous copyright [11]lawsuits.
[12]Biased bots: AI hiring managers shortlist candidates with AI resumes
[13]It looks like you're ransoming data. Would you like some help?
[14]Total recall: Mistral AI's Le Chat can now remember your conversations
[15]Goldman Sachs warns AI bubble could burst datacenter boom
So far, most of the UC Riverside researchers' work has focused on less complex machine learning models like classifiers, rather than full-blown LLMs.
"We don't claim that we have solved everything," Guler said, emphasizing that her team's research addresses just one piece of a much bigger puzzle."
The hope is that other researchers will be able to build on these findings. One area in particular requiring additional study is how best to design the surrogate dataset.
"There are all these interesting questions that we need to, one by one, address. I'm not expecting us to be able to answer all of them immediately, but I think it's a good first step," Guler said. ®
[1] https://www.theregister.com/2025/06/27/meta_llama_author_lawsuit/
[2] https://www.theregister.com/2025/08/26/perplexity_asahi_nikkei_lawsuits/
[6] https://www.theregister.com/2023/07/13/ai_models_forgotten_data/
[7] https://www.theregister.com/2019/07/15/ai_delete_data/
[9] https://arxiv.org/abs/2506.06486
[11] https://www.theregister.com/2025/08/26/perplexity_asahi_nikkei_lawsuits/
[12] https://www.theregister.com/2025/09/03/ai_hiring_biased/
[13] https://www.theregister.com/2025/09/03/ransomware_ai_abuse/
[14] https://www.theregister.com/2025/09/02/mistral_ais_le_chat_can/
[15] https://www.theregister.com/2025/09/02/goldman_sachs_ai_datacenters/
Pffft, easy!
Yorick Hunt
rm -rf /
Ahhh. The need to publish is still as strong as ever. Add in some keywords (AI, LLM, copyright) and we've got a guaranteed path to the paper mills.
Apologies for sounding a bit negative.
I think all rules regarding reversing entropy would be broken while trying to put the individual eggs back together from the scramble.
"Source free unlearning", "single-step Newton update", and a dash of random noise will do the trick. I'm off to the pub for my hallucinations.