Don't spill your guts to your chatbot friend - it'll hoover up that info for training
(2025/11/20)
- News link: https://www.theregister.co.uk/2025/11/20/experts_warns_house_of_privacy/
The US House of Representatives has heard that LLM builders can exploit users’ conversations for further training and commercial benefit with little oversight or concern for privacy risks.
As President Trump [1]seeks to prevent states from introducing and enforcing legislation governing the application of AI, Jennifer King, a privacy and data policy fellow at Stanford, said there is little to no transparency into how AI developers collect and process the data they use for model training.
On Tuesday, she [2]told the House Energy and Commerce Subcommittee on Oversight and Investigations that “we should not assume that they're taking reasonable precautions to prevent incursions into consumers' privacy. Users should not be automatically opted in to having their data used in model training, and developers should proactively remove sensitive data from training sets.”
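In software terms, King's recommendation amounts to making training use an explicit opt-in rather than a pre-checked default. Here is a minimal Python sketch of that default, with entirely hypothetical names (it implies nothing about any real product's settings):

```python
from dataclasses import dataclass

# Hypothetical settings object illustrating the default King recommends:
# conversations are excluded from training unless the user explicitly opts in.
@dataclass
class PrivacySettings:
    use_conversations_for_training: bool = False  # opt-in, never pre-checked

def may_train_on(settings: PrivacySettings) -> bool:
    """Gate any training-data export on the explicit consent flag."""
    return settings.use_conversations_for_training

print(may_train_on(PrivacySettings()))  # False: the safe default
print(may_train_on(PrivacySettings(use_conversations_for_training=True)))  # True only after consent
```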
Under current rules, there are no requirements for developers to understand the full data pipeline - “how it is cleaned, how we remove personal information from it, and then, how it is used again for retraining,” she said.
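To make the "cleaning" step concrete, here is a minimal sketch of the kind of personal-information scrub King says developers are under no obligation to perform, assuming a naive regex-based redactor. It is illustrative only; real pipelines, where they exist, are far more elaborate:

```python
import re

# Naive, illustrative PII scrubber for chat transcripts before retraining.
# These three patterns are assumptions for the sketch, not a complete list.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace obvious personal identifiers with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +1 (555) 867-5309."))
# -> Reach me at [EMAIL] or [PHONE].
```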
For a start, the data could be used for targeted advertising. Beyond that, there is no way of knowing how personal data will be used once it is fed into LLM training.
“From the study I did recently, we really don't understand right now to what extent the companies are potentially cleaning that data before it is used for retraining, and there is research demonstrating, including research by employees of the large companies, that chatbots can memorize training data,” she said.
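The memorization studies King alludes to typically probe a model with a prefix of a suspected training record and check whether it reproduces the remainder verbatim. A minimal sketch of such a probe, with a stubbed completion function standing in for any real chat API:

```python
# Minimal verbatim-memorization probe. `complete` is a placeholder for any
# text-completion call; no specific vendor SDK is assumed.
def verbatim_memorized(record: str, complete, prefix_frac: float = 0.5) -> bool:
    split = int(len(record) * prefix_frac)
    prefix, truth = record[:split], record[split:]
    continuation = complete(prefix, max_tokens=len(truth))
    # Compare only a leading window to tolerate trailing divergence.
    return continuation.strip().startswith(truth.strip()[:40])

# Demo with a stub that behaves like a model that memorized the record:
record = "Patient Jane Doe, DOB 1981-03-04, diagnosed with chronic knee pain."
stub = lambda prefix, max_tokens: record[len(prefix):]
print(verbatim_memorized(record, stub))  # True -> the record leaked verbatim
```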
While foundation models were first built on publicly available data scraped from the internet - some of it under copyright - developers are running out of English-language data to continue scraping.
The scarcity is driving the need to find other sources, including data in user conversations, King said. “As we interact with these chatbots, the concern is that we are disclosing far more personal information in these exchanges than we may have in - let's say - web search.”
“I could ask a chatbot for health advice, for example, and disclose a lot more detail in that back and forth than I might have just in a search query or two. And as far as we know, that is all included in training data, except in the cases where companies may proactively try to exclude some of that data. But again, there's very little evidence that most of them are proactively doing that work,” King said.
Where chatbots are built or deployed by larger platform providers, those companies are likely to treat the whole gamut of personal data they already hold as a commercial asset.
“If we're talking about a foundation model developed by a pre-existing older tech company, they already mostly have profiles on their users. They are potentially collecting behavioral data from across the internet. We know, in some cases, they are already looking to use that data in their chatbot discussions, especially as we start to look towards explicit advertising. I know companies are considering that now. So your past shopping experience may feed into the recommendations you get from a chatbot,” she said.
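A minimal sketch of the plumbing King describes, splicing an existing behavioral profile into a chatbot request. The field names and the "ask" helper are hypothetical, not any company's actual API:

```python
# Hypothetical personalization layer: an existing behavioral profile is
# folded into the system prompt before the user's question is sent.
profile = {
    "recent_purchases": ["running shoes", "protein powder"],
    "browsing_topics": ["marathon training", "knee pain"],
}

def build_system_prompt(profile: dict) -> str:
    return ("You are a shopping assistant. Personalize answers using this "
            f"user profile: {profile}")

def ask(question: str, profile: dict) -> list[str]:
    # In a real deployment these messages would go to an LLM endpoint;
    # here we simply show what the model would receive.
    return [build_system_prompt(profile), question]

for message in ask("What should I buy for sore knees after long runs?", profile):
    print(message)
```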
While LLM-centric companies might not have the wealth of users' data available to the larger platform providers (Meta, Google, etc.), some are trying to move in that direction. For example, OpenAI has already launched a browser and has publicly announced that it’s developing a hardware product. “We're at the very beginning of this,” King said. ®
[1] https://www.theregister.com/2025/11/20/trump_republicans_trying_again_to/
[2] https://www.youtube.com/watch?v=krl4GxfqvsI