AI Has Already Run Out of Training Data, Goldman's Data Chief Says (businessinsider.com)
- Reference: 0179632574
- News link: https://slashdot.org/story/25/10/02/191224/ai-has-already-run-out-of-training-data-goldmans-data-chief-says
- Source link: https://www.businessinsider.com/ai-training-data-shortage-slop-goldman-sachs-2025-10
Developers have been using synthetic data -- machine-generated material that offers unlimited supply but carries quality risks. Raphael said he doesn't think the lack of fresh data will be a massive constraint. "From an enterprise perspective, I think there's still a lot of juice I'd say to be squeezed in that," he said. Proprietary datasets held by corporations could make AI tools far more valuable. The challenge is "understanding the data, understanding the business context of the data, and then being able to normalize it."
Copying is easier than innovation (Score:1)
It is very easy to copy other people, hard to innovate. One of the reasons the US has kept up with China.
Similarly, it is very hard for someone to create something smarter than they are. One of the reasons I am not afraid of AI - it will never become smarter than a human. Its speed of growth comes from following in our footsteps; that does not translate into breaking new ground.
Re: (Score:2)
Not everything is about smarts though, some of it is just volume.
How many polymer formulations can you model and observe the properties of in the search for some new plastic material for a specific application? The AI does not have to be smarter, it just has to be faster than the human materials scientists before it.
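That volume-over-smarts point can be sketched as a toy brute-force screen. Everything here is made up for illustration - the property model is a stand-in where a real screen would call a physics simulation or a learned surrogate - but the shape of the argument is the same: enumerate far more candidates than a human ever could.

```python
# Toy brute-force screen over hypothetical polymer "recipes".
# The scoring function is completely made up; the point is that a
# machine can exhaustively evaluate candidates, not that it is smart.
import itertools

def predicted_strength(a: int, b: int, c: int) -> float:
    # Stand-in property model (entirely fictional coefficients).
    return 3 * a + 2 * b - a * b + 0.5 * c

# 1,000 candidate formulations: three ingredients, ten levels each.
candidates = itertools.product(range(10), range(10), range(10))
best = max(candidates, key=lambda abc: predicted_strength(*abc))

print(best, predicted_strength(*best))  # → (9, 0, 9) 31.5
```

No insight is required anywhere in that loop; the machine just outruns the human's ability to try things.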
Re: (Score:2)
> Not everything is about smarts though, some of it is just volume.
> How many polymer formulations can you model and observe the properties of in the search for some new plastic material for a specific application? The AI does not have to be smarter, it just has to be faster than the human materials scientists before it.
In the end though, that's just (a set of) new algorithms. The FFT already transformed the world. The ability of mobile phones to handover from one cell to another automatically transformed communication. Being able to fold proteins more efficiently will transform biology. There are lots of these examples. They are important and valuable but they aren't smart.
None of them comes close to the effect that an AGI would have. If you could get something as smart as a cat, it's reasonably likely that small tweaks a
Got enough for bootstrapping (Score:2)
AI's model weights (like a person's brain) don't need to store everything - the weights (the bare LLM) need to store enough to read instructions, search databases and documents, and recognize things. Which they already do. Once the AI is "out in the world," it will gather its own subsequent data.
For example, how does a self-driving car 'run out of training data'? They are gathering vast amounts every day from their already-deployed cars. Probably more than they can handle.
Same with call center AI.
Re: (Score:3)
Tell me you have no clue how AI training works without telling me you have no clue how AI training works ...
Re: (Score:2)
Say a self driving car needs to sample each road 10,000 times to learn how to drive it. It may have sampled a major highway 50,000 times but it's not going to learn anything extra so it has run out of training data. Now say there is a little dirt road that a self driving car has only gone down twice. It has run out of training data to be able to drive that particular road. There can be thousands of cars driving around but if they aren't taken away from the places where enough is already known and put in
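The diminishing-returns argument above can be sketched as a toy coverage calculation. All the numbers here are hypothetical (the 10,000-sample threshold comes from the comment, the road names and counts are invented), but it shows why fleet-wide volume doesn't mean per-road coverage.

```python
# Toy per-road training coverage (all numbers hypothetical).
# A fleet can log millions of miles yet still be data-starved on rare roads.

SAMPLES_NEEDED = 10_000  # hypothetical samples needed to "learn" a road

road_samples = {
    "major_highway": 50_000,  # oversampled: extra laps teach nothing new
    "suburban_loop": 12_000,
    "dirt_road": 2,           # effectively no training data
}

for road, n in road_samples.items():
    coverage = min(n / SAMPLES_NEEDED, 1.0)
    status = "saturated" if n >= SAMPLES_NEEDED else "data-starved"
    print(f"{road}: {n} samples, coverage {coverage:.2%} ({status})")
```

The highway's coverage is capped at 100% no matter how many more passes are logged, while the dirt road sits at 0.02% - which is the sense in which the fleet has "run out" of useful data for the roads it already knows.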
"synthetic data" (Score:2)
"synthetic data" is made-up crap. At BEST it can only ever give results that an already existing dataset can give. It can't deal with the possibility that future REAL data may be different than the existing data it was based on. [sarcasm]Shockingly, future data sometimes DOES differ from old data in the real world.[/sarcasm]
Re: (Score:2)
> "synthetic data" is made-up crap.
Indeed. Synthetic data means you could have just put the assumptions the data was generated from into some more reliable AI approaches than an LLM. Newsflash: that was possible before. It was just way too expensive. That constraint remains.
Also, if you put stupid in, adding a data-synthesis step and an LLM training process in its path still gets you stupid out, just with hallucinations on top.
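The core limitation both comments point at can be shown with a minimal sketch, assuming the simplest possible case: synthetic data generated from the statistics of old data can only echo those statistics, so a later shift in the real process is invisible to it.

```python
# Minimal sketch: synthetic data mirrors the data it was fit to,
# and cannot anticipate drift in the real-world process.
import random
import statistics

random.seed(0)

# "Old" real data: some measurement centered at 10 (arbitrary toy numbers).
old_data = [random.gauss(10, 1) for _ in range(1000)]

# Synthetic data generated from the old data's fitted statistics.
mu, sigma = statistics.mean(old_data), statistics.stdev(old_data)
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

# Future real data drifts: the process now centers at 13.
future_data = [random.gauss(13, 1) for _ in range(1000)]

print(f"old mean:       {statistics.mean(old_data):.2f}")
print(f"synthetic mean: {statistics.mean(synthetic):.2f}")  # echoes the old data
print(f"future mean:    {statistics.mean(future_data):.2f}")  # the shift the synthetic set cannot see
```

However sophisticated the generator, the synthetic set's mean tracks the old data's mean, not the future's - which is exactly the "future REAL data may differ" objection above.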
How about "No." (Score:2)
> "From an enterprise perspective, I think there's still a lot of juice I'd say to be squeezed in that," he said. Proprietary datasets held by corporations could make AI tools far more valuable. The challenge is "understanding the data, understanding the business context of the data, and then being able to normalize it."
Most business owners have become aware of how tech companies treat any data they get ahold of. I don't know of many that will be all aboard for allowing these AI companies full access to their corporate-owned proprietary datasets to train their public models. And while some businesses will happily let an AI running in a local, company-owned data center train on their proprietary datasets, they're not going to sign off willingly on letting that training data escape the company network. So, they'll be able to tra
Re: (Score:2)
Interesting comments. The quote you chose is where the future lies, imho. The idea that all data is good data and that ChatGPT is all things to all people was naively flawed from the beginning. Based on my experience, businesses want deterministic answers to specific questions, not a distribution of "sort of" answers. What I'm reading from calling data "synthetic" means TELLING the model what is right or wrong. You must bias your model, also known as "teaching" your model. The idea that all you have to do i
Re: (Score:2)
> Interesting comments. The quote you chose is where the future lies, imho. The idea that all data is good data and that ChatGPT is all things to all people was naively flawed from the beginning. Based on my experience, businesses want deterministic answers to specific questions, not a distribution of "sort of" answers. What I'm reading from calling data "synthetic" means TELLING the model what is right or wrong. You must bias your model, also known as "teaching" your model. The idea that all you have to do is give the model EVERYTHING then it will magically be "smart" is ludicrous from inception. Marketing 101 says that the more tightly you define your market, the more accurately you can provide a solution, and the more money you can charge for it. i.e. Autocad. What I think you're saying is that companies want to train LLMs or whatever it is, on their corporate knowledge and culture and data and restrictions. They want the AI customized to their use case. That seems logical to me. The smart ones will do everything themselves and not allow Big Brains anywhere near their data or model. Work from first principles, own your valuable intellectual property. Don't give it away, chumps.
Yes, spinning up open, non corporate models and "teaching" them in the business is an option for corporations. Buying proprietary models, or worse, using "cloud" (other people's) models as a catch-all, will never lead to gains for the businesses involved, only gains in data aggregation for the companies selling the models.
I do think there's some value in domain specific AIs being trained within businesses. I've done a minor bit of that myself for one of my hobbies, feeding one my fictional universe and "cor
AI belongs to the Big four (Score:2)
Google, Apple, Microsoft, and Facebook. They're the only ones that own large enough platforms to reliably source good training data and have control of those platforms in a way that they can tell the difference between a real user and slop.
AI has a lot of problems and a lot of reasons it's only going to be bad for humanity. But one of those reasons is the way training data works: AI automatically consolidates around a couple of big players, leaving everybody else out in the cold.
That means no real comp
No surprise (Score:3)
This is entirely expected. That this limit has been reached (probably some time ago) is one of the indicators that this stupid hype may not go on much longer. It also means that what we have now is the max we will get in capabilities for the foreseeable future. Except for very specialized models. Maybe.
The limiting factor is algorithms (Score:2)
There is no way that the existing data is insufficient for "AI". All of wikipedia contains more knowledge than any human can absorb, and this isn't anywhere near the amount of data the big companies have for training. The data exists.
The limiting factor is that there isn't an algorithm to take advantage of that data. Current methods are just statistical models that try to duplicate existing patterns (and they do this well). They work merely because a brute-force use of gigantic data sets gives reasonably
Ok, here's an idea... (Score:2)
Old people know lots of things and are lonely - send the boffins in to chat with old people and train models. Old people get to chat; companies get to make money.
"proprietary datasets" (Score:2)
"Proprietary datasets, like those that come from businesses' data, may hold the key to the data hole."
And now you know why Microsoft and Google are doing everything they can to force you at gunpoint to store your confidential and proprietary business information, personal financial information, personal health information, personal letters, photos, and everything else that should be held in strict confidence, in their cloud for "FREE" for some reason.
Is it Halloween already ? (Score:2)
The LLM zombie is yearning for more brains.
The withered teat has been sucked dry (Score:1)
and the future is safe from Skynet.
How angry are you? (Score:2)
Rage against the AI machine? People want wise oracles, but the AIs are stupid and some people are therefore learning to limit their thinking to what the AI is good at talking about. Even worse, some folks think the YUGE Orange Buffoon is some kind of gawd king...
How about updating the old song to "AI can do anything better than you"?
(And I think all the ACs could be replaced with a genAI set to the style of "stupid".)