AI models at risk of ‘collapsing’ from AI-generated data, new study finds
Essentially, after a few generations these models can start to lose track of the original information, producing outputs that are increasingly detached from reality
Artificial intelligence (AI) models fed solely on computer-generated information risk ‘collapsing,’ generating repetitive content in a continuous loop, a new study found.
The study, titled ‘AI models collapse when trained on recursively generated data’, warned that this could have significant implications for the future development of AI.
The research was carried out by a team of six researchers comprising Ilia Shumailov and Yarin Gal from the University of Oxford, Zakhar Shumaylov and Ross Anderson from the University of Cambridge, Yiren Zhao from Imperial College London, and Nicolas Papernot from the University of Toronto.
The researchers posited that if new generative AI models are trained on text and images generated by their own previous versions or by older models, the results will inevitably be less accurate and less diverse than if they had been trained on human-generated material, causing a ‘model collapse’.
The study, published in Nature, also stated that the development of these Large Language Models (LLMs) is “very involved and requires large quantities of training data.”
“Yet, although current LLMs, including GPT-3, were trained on predominantly human-generated text, this may change,” said the study.
Essentially, after a few generations these models can start to lose track of the original information, producing outputs that are increasingly detached from reality.
“Over time, models start losing information about the true distribution, which first starts with tails disappearing, and learned behaviors converge over the generations to a point estimate with very small variance,” the study noted.
This, according to the researchers, can happen for several reasons: limited data, which causes models to lose information over time; the complexity of the models themselves, which causes them to misinterpret information; and the methods used to train them.
As these errors accumulate across generations of AI models, the problem worsens, leading to a decline in performance.
“In other words, the use of LLMs at scale to publish content on the Internet will pollute the collection of data to train their successors: data about human interactions with LLMs will be increasingly valuable,” said the researchers.
While the findings highlighted the importance of careful data generation in AI training, the authors did acknowledge that most AI models are trained on a mix of human-generated and synthetic data, reducing the risk of model collapse.
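The narrowing effect the study describes, with tails disappearing and outputs converging towards a point with very small variance, can be illustrated with a toy simulation. The sketch below is not the researchers' experiment: it assumes a simple bell-curve (Gaussian) "model" that is refitted, generation after generation, to a small sample drawn from its own previous version. The function name `simulate_collapse` and the values for `generations`, `sample_size` and `seed` are hypothetical choices made purely for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples; return the fitted spread per generation."""
    rng = random.Random(seed)
    # Generation 0 stands in for the original, human-generated data (an assumption).
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # The current "model" generates a small synthetic dataset...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and the next "model" is fitted only to that synthetic data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"spread at generation 0: {history[0]:.3g}")
print(f"spread at generation {len(history) - 1}: {history[-1]:.3g}")
```

In this toy setting each refit tends to lose a little of the spread in the data, so over many generations the fitted spread shrinks towards zero. Mixing some of the original data back into every generation, which the sketch does not do, would be expected to slow that drift.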