Artificial intelligence (AI) has revolutionized sectors from healthcare and retail to entertainment and art. A new study, however, indicates that we may have reached a critical juncture: AI learning from AI-generated content.
Researchers from the Universities of Cambridge and Oxford, Imperial College London, and the University of Toronto have raised a warning about a phenomenon they call "model collapse": a degenerative process with the potential to sever the connection between AI and the real world.
In their paper titled “The Curse of Recursion: Training on Generated Data Makes Models Forget,” the researchers highlight that model collapse occurs when “generated data ends up polluting the training set of the next generation of models.” Consequently, these models, trained on contaminated data, begin to misinterpret reality.
The extensive volume of AI-generated content available online may perpetuate this issue. It can create a feedback loop where AI systems inadvertently incorporate distorted and inaccurate information from their own outputs, leading to further discrepancies.
Model collapse has been documented across a range of generative models and tools, including large language models (LLMs), variational autoencoders, and Gaussian mixture models. Over time, these models gradually lose their grasp of the true underlying data distribution, producing skewed representations that bear little resemblance to real-world data.
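The dynamic is easy to reproduce in miniature. The toy sketch below (an illustration of the idea, not an experiment from the paper) repeatedly fits a single Gaussian to data, then trains the next "generation" only on samples drawn from that fit; the true distribution (mean 0, standard deviation 1) and sample sizes are arbitrary choices for the demo.

```python
import random
import statistics

# Toy illustration of recursive training (not from the paper): fit a
# Gaussian to samples, then train the next generation only on samples
# drawn from the fitted model rather than from the real world.
random.seed(0)

def fit(samples):
    # Estimate mean and standard deviation from a finite sample.
    return statistics.mean(samples), statistics.stdev(samples)

# Generation 0: "real-world" data from a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(200)]

for generation in range(10):
    mu, sigma = fit(data)
    # Each new generation sees only the previous model's output.
    data = [random.gauss(mu, sigma) for _ in range(200)]

# After several generations, the fitted parameters tend to drift away
# from the true values, and the original tails are gradually forgotten.
print(fit(data))
```

Because each generation only ever sees a finite sample of the previous model's output, estimation noise compounds instead of averaging out, which is the core of the feedback loop the researchers describe.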
Machine learning models are already being trained on AI-generated data. For example, some large language models are deliberately trained on outputs from GPT-4, while the online platform DeviantArt allows AI-created artwork to serve as training data for newer AI models. As the researchers highlight, these practices, akin to making copies of copies, may increase the risk of model collapse.
Given the gravity of model collapse, maintaining access to the original human-generated data source becomes crucial: AI models require authentic, human-produced data to accurately comprehend and simulate our world.
The research paper identifies two primary causes of model collapse. The first is "statistical approximation error," which arises because models learn from a finite number of data samples, so low-probability events can be missed entirely. The second is "functional approximation error," which stems from the limited expressiveness of the function approximators used in training. These errors compound over generations, producing a cascading effect of escalating inaccuracies.
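Statistical approximation error has a particularly stark consequence: once a generation's finite sample happens to contain no examples of a rare event, the fitted model assigns it zero probability, and the event can never reappear in later generations. The sketch below illustrates this with a hypothetical two-category distribution (the category names, probabilities, and sample size are invented for the demo).

```python
import random
from collections import Counter

random.seed(1)

# Hypothetical "world": a common token and a rare one.
true_dist = {"common": 0.99, "rare": 0.01}

def sample(dist, n):
    # Draw a finite training set from the current model.
    categories = list(dist)
    weights = [dist[c] for c in categories]
    return random.choices(categories, weights=weights, k=n)

def fit(samples):
    # Empirical frequencies: the next model's estimate of the distribution.
    counts = Counter(samples)
    total = len(samples)
    return {c: counts[c] / total for c in counts}

dist = dict(true_dist)
for generation in range(20):
    data = sample(dist, 50)  # finite sample: rare events may be missed
    dist = fit(data)         # next generation trains on generated data

# Once a generation draws no "rare" samples, its estimated probability
# becomes zero, and the category is permanently lost from later models.
print(dist)
```

This is a caricature of the real setting, but it shows why the tails of a distribution, the rare and unusual data, are the first casualties of recursive training.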
The paper suggests that a “first-mover advantage” exists in training AI models. Preserving access to the original human-generated data source may avert detrimental distribution shifts and subsequently prevent model collapse. However, effectively distinguishing AI-generated content at scale poses a formidable challenge, necessitating comprehensive coordination within the AI community.
Ultimately, AI systems are only as reliable as the data they are trained on, and the surge in AI-generated content presents a double-edged sword for the industry. As the saying goes, "garbage in, garbage out": AI trained on AI content may yield highly capable yet "delusional" machines.
This ironic twist in the narrative underscores the risk of our machine progeny learning more from each other than from their human creators, foreshadowing a future where we contend with an adolescent ChatGPT and its delusions.
