AI trained on AI tends to collapse (like most cannibalizing systems do).

Venture Beat, a tech investing magazine, weighs in on research that looks directly at the problems that arise when AI is used to generate web content, which is then gathered up and used to train the next generation of AI. The results of the practice (which currently seems practically inevitable) are pretty disastrous:

“Over time, mistakes in generated data compound and ultimately force models that learn from generated data to misperceive reality even further,” wrote one of the paper’s leading authors, Ilia Shumailov, in an email to VentureBeat. “We were surprised to observe how quickly model collapse happens: Models can rapidly forget most of the original data from which they initially learned.”

In other words: as an AI training model is exposed to more AI-generated data, it performs worse over time, producing more errors in the responses and content it generates, and producing far less non-erroneous variety in its responses.

As another of the paper’s authors, Ross Anderson, professor of security engineering at Cambridge University and the University of Edinburgh, wrote in a blog post discussing the paper: “Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data.”


Shumailov illustrated this problem for VentureBeat with a hypothetical scenario, wherein a machine learning model is trained on a dataset with pictures of 100 cats — 10 of them with blue fur, and 90 with yellow. The model learns that yellow cats are more prevalent, but also represents blue cats as more yellowish than they really are, returning some green-cat results when asked to produce new data. Over time, the original trait of blue fur erodes through successive training cycles, turning from blue to greenish, and ultimately yellow. This progressive distortion and eventual loss of minority data characteristics is model collapse. To prevent this, it’s important to ensure fair representation of minority groups in datasets, in terms of both quantity and accurate portrayal of distinctive features. The task is challenging due to models’ difficulty learning from rare events.

You can read the researchers’ work here, in arXiv.