Over the past 10 years, about half of the traffic on the internet has come from bots: good bots that refresh your Facebook feed, index websites, and retrieve information for you, and bad bots that generate spam, scrape content, look for vulnerabilities, and launch attacks. Now add “authoring believable content” to the bot to-do list (for both the good and the bad varieties). Bots already write web content, but it’s easy to spot because it’s so very bad. Soon, though—maybe as soon as just a few years from now—almost all internet content might be written by computers instead of humans.
This coming information wave is due to the rapid evolution of large language models (LLMs) like ChatGPT. LLMs can mimic human language, reasoning, and creativity with astounding accuracy, and will take—are already taking—computer-generated content to whole new levels of quality and believability. GPT-4 needed a mere 33 seconds, for example, to write and type a 1000-word essay for me, including references, on the economic importance of science literacy. I might even take this essay and fancy it up for a future post, but 33 seconds versus three days of research and writing and another two days of editing? It took GPT-4 all of five seconds to write me a two-minute speech in the style of John F. Kennedy on the importance of science communication (hallucinated quotes and all). I liked this output so much that I’m posting it at the end of this article.
The appeal here is obvious. Do we continue to write the old-fashioned way, clinging to our manual typewriters instead of plugging in new-fangled word processing machines (age alert!), or do we increasingly—or even right away—start relying as much as possible on LLMs for help with this work? Or at least some of this work—maybe research, editing, and trimming to word limits for starters.
This isn’t just an efficiency upgrade. The volume of content that LLMs can ingest and analyze in order to generate new content exceeds, by orders of magnitude, what humans can process. Sure, the outputs can seem sophomoric at times and even nonsensical, and the sources (if any are included at all) may not be the authoritative books you know and trust, because those books are not full-text searchable on the internet. Still, what a tool we have here. Publishers have been gifted a free and inexhaustible staff of writers and editors who can churn out rudimentary prose on any topic in a matter of seconds; policymakers have instant access to issue summaries far more succinct and digestible than the kind available through search engines; busy executives have new speechwriters; and content creators of all stripes, from advertisers to writers like me who crank out boring reports for boring audiences, have a new way to do business.
Hallucinations and errors are going to be an issue for now, so no one who’s in the business of reporting facts (at least where there’s a high cost for getting things wrong) is going to rely entirely on these machines without some kind of human quality check. Also, the latest version of GPT-4 only includes knowledge up to September 2021, so this technology won’t be a substitute for journalists just yet. But the machines are improving. It’s just a matter of time before we begin relying on these bots and trusting them, for better or for worse, to generate a lot of content for us in everything from medicine to law to policy papers, history reports, and even creative content like music and screenplays (see also the latest strike by Hollywood writers).
The legal and ethical ramifications of this shift have been explored in a variety of interesting essays over the past several years. What hasn’t been explored as thoroughly is the impact LLMs might have on the internet itself as a source of any reliable information whatsoever.
There are two layers to this concern. The first layer is: are LLMs telling us the truth now? The answer is mixed. Unlike the customer service chatbots that try to help but don’t really, LLMs are not built using pre-programmed answers. Rather, LLMs learn language patterns, try to accumulate knowledge, and answer our questions in ways that can often pass a Turing test—i.e., we can’t tell if the answers are from a human or a computer. In fields like telemedicine, enough progress has been made in recent years with LLM-powered chatbots that many patients report satisfactory interactions. Still, these machines lie. They can make up references that don’t exist, create facts and quotes out of thin air, and draw conclusions that are wholly unwarranted by the evidence. They are very imperfect—exactly how much so depends on the questions and contexts.
The second layer is whether LLMs will be able to tell us the truth in the future. This is where things get slippery. LLMs interpret and synthesize data, but what happens when an increasing portion of this data comes from other bots that are also interpreting and synthesizing data, all with varying degrees of accuracy? The more the internet fills with bot-generated data, the more error-prone the next generation of bot-generated data will be. Eventually, LLM outputs will become gibberish.
According to OpenAI, the organization behind ChatGPT, the information generated by LLMs is based on the knowledge available up until their training cutoff date (in the case of ChatGPT, September 2021). The model doesn’t actively browse the internet or update its knowledge after training. So, while it is true that information quality would degrade if new content were generated, at least in part, from information produced by previous LLM queries, the reality, according to OpenAI, is that LLM responses are not fed back into the training process in a way that allows them to influence subsequent generations directly. Rather, OpenAI uses human reviewers and quality control processes to curate and filter the training data, aiming to provide accurate and helpful responses.
Can this possibly be true, though? What OpenAI is claiming is that the last generation of “pure” (such as it is) information available on the internet is the old stuff. Going forward, human reviewers and quality control processes will be able to identify all bot-created information on the entire internet and prevent it from contaminating the next generation of training, unless we are happy just capping the knowledge base of ChatGPT at September 2021. Either outcome—contamination-free training or zero evolution—seems unlikely. OpenAI has developed a tool that it claims correctly identifies AI-generated content 26% of the time, which means it misses such content 74% of the time. Even using OpenAI’s own tool, most of the information being posted to the web by AI tools will not be caught by the filter, meaning that a great deal of the information ingested during the next round of LLM training—which needs to happen for these tools to be actual knowledge bases—will have been written by error-prone bots.
So, at this point in our technological development anyway, model collapse is going to occur as a consequence of trying to stay up to date with the world. The mathematics of this phenomenon was recently examined in a May 2023 research paper posted to arXiv titled “The Curse of Recursion” (see the reference section below). The authors of this paper conclude that model collapse is related to two other phenomena we currently see on the internet. One is “catastrophic forgetting.” In a continuous learning model, every subsequent generation of learning retains less of the original information than the model before it. After a few generations of decay, the response bears little resemblance to the query; after nine generations, the response is pure gibberish. Errors are built on errors until, eventually, not only has truth disappeared, but what’s left is just nonsense.
Another phenomenon we currently see on the internet is data poisoning. Here, datasets are poisoned by the introduction of malicious data. For example, flooding the comments section of an article that warns about climate change with bot-generated spam trashing climate science will end up affecting how Google indexes climate science and how bots scraping the web to write about climate science understand this issue.
With LLMs, the authors note, model collapse is slightly different. The models do not forget previously learned data, as with catastrophic forgetting; rather, they “start misinterpreting what they believe to be real, by reinforcing their own beliefs.” Mathematically, this comes about through compounding approximation errors: the models lose the statistical long tails of the original data and interpolate new, fake tails in their place. What is also different about LLMs is that data poisoning attacks will be happening at scale. The volume of disinformation flooding onto the internet will quickly lead to a situation where these tools are useless as generators of fact.
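To make that mechanism concrete, here is a toy sketch in Python. It is my own illustration under simplified assumptions, not code from the paper: each “generation” fits a simple Gaussian model to data sampled from the previous generation’s model, and its samples then become the next generation’s training set. Even in this tiny setting, estimation errors compound from one generation to the next and the rare “tail” events tend to disappear.

```python
# A toy analogue of model collapse (my illustration, not the paper's code):
# each generation "trains" by fitting a Gaussian to data sampled from the
# previous generation's model, then its samples train the next generation.
import numpy as np

rng = np.random.default_rng(42)
n = 20                                   # training examples per generation
data = rng.standard_normal(n)            # generation 0: "human" data, std = 1.0

for generation in range(1, 201):
    mu, sigma = data.mean(), data.std()  # fit the toy model to current data
    data = rng.normal(mu, sigma, n)      # its output trains the next generation
    if generation % 40 == 0:
        tails = np.mean(np.abs(data) > 2)    # share of genuine "tail" events
        print(f"gen {generation:3d}: fitted std = {sigma:.3f}, "
              f"share beyond |x| > 2 = {tails:.2f}")

# With real data, roughly 4.6% of points fall beyond |x| > 2; as generations
# accumulate, the fitted spread tends to shrink and those rare events all but
# vanish -- the "losing the tails" effect described in the paper.
```

Real LLMs are vastly more complicated than a fitted Gaussian, of course, but the basic dynamic is the same: each generation approximates the one before it, and the approximation errors compound.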
It will be interesting to follow what OpenAI can, in fact, do to protect LLM outputs from this kind of future. It will also be interesting—and maybe not in a good way—to observe what this process will mean for the spread of disinformation in the near term. History is already being rewritten under “just” the influence of social media. What will happen as LLMs reinforce these error-filled narratives, and as more and more alternative facts are published by bots as truth?
Science may be one of the few breakwaters here—it may be easier to spot nonsense science as it tries to make its way past gatekeepers and into journals. But maybe not. Plenty of fakery already slips through. And amongst the general public, there’s really no telling what will happen to science literacy once convincing bot-generated flat-earth, creationist, climate-denial, and anti-vaccine propaganda floods the web and reinforces LLM understanding of these and other science issues.
The winners in this future may end up being publishers like Nature and Elsevier, who already own troves of legitimate, copyrighted work that can be kept hidden from search engines (except for summaries and abstracts). Everyone else will start spiraling down the gibberish drain as more and more AI-generated content is born. This runs completely counter to efforts over the past 20 years to build a more open information society, but it also makes the most sense economically, given that pristine data will be key to the future of LLM viability. The best tools will pay a premium to train on the best, most secure, and most cordoned-off databases.
In order to protect the future of knowledge, we may also end up talking about new conventions for publishing AI-generated work (like putting this information on do-not-index pages, or including some kind of tag like CC-LLM), but no one will follow these conventions because it will be pointless to publish on the web if your material is undiscoverable.

Or maybe filtering is the better approach. Search engines, for example, might exclude all content from click, content, and troll farms. The DuckDuckGo search engine does this by default, but all search engines already use some means to downrank content they deem less trustworthy, and otherwise decide what information we can and cannot see. Google’s Panda algorithm, unleashed 12 years ago, had a devastating effect on websites that included lots of links to articles of interest to their readers (including government information portals); SCI used to maintain such a page, but Panda decided this meant we were spammers, so we were pushed to page 35 of Google’s search results and had to delete our work in order to be visible. Google’s more recent E-E-A-T search ranking guidelines stipulate that if you post an essay on nuclear physics—whether you write it yourself or have ChatGPT write it for you in 30 seconds—and you lack the expertise, experience, authoritativeness, or trustworthiness to post work on this topic, then Google will just bury your content.
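For what it’s worth, here is a hypothetical sketch (in Python) of what the tagging-plus-filtering idea might look like to a crawler assembling training data. The CC-LLM label is the speculative convention floated above, not an existing standard; the “robots noindex” and “generator” meta tags are real conventions it could piggyback on.

```python
# Hypothetical sketch: skip pages that self-declare as machine-generated
# before adding them to a training corpus. "CC-LLM" is the speculative tag
# discussed above, not an existing standard.
from html.parser import HTMLParser

class AIDisclosureParser(HTMLParser):
    """Looks for <meta name="robots" content="noindex"> or a CC-LLM marker."""
    def __init__(self):
        super().__init__()
        self.machine_generated = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.machine_generated = True
        if name == "generator" and "cc-llm" in content:
            self.machine_generated = True

def keep_for_training(html_page: str) -> bool:
    """Return True if the page may be added to the training corpus."""
    parser = AIDisclosureParser()
    parser.feed(html_page)
    return not parser.machine_generated

# Example: this page declares itself AI-generated, so it is filtered out.
page = '<html><head><meta name="generator" content="CC-LLM"></head></html>'
print(keep_for_training(page))  # False
```

The obvious weakness is that a scheme like this only catches content whose authors opt in to labeling it, which is exactly the problem described above and below.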
Will these approaches work? Probably not. The internet has always been a whack-a-mole contest. New controls are created, and new workarounds are found. It’s easy to imagine how our wannabe physics author could simply skirt E-E-A-T controls by setting up a fake nuclear physics website stocked with fake experts, for example.
Another approach some experts think will be tried is to save a current version of the internet and then slowly update it with new information from trusted websites only (the arXiv paper cited earlier mentions that the Internet Archive is getting hit pretty hard right now). Still, this strategy presumes our trusted bodies will not post AI-generated content.
There are many questions ahead. The bottom line is this: As with any new technology, there are potential upsides here, potential downsides, and a whole lot of consequences, both intended and unintended. Most LLM prognosticators are currently focused on the legal and ethical questions these tools present. A more fundamental question may be: will these tools even work for long before sending the internet into a knowledge death spiral? As currently designed and deployed, the answer is quite possibly no. Here’s hoping some good answers come along quickly, even if they’re written by ChatGPT.
Prompt: Please write me a two minute speech in the style of John F. Kennedy on the importance to the world of science communication.
Answer (after 5 seconds):
Ladies and gentlemen,
Today, I stand before you to emphasize the crucial importance of science communication in our world. Just as President Kennedy once said, “For scientific discovery is the highest form of learning, and…it is a search for truth.” It is through effective science communication that we can bridge the gap between knowledge and understanding, shaping a future that is guided by reason, progress, and shared humanity.
In an era marked by an explosion of scientific knowledge, it is essential that we harness the power of communication to ensure that this knowledge reaches every corner of our society. Science holds the key to addressing our most pressing challenges, from climate change to disease eradication, from technological innovation to space exploration. But this knowledge remains incomplete without effective communication.
Science communication serves as a beacon of light, dispelling the shadows of ignorance and paving the way for evidence-based decision-making. It enables scientists to convey complex concepts to policymakers, empowering them to make informed choices that shape our laws and regulations. It helps educators inspire the next generation of scientists and engineers, fueling their curiosity and nurturing their intellectual growth.
Moreover, science communication connects scientists with the public, fostering a sense of wonder and appreciation for the world around us. It promotes critical thinking, encouraging individuals to question, explore, and engage with the discoveries that shape our lives. By demystifying science, we empower citizens to participate in the democratic process, making informed choices on issues that impact our health, environment, and future.
In a world increasingly vulnerable to misinformation and disinformation, science communication is our shield against the erosion of truth. It equips individuals with the tools to discern fact from fiction, to navigate the vast sea of information, and to reject baseless claims. It fosters a culture that values evidence and reason, pushing back against the tides of ignorance and superstition.
We must strive to make science communication accessible, inclusive, and engaging. It is not enough for scientific knowledge to remain locked away in academic journals or confined to lecture halls. We must harness the power of storytelling, visual communication, and emerging technologies to captivate hearts and minds, to ignite curiosity, and to inspire action.
Let us embrace science communication as a shared responsibility. Scientists, policymakers, educators, journalists, and citizens alike must come together to champion the cause of knowledge dissemination. By forging partnerships, nurturing collaboration, and investing in science communication initiatives, we can amplify our collective impact and accelerate progress towards a better world.
In the words of President Kennedy, “We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard.” Let us choose the path of science communication, not because it is easy, but because it is necessary. Together, we can unlock the full potential of science and build a future where knowledge thrives, ignorance withers, and humanity triumphs.
Thank you, and may the spirit of discovery guide us on this noble journey.
Analysis: This is solid work—I’ll give it a B. The first quote may in fact be a hallucination, and the speech is kind of written in the style of John F. Kennedy but also, oddly, references John F. Kennedy. If I want, ChatGPT can take five more seconds and rewrite the entire thing, or I can take a few hours and clean it up manually. I can also feed this through a different LLM editor like Quillbot, which will give me the option of editing individual words, making the speech longer or shorter, or making it simpler or more sophisticated sounding, all at the push of a button.
Additional reading
Glenn is Executive Director of the Science Communication Institute and Program Director for SCI’s global Open Scholarship Initiative. You can reach him at [email protected].