Science has been in a “replication crisis” for a decade. Have we learned anything?

Much ink has been spilled over the “replication crisis” in the last decade and a half, including here at Vox. Researchers have discovered, over and over, that lots of findings in fields like psychology, sociology, medicine, and economics don’t hold up when other researchers try to replicate them.

This conversation was fueled in part by John Ioannidis’s 2005 article “Why Most Published Research Findings Are False” and by the controversy around a 2011 paper that used then-standard statistical methods to find that people have precognition. But since then, many researchers have explored the replication crisis from different angles. Why are research findings so often unreliable? Is the problem just that we test for “statistical significance” — the likelihood that similarly strong results could have occurred by chance — in a nuance-free way? Is it that null results (that is, when a study finds no detectable effects) are ignored while positive ones make it into journals?

A recent write-up by Alvaro de Menard, a participant in the Defense Advanced Research Projects Agency’s (DARPA) replication markets project (more on this below), makes the case for a more depressing view: The processes that lead to unreliable research findings are routine, well understood, predictable, and in principle pretty easy to avoid. And yet, he argues, we’re still not improving the quality and rigor of social science research.

While other researchers I spoke with pushed back on parts of Menard’s pessimistic take, they do agree on something: a decade of talking about the replication crisis hasn’t translated into a scientific process that’s much less vulnerable to it. Bad science is still frequently published, including in top journals — and that needs to change.

Most papers fail to replicate for totally predictable reasons

Let’s take a step back and explain what people mean when they refer to the “replication crisis” in scientific research.

When research papers are published, they describe their methodology, so other researchers can copy it (or vary it) and build on the original research. When another research team tries to conduct a study based on the original to see if they find the same result, that’s an attempted replication. (Often the focus is not just on doing the exact same thing, but approaching the same question with a larger sample and preregistered design.) If they find the same result, that’s a successful replication, and evidence that the original researchers were on to something. But when the attempted replication finds different or no results, that often suggests that the original research finding was spurious.

In an attempt to test just how rigorous scientific research is, some researchers have undertaken the task of replicating research that’s been published in a whole range of fields. And as more and more of those attempted replications have come back, the results have been striking — it is not uncommon to find that many, many published studies cannot be replicated.

One 2015 attempt to reproduce 100 psychology studies was able to replicate only 39 of them. A big international effort in 2018 to reproduce prominent studies found that 14 of the 28 replicated, and an attempt to replicate studies from top journals Nature and Science found that 13 of the 21 results looked at could be reproduced.
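To put those counts on a common scale, here is a minimal Python sketch that turns them into replication rates. The counts are the ones reported in the paragraph above; the labels and the function name are only illustrative.

```python
# Replication counts as reported above: (studies that replicated, studies attempted).
results = {
    "2015 psychology project": (39, 100),
    "2018 international effort": (14, 28),
    "Nature and Science studies": (13, 21),
}


def replication_rate(replicated, attempted):
    """Share of attempted replications that reproduced the original result."""
    return replicated / attempted


rates = {name: replication_rate(*counts) for name, counts in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

That works out to 39 percent, 50 percent, and about 62 percent, respectively.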

The replication crisis has led a few researchers to ask: Is there a way to guess if a paper will replicate? A growing body of research has found that guessing which papers will hold up and which won’t is often just a matter of looking at the same simple, straightforward factors.

A 2019 paper by Adam Altmejd, Anna Dreber, and others identifies some simple factors that are highly predictive: Did the study have a reasonable sample size? Did the researchers squeeze out a result barely below the significance threshold of p = 0.05? (A paper can often claim a “significant” result if this “p” threshold is met, and many use various statistical tricks to push their paper across that line.) Did the study find an effect across the whole study population, or an “interaction effect” (such as an effect only in a smaller segment of the population) that is much less likely to replicate?
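To see why sample size matters so much, consider a toy simulation in Python. It is only a sketch, not the method used in the paper above, and every number in it is an assumption chosen for illustration: half of all tested hypotheses are true, true effects are 0.3 standard deviations, only results with p < 0.05 get published, and each published result is re-run once with the same design.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05        # the significance threshold discussed above
TRUE_EFFECT = 0.3   # assumed effect size (in standard deviations) when a hypothesis is true
SHARE_TRUE = 0.5    # assumed share of tested hypotheses that are actually true
STANDARD_NORMAL = NormalDist()


def run_study(effect, n, rng):
    """Simulate one study; return (two-sided p-value, sign of the estimate)."""
    # The mean of n unit-variance observations has standard error 1/sqrt(n),
    # so we draw the estimated effect directly from that distribution.
    estimate = rng.gauss(effect, 1 / math.sqrt(n))
    z = estimate * math.sqrt(n)
    p_value = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
    return p_value, (1 if estimate > 0 else -1)


def replication_rate(n, studies=20000, seed=0):
    """Share of 'published' (significant) results that replicate at sample size n."""
    rng = random.Random(seed)
    published = replicated = 0
    for _ in range(studies):
        effect = TRUE_EFFECT if rng.random() < SHARE_TRUE else 0.0
        p_value, sign = run_study(effect, n, rng)
        if p_value >= ALPHA:
            continue  # in this toy model, non-significant results are never published
        published += 1
        # A replication succeeds if it is also significant, in the same direction.
        p_rep, sign_rep = run_study(effect, n, rng)
        if p_rep < ALPHA and sign_rep == sign:
            replicated += 1
    return replicated / published


small = replication_rate(n=20)
large = replication_rate(n=200)
print(f"n=20:  {small:.0%} of significant findings replicate")
print(f"n=200: {large:.0%} of significant findings replicate")
```

In this toy setup, significant findings from the 20-person studies replicate far less often than those from the 200-person studies, even though both cleared the same p < 0.05 bar.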

Menard argues that the problem is not so complicated. “Predicting replication is easy,” he said. “There’s no need for a deep dive into the statistical methodology or a rigorous examination of the data, no need to scrutinize esoteric theories for subtle errors — these papers have obvious, surface-level problems.”

A 2018 study published in Nature had scientists place bets on which of a pool of social science studies would replicate. The bets in this market turned out to be highly accurate predictors of which papers would replicate.

“These results suggest something systematic about papers that fail to replicate,” study co-author Anna Dreber argued after the study was released.

Additional research has established that you don’t even need to poll experts in a field to guess which of its studies will hold up to scrutiny. A study published in August had participants read psychology papers and predict whether they would replicate. “Laypeople without a professional background in the social sciences are able to predict the replicability of social-science studies with above-chance accuracy,” the study concluded, “on the basis of nothing more than simple verbal study descriptions.”

The laypeople were not as accurate in their predictions as the scientists in the Nature study, but the fact that they were still able to predict many failed replications suggests that many of those studies have flaws that even a layperson can notice.

Bad science can still be published in prestigious journals and be widely cited

Publication of a peer-reviewed paper is not the final step of the scientific process. After a paper is published, other researchers might cite it, spreading any misconceptions or errors in the original paper. But research has established that scientists have good instincts for whether a paper will replicate or not. So, do scientists avoid citing papers that are unlikely to replicate?

This striking chart from a 2020 study by Yang Yang, Wu Youyou, and Brian Uzzi at Northwestern University illustrates their finding that actually, there is no correlation at all between whether a study will replicate and how often it is cited. “Failed papers circulate through the literature as quickly as replicating papers,” they argue.

Looking at a sample of studies from 2009 to 2017 that have since been subject to attempted replications, the researchers find that studies have about the same number of citations regardless of whether they replicated.

If scientists are pretty good at predicting whether a paper replicates, how can it be the case that they are as likely to cite a bad paper as a good one? Menard theorizes that many scientists don’t thoroughly check — or even read — papers once published, expecting that if they’re peer-reviewed, they’re fine. Bad papers are published by a peer-review process that is not adequate to catch them — and once they’re published, they are not penalized for being bad papers.

The debate over whether we’re making any progress

Here at Vox, we’ve written about how the replication crisis can guide us to do better science. And yet blatantly shoddy work is still being published in peer-reviewed journals despite errors that a layperson can see.

In many cases, journals effectively aren’t held accountable for bad papers. Many, like The Lancet, have retained their prestige even after a long string of embarrassing public incidents in which they published research that turned out to be fraudulent or nonsensical. (After a study on Covid-19 and hydroxychloroquine was retracted this spring over questions about its data source, The Lancet said it would change its data-sharing practices.)

Even outright frauds often take a very long time to be repudiated, with some universities and journals dragging their feet and declining to investigate widespread misconduct.

That’s discouraging and infuriating. It suggests that the replication crisis isn’t a matter of one specific methodological reevaluation, but a symptom of a scientific system that needs rethinking on many levels. We can’t just teach scientists how to write better papers. We also need to change the fact that those better papers aren’t cited more often than bad papers; that bad papers are almost never retracted even when their errors are visible to lay readers; and that there are no consequences for bad research.

In some ways, the culture of academia actively selects for bad research. Pressure to publish lots of papers favors those who can put them together quickly, and one way to be quick is to be willing to cut corners. “Over time, the most successful people will be those who can best exploit the system,” Paul Smaldino, a cognitive science professor at the University of California, Merced, told my colleague Brian Resnick.

So we have a system whose incentives keep pushing bad research even as we understand more about what makes for good research.