How many scientific papers are fake?

Recent studies show that fabrication and falsification of scientific results may be more common than previously thought. A new review estimates that approximately one in seven papers is fake.

The replication crisis

Scientific misconduct is a sensitive topic. Proving that a researcher has falsified data is challenging, and accusations can have serious consequences. This may explain why, for a long time, the subject received little attention. It was assumed that fraud was rare and the result of a few bad apples. Most cases were discovered not by editors or peer reviewers but by whistleblowers (often PhD students) within the same research team.

Until recently, few people in science were actively looking for fraud. While systematic reviewers assessed the quality of studies, they rarely considered that some results may have been entirely fabricated. This was not part of their job. Almost everyone assumed researchers were acting in good faith.

An often-cited estimate comes from a 2009 meta-analysis of surveys on research misconduct. It found that approximately 2% of scientists “admitted to have fabricated, falsified or modified data or results at least once.” Although the surveys were anonymous, fraudsters may have been reluctant to answer such a sensitive question truthfully. The estimate also reflects the proportion of researchers who admit to falsifying data, not the proportion of papers that are false. Researchers who fake data can likely churn out papers at a much higher rate than honest ones.

In the past 15 years, things have slowly started to change. Large collaborations found that many influential studies in psychology and biomedical sciences could not be reproduced. As a response to this ‘replication crisis’, scientific methods and results were scrutinized more closely. Data sleuths started looking for impossible statistics and manipulated images. There were a lot of skeletons in the closet.

A flood of retractions

You might have heard about some of the recent scandals involving scientific misconduct. One of the bigger ones involves the amyloid hypothesis in Alzheimer’s research. An influential 2006 paper in Nature, which appeared to support the view that buildup of amyloid proteins is a key driver of neurodegeneration, was retracted because of doctored images. Similarly, the pharmaceutical company Cassava Sciences recently halted its clinical trials for Alzheimer’s after it was revealed that the research underlying them also contained manipulated images.

Another scandal affected the Harvard-affiliated Dana-Farber Cancer Institute, which retracted six papers and issued corrections in many others after a biologist exposed multiple issues in a blog post. Meanwhile, the president of Stanford University resigned following revelations that his research had an “unusual frequency of manipulation of research data and/or substandard scientific practices.”

At the beginning of the SARS-CoV-2 pandemic, a widely cited French paper claimed that hydroxychloroquine (brand name Plaquenil) was an effective treatment for COVID-19. This study has now been retracted because of major scientific flaws. Upon further scrutiny, more than 30 other publications from the same research group were also retracted.

There’s more to the story. A major study in the Lancet concluded that hydroxychloroquine is ineffective for treating COVID-19, but it too was ultimately discredited. The study relied on a database from more than a thousand hospitals curated by the company Surgisphere. Some of its data appeared impossible: the number of COVID-19-related deaths in Australia, for instance, exceeded the official recorded figure. The Lancet retracted the paper, with its editor condemning it as “a shocking example of research misconduct in the middle of a global health emergency.”

Studies on ivermectin, another drug touted as a treatment for COVID-19, fared no better. Several trials showed irregularities or impossible numbers and have been withdrawn. A Cochrane review excluded half of all trials (7 out of 14) “as these trials did not fulfill the expected ethical and scientific criteria.”

Harmful errors

Besides wasting research time and money, misconduct in medical research has also caused real harm. Take the example of German anesthetist Joachim Boldt. He published multiple papers on hydroxyethyl starch, a solution that boosts blood volume in critically ill patients undergoing surgery. Boldt was a proponent of hydroxyethyl starch, and his studies showed lower mortality in treated patients than in controls. However, most of his studies have been retracted because of research misconduct, including lack of ethical approval and fabrication of data. A review in JAMA found that, with Boldt’s studies excluded, starch infusions were associated with renal failure and increased risk of death.

Another prominent case is that of Dutch cardiologist Don Poldermans. His research suggested that administering beta blockers before surgery could reduce complications such as heart attacks and strokes. Based on his findings, some medical guidelines recommended this approach. However, in 2012 an inquiry by his former employer, Erasmus Medical Center, concluded that Poldermans had used fictitious data. A 2014 review that excluded his studies found that beta blockers led to a 27% increase in mortality and urged a rapid update to clinical guidelines. It is unclear how much damage the false data caused, but some estimates go as high as hundreds of thousands of deaths.

Let’s look at a third, more recent, example. A 2018 Cochrane review concluded that giving steroids before a cesarean section would improve breathing in premature babies. These findings found their way into several guidelines. The data on late pregnancy, however, were based on only one British and three Egyptian studies. The latter were riddled with statistical issues and unrealistic data. The largest trial, on more than 1000 pregnant women, reported a proportion of female babies of nearly 60%, a figure the authors could not explain (it should be close to 50%; see the quick calculation below). The journal retracted the paper with the following warning:

“The editorial board has raised concerns regarding the integrity of this paper and remaining in the public domain. It is by far the largest trial of corticosteroids for this indication, overwhelming other similar trials in Cochrane and other systematic reviews and is likely leading to widespread prescription of this drug, which may have serious side effects on fetal brain development. If the data is unreliable, women and babies are being harmed.”
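To get a sense of how implausible that sex ratio is, a quick binomial check suffices. This is only a rough sketch: the exact trial numbers are not reproduced here, so 1,000 births with 600 girls and an expected proportion of girls of roughly 0.49 are illustrative assumptions.

```python
from scipy.stats import binomtest

# Illustrative assumption: ~1,000 births, 600 of them girls (about 60%),
# compared against an expected proportion of girls of roughly 0.49.
result = binomtest(k=600, n=1000, p=0.49, alternative="two-sided")
print(result.pvalue)  # far below 0.001: such an imbalance is essentially impossible by chance
```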

In 2021, Cochrane introduced a new policy that recommended identifying ‘problematic’ trials and excluding them from reviews. An update of the 2018 review, now excluding the Egyptian trials, concluded that the evidence was insufficient to draw firm conclusions.

Paper mills

Sometimes research misconduct is detected because study results look too good to be true. They stand out by their enormous effect sizes or lack of treatment complications. Critics noted, for example, that the results of Japanese researcher Yoshitaka Fujii looked ‘incredibly nice’ a decade before he would end up second on the Retraction Watch Leaderboard with more than 170 retracted papers. The data of social psychologist Diederik Stapel fitted his hypotheses so perfectly that during one research meeting someone jokingly remarked: “It is as if he made up these data himself.” This was unfortunately not far from the truth. Stapel now ranks eighth on the Retraction Watch Leaderboard. The New York Times described him as a “con man” who “perpetrated an audacious academic fraud by making up studies that told the world what it wanted to hear about human nature.”

Other times, false papers try not to stand out. The goal is not to influence policy or gain attention or prestige but simply to get a quick and easy publication, which helps doctors and researchers get promoted or earn their PhD. Shady companies play into this demand by writing research articles and offering paying customers the chance to be named as an author. It is estimated that these so-called ‘paper mills’ have inserted thousands of problematic papers into the literature. The Hindawi journals, acquired by the academic publisher Wiley, were severely affected by paper mills: in 2023 alone, over 8,000 of their publications were retracted.

Image duplication

How is falsification detected? In the case of paper mills, there can be clear warning signs such as author affiliations that do not match the article content or the addition of new authors just before publication. Other forms of research misconduct are harder to notice. In the past ten years, however, a group of data sleuths has focused on detecting errors in scientific papers. Frustrated with the lack of awareness of fraud in their field, they developed new ways to spot falsified data.

One of the most effective ways to detect research misconduct is by identifying duplicated or manipulated images. Many high-profile retractions mentioned earlier were uncovered when sleuths posted examples of doctored images on PubPeer (a website where you can post anonymous comments on scientific papers). By carefully examining figures, they found instances where parts of an image were duplicated in a different experiment or where a background was subtly photoshopped into another image. These can provide strong evidence that research misconduct took place, but it takes quite a lot of time and skill to notice them. Automated tools such as Imagetwin can help, but it still takes a lot of work from researchers who rarely get paid to clean up the literature.
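Tools like Imagetwin are proprietary, but the basic idea of flagging near-identical figure panels can be sketched with perceptual hashing. This is only an illustration of the concept, not how Imagetwin or the sleuths actually work; it assumes the third-party Pillow and imagehash packages, and the file names are placeholders.

```python
from PIL import Image
import imagehash

# Compute a perceptual hash for each figure panel and flag pairs that are nearly
# identical; a small Hamming distance between hashes suggests possible reuse.
panels = ["fig1_panel_a.png", "fig2_panel_b.png", "fig3_panel_c.png"]  # placeholder files
hashes = {path: imagehash.phash(Image.open(path)) for path in panels}

for i, first in enumerate(panels):
    for second in panels[i + 1:]:
        if hashes[first] - hashes[second] <= 5:  # hash difference = Hamming distance
            print(f"Possible duplication: {first} vs {second}")
```

Like manual screening, a crude check of this kind only catches blatant reuse; cropped, rotated, or retouched duplicates require more sophisticated methods and a trained eye.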

The Dutch microbiologist Elisabeth Bik is a pioneer and expert in discovering tampered images. In one study, her team screened 20,621 papers published in 40 scientific journals and found problematic images in 3.8% with at least half exhibiting features suggestive of deliberate manipulation. Another research team screened 1035 preclinical studies of depression and found that 19% had problematic images. In most cases, images had been altered or recycled in a way that suggested foul play. The authors noted: “… most problematic studies (and among them, we speculate, the fraudulent) are mundane. They do not make waves; they agree with the general consensus within the field.”  They also noted that only incompetently altered images could be detected: “… a skilled Photoshop user could easily fool us.”

Carlisle’s method

Datasets can also reveal signs of duplication, such as the same numbers appearing in different experiments or columns in Excel sheets that have been copied and pasted. In the case of Fujii, readers noted that the frequency of headaches was identical between groups in 13 different articles.
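When the full table is available, even a few lines of code can flag exact copies. Below is a toy sketch using pandas; the file name is a placeholder, and real cases are rarely this obvious.

```python
import pandas as pd

# Flag columns that are exact duplicates of one another, a pattern that has
# exposed copy-pasted experimental data in the past.
df = pd.read_excel("reported_data.xlsx")  # placeholder path
duplicate_columns = df.T[df.T.duplicated(keep=False)].index.tolist()
print("Identical columns:", duplicate_columns)
```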

If fraud detectives have access to the original Excel file, they can also inspect its hidden metadata: an Excel document is simply a ZIP archive, and unzipping it reveals an xl/calcChain.xml file that records the order in which cells were last calculated, which can betray how a sheet was edited. In the case of Harvard Business School professor Francesca Gino, researchers could see that some numbers had switched positions in the Excel dataset and that this manipulation produced the effect reported in the study.
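To see this metadata for yourself, you only need to treat the spreadsheet as the ZIP archive it is. A minimal sketch; ‘dataset.xlsx’ is a placeholder file name, and calcChain.xml is only present when the workbook contains formulas.

```python
import zipfile

# An .xlsx file is a ZIP archive. Among its parts, xl/calcChain.xml records the
# order in which Excel last calculated the cells, which can hint at how rows
# were edited or moved around.
with zipfile.ZipFile("dataset.xlsx") as xlsx:  # placeholder path
    print(xlsx.namelist())  # look for 'xl/calcChain.xml'
    with xlsx.open("xl/calcChain.xml") as chain:
        print(chain.read(500).decode("utf-8"))  # peek at the start of the calculation chain
```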

More often, researchers refuse to share their dataset so fraud detectives can only work with summary statistics. Fortunately, generating truly random data is rather difficult, which has allowed researchers to develop various techniques for detecting suspicious patterns.

Anesthetist John Carlisle, for example, developed a method using baseline data from randomized trials. These are measurements taken in different groups before they receive an intervention and are usually reported in the first table of a paper. Because participants are randomly divided into groups, these measurements should be roughly similar: not exactly the same (due to random variance) but not too different either. For each continuous baseline measurement, Carlisle calculates group differences and a p-value to get an indication of how unusual the difference is. He then combines these p-values to get an overall estimate of how likely it is that both groups were created randomly from the same population. A combined p-value too close to 0 indicates that the baseline values are remarkably different while a value close to 1 suggests they were suspiciously well-balanced.
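To make this concrete, here is a simplified sketch in the spirit of Carlisle’s approach, not his exact procedure: compute a p-value for each continuous baseline variable from the reported summary statistics, then combine them with Stouffer’s method. The baseline numbers below are made up for illustration.

```python
from scipy import stats

# Hypothetical baseline table from a two-arm trial: (mean, SD, n) per group.
baseline = {
    "age":        ((54.1, 8.2, 150), (54.0, 8.3, 150)),
    "weight":     ((78.5, 11.9, 150), (78.6, 12.0, 150)),
    "heart rate": ((72.3, 9.1, 150), (72.2, 9.0, 150)),
}

p_values = []
for variable, (group1, group2) in baseline.items():
    # Two-sample t-test reconstructed from summary statistics alone.
    _, p = stats.ttest_ind_from_stats(*group1, *group2)
    p_values.append(p)
    print(f"{variable}: p = {p:.3f}")

# Combine the per-variable p-values. A combined value very close to 0 (groups too
# different) or very close to 1 (groups suspiciously similar) is a reason to look closer.
_, combined_p = stats.combine_pvalues(p_values, method="stouffer")
print(f"combined p = {combined_p:.3f}")
```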

Carlisle’s method is far from perfect (it does not account for dependence among baseline variables) but it can be useful as a screening tool to pick up on suspicious papers. It helped, for example, to detect problems with a major trial of the Mediterranean diet published in the New England Journal of Medicine.

GRIM and SPRITE: impossible values

Some summary data are highly unlikely. The DECREASE I trial of beta blockers by Poldermans and colleagues forms an illustrative example. In this study, participants’ heart rate was between 62 and 80 beats per minute with a standard deviation of 9.3. It turns out that it is nearly impossible for the standard deviation to be this large and yet the range so narrow. This was one of the reasons a review considered its results “unsound”. Data sleuths Nick Brown and James Heathers have developed software called SPRITE which helps to reconstruct possible datasets from summary statistics. Sometimes these can point to impossible combinations. For example, the maximum value a standard deviation can take is just over* half the possible range on a scale (i.e. when half the values are at the minimum and half are at the maximum).
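As a small illustration of that starred point (a sketch of the arithmetic, not the SPRITE tool itself): on a scale running from 1 to 7, the largest possible sample standard deviation occurs when half the values sit at each extreme, and it ends up only slightly above half the range because the sample formula divides by n - 1.

```python
import statistics

scale_min, scale_max, n = 1, 7, 20

# Most extreme possible sample: half the values at the minimum, half at the maximum.
extreme_sample = [scale_min] * (n // 2) + [scale_max] * (n // 2)

max_sd = statistics.stdev(extreme_sample)  # sample SD, divides by n - 1
half_range = (scale_max - scale_min) / 2

print(max_sd)      # ~3.08, just above half the range
print(half_range)  # 3.0
```

Any reported standard deviation above this ceiling, given the scale and sample size, is simply impossible.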

When working with integers, averages are limited to certain values. Suppose that researchers measure happiness in 30 participants on a 7-point Likert scale with scores going from 1 (very sad) to 7 (very happy). The smallest possible change in the mean occurs when a single participant changes their score by one point, so the average can only move in increments of 1/30. If the paper reports an average of 3.51, we know that this must be an error: a total score of 105 points gives an average of 3.50, a total of 106 gives 3.53 (after rounding), and nothing in between is possible. Brown and Heathers have developed a tool called GRIM (granularity-related inconsistency of means) to test this. They applied it to 71 publications and found that 12 had multiple inconsistent means, with the authors refusing to share their data for clarification.
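A minimal version of the GRIM check is easy to write yourself. This is a sketch of the idea, not the authors’ own tool, and it ignores some subtleties around rounding conventions.

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Can `reported_mean` (rounded to `decimals`) arise from n integer scores?"""
    nearest_total = round(reported_mean * n)           # closest achievable sum of integer scores
    implied_mean = round(nearest_total / n, decimals)  # mean that this total would produce
    return implied_mean == round(reported_mean, decimals)

print(grim_consistent(3.51, 30))  # False: no sum of 30 whole-number scores gives a mean of 3.51
print(grim_consistent(3.53, 30))  # True: a total of 106 gives 3.533..., reported as 3.53
```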

Tortured phrases

Other data detectives have developed fully automatic tools to scan papers for inconsistencies. Statcheck, developed by Michèle Nuijten, for example, recalculates the p-values of reported test statistics. Papers usually report their results like this: “…the difference between groups was statistically significant (t(28) = 2.2, p < .05)”. Statcheck recalculates the p-value from the test statistic and degrees of freedom between brackets and checks whether the values match. It works more like a spellchecker for statistics than a fraud detection tool: the most common error is that researchers wrongly transcribe statistical results into their papers. Statcheck also only works for statistics reported in the format of the American Psychological Association (APA), while medical journals often use different formats.
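The recalculation itself is straightforward. Here is a minimal sketch of the idea behind statcheck (not the tool itself), applied to the example above.

```python
from scipy import stats

# Reported result: t(28) = 2.2, p < .05 (two-sided).
t_value, df = 2.2, 28
recomputed_p = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value from the t statistic
print(round(recomputed_p, 3))                    # ~0.036, consistent with the reported p < .05
```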

In genetics research, there is a tool called ‘Seek & Blast’ developed by cancer researcher Jennifer Byrne. It focuses on nucleotide-sequence reagents, little pieces of DNA or RNA that bind to specific parts of natural genetic material. Seek & Blast automatically checks whether researchers used the correct reagent for the genetic target the paper claims to study. In a screening of more than 10,000 papers, Seek & Blast found discrepancies in 6.1%. Because many papers were making the same errors, Byrne suspects this may point to the work of paper mills.

We’ve saved the funniest tool for last: the tortured phrases detector. Sometimes researchers copy and paste text from other academic papers. To avoid accusations of plagiarism, they use tools that automatically rewrite the text. But this doesn’t always go well: sometimes it results in phrases that sound weird and no longer make sense in their context. ‘Artificial intelligence’ becomes ‘counterfeit consciousness’ while ‘deep neural network’ is changed into ‘profound neural organization.’ Nonsensical terms like these suggest a paper has been produced by a paper mill. Guillaume Cabanac and colleagues have pioneered the detection of tortured phrases in computer science, but others have applied it in medicine and found some hilarious examples: ‘anal canal’ became ‘butt-centric waterway’ while ‘breast cancer’ was rephrased as ‘bosom peril’. A database of tortured phrases can be found here.
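The detection itself boils down to matching text against a curated list of known ‘fingerprints’. Here is a toy version; the real detectors use a far larger list and smarter matching.

```python
# Known awkward paraphrases mapped to the standard terms they replace.
TORTURED_PHRASES = {
    "counterfeit consciousness": "artificial intelligence",
    "profound neural organization": "deep neural network",
    "bosom peril": "breast cancer",
    "butt-centric waterway": "anal canal",
}

def flag_tortured_phrases(text: str) -> list[str]:
    """Return any known tortured phrases found in the text."""
    lowered = text.lower()
    return [f"'{phrase}' (likely '{standard}')"
            for phrase, standard in TORTURED_PHRASES.items()
            if phrase in lowered]

print(flag_tortured_phrases("We trained a profound neural organization on patient data."))
```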

Lacking ethical approval

Many of the tools above produce red flags rather than a smoking gun. The errors they pick up can be accidental and benign; different ways of rounding figures might explain some discrepancies. Because these tools require statistical expertise to be used and interpreted correctly, we do not suggest that everyone should have a go at them.

When several red flags are raised, integrity researchers might look at other papers by the same authors to see if they contain duplications or inconsistencies. This allows them to present a stronger case to journals and universities and request an inquiry. An official investigation often reveals further problems such as lack of ethical approval for studies.

This was the case with the Bezwoda study on bone marrow transplants for cancer patients and with several microbiology studies by the French group that promoted hydroxychloroquine as a treatment for COVID-19.

Sometimes researchers are unable to explain how their data were collected. Stapel, for example, claimed that he had collected data at schools, but when the rector sought to contact these schools, he admitted that they did not exist. Another prominent example is the car insurance study by behavioral economist Dan Ariely. When integrity researchers spotted that the data from this study were likely fabricated**, Ariely was unable to clarify how they were collected.

Heathers’ meta-analysis

Last year, a review by data sleuth James Heathers put all of this information together to argue that approximately 1 in 7 scientific papers is fake. The review is based on several studies that scanned a large part of the literature for signs of fabrication and falsification using the methods explained above. The studies focus mostly on the psychological and biomedical sciences. Heathers found 16 estimates in total, with a median of 14%. Although this is but a rough guess, and Heathers admits his review is ‘wildly nonsystematic’, it does suggest that research misconduct may be more common than previously thought.

Science integrity funds

Several initiatives are being taken to combat fraud in the scientific literature. Researchers are, for example, developing checklists for systematic reviews to determine if a paper is trustworthy or not (examples here, here, and here). The biggest project is the INSPECT-SR tool which is led by Jack Wilkinson, a statistical editor for Cochrane.

Others are providing funding for data sleuths and post-publication reviews. Retraction Watch, for example, announced a ‘Sleuth in Residence Program’ that offers a secure and paid position for an active sleuth with a proven track record. Similarly, Elisabeth Bik used the proceeds of her Einstein Award to create a Science Integrity Fund. It will provide funding for sleuths as well as training programs, grants, and awards for science integrity advocates. A project that is already underway is ‘ERROR: A Bug Bounty Program for Science’. Led by Dr. Ian Hussey and modeled after bug bounty programs in the tech industry, it aims to systematically detect and report errors in scientific publications.

These are positive developments but there are also some worrying signs. One of the saddest statistics we’ve read on this topic is that many retracted papers keep getting cited as if there were no problems with them at all. They are often called ‘zombie papers’ because their influence on the literature just won’t die.  

The ME/CFS literature

There’s just one question left to answer in this blog post: does the ME/CFS literature also suffer from paper mills and fraudulent studies? Except for the XMRV papers, very few ME/CFS studies have thus far been retracted. Our experience is that ME/CFS studies are often unreliable because of flawed designs or methodological weaknesses rather than falsified data. Perhaps when results can easily be skewed to fit a desired outcome, there is less need to fabricate data outright. It could be, however, that some problematic papers have remained undetected because not enough experts took notice. Hopefully, time will tell how much of the current literature is trustworthy.

Notes

* A standard deviation can be slightly higher than half of the range because the formula for the standard deviation of a sample divides by n-1 instead of n.

** The data were not distributed as you would expect. As Prof. Uri Simonsohn explains: “Normally, you’d expect most people to drive a medium amount – say, like, 14,000 miles a year – and way fewer people to drive very little or a lot. But in Ariely’s data, the same number of people drove around 1,000 miles as 10,000, as 50,000.”
