Problems with the MetaBLIND study

The MetaBLIND study is likely the largest study on the effect of blinding in randomized trials to date. Contrary to expectations, the study did not find a relationship between exaggerated treatment effects and lack of blinding of patients, healthcare providers, or observers. I contacted the authors to obtain the dataset of one of the study’s most important analyses, namely the impact of blinding trial participants on patient-reported outcomes. After screening the blinded and unblinded trials that were compared with each other, it became clear that the MetaBLIND study suffers from serious flaws. Some of the analyses had little relevance to medical trials, others included trials that were wrongly labeled as blinded, and in most cases trials were simply too different for a meaningful comparison.

Introduction: what is the MetaBLIND study?

The MetaBLIND study was published in 2020 in the BMJ by Moustgaard and colleagues. The aim was to see if trials where patients, therapists, or outcome observers are not blinded have exaggerated treatment effects. Cochrane reviews from the years 2013-2014 were screened for meta-analyses that included both trials that were blinded and trials that were not. The authors then compared both groups to see if trials that were unblinded were systematically associated with larger treatment effects. In other words, they checked whether the effect in favor of the intervention was bigger in the unblinded than in the blinded trials, which would support the view that the former are at higher risk of bias. The results were expressed as ratios of odds ratios, coded so that a figure lower than 1 indicates a bigger treatment effect in the unblinded group.
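To make this measure concrete, here is a minimal sketch with made-up numbers (the function and the 2x2 figures are illustrative assumptions, not data or code from MetaBLIND). It skips pooling and confidence intervals and only shows how the direction of a ratio of odds ratios is read:

```python
# Illustrative sketch with made-up numbers (not data from MetaBLIND):
# a ratio of odds ratios (ROR) compares the pooled effect in unblinded
# trials with the pooled effect in blinded trials within one meta-analysis.

def odds_ratio(events_tx, n_tx, events_ctrl, n_ctrl):
    """Odds ratio for a single 2x2 trial result."""
    odds_tx = events_tx / (n_tx - events_tx)
    odds_ctrl = events_ctrl / (n_ctrl - events_ctrl)
    return odds_tx / odds_ctrl

# Hypothetical results, with benefit coded as OR < 1 (fewer bad events):
or_blinded = odds_ratio(20, 100, 30, 100)    # OR ≈ 0.58
or_unblinded = odds_ratio(15, 100, 30, 100)  # OR ≈ 0.41, a larger apparent benefit

# With this coding, ROR = OR_unblinded / OR_blinded, so ROR < 1 means
# the unblinded trials show the bigger treatment effect.
ror = or_unblinded / or_blinded
print(round(ror, 2))  # 0.71
```

A real meta-epidemiological analysis pools odds ratios across trials and meta-analyses with random-effects models; this sketch only illustrates how the direction of the coding works.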

Contrary to expectations, the MetaBLIND study could not find a relationship between exaggerated treatment effects and lack of blinding of patients, therapists, or outcome observers. Although Moustgaard and colleagues remained cautious in their conclusion, their article was accompanied by BMJ commentaries with titles such as “Blindsided: challenging the dogma of masking in clinical trials” and “Fool’s gold? Why blinded trials are not always best”. These suggested that blinding is not as important as previously thought. A research group from the University of Exeter commented that the results of the MetaBLIND study “challenge the status quo about the importance of blinding.” “There is cautious optimism among complex intervention trialists”, they write, “that there may be a real chance, for the first time, to amend the risk of bias criteria regarding blinding.”

A closer look at analysis Ia

The MetaBLIND study conducted multiple analyses to test the effect of blinding patients, therapists, and outcome assessors using different outcome measures. For simplicity, I only looked at and asked for the data of analysis Ia: “Blinding of patients in trials with patient reported outcomes (considering a combination of detection bias and performance bias).” For this analysis, data from 132 trials and 18 Cochrane reviews were used. Moustgaard et al. report a ratio of odds ratios of 0.91 (95% confidence interval 0.61 to 1.34) and conclude: “The implication seems to be that either blinding is less important (on average) than often believed, that the meta-epidemiological approach is less reliable, or that our findings can, to some extent, be explained by lack of precision.” In the rest of this blog post, I will focus on this comparison only. When I refer to trials being rated as ‘blinded’ or ‘not blinded’, I am referring to the blinding of trial participants, as assessed for patient-reported outcomes.

The first thing I noticed is that most (76.6%) of the 132 trials were unblinded rather than blinded. This raises the question: if blinding is indeed possible, why did most of the studies fail to apply it after taking the time and effort to set up a randomized trial? This would only make sense if the trials came from a field where blinding is particularly difficult to arrange, such as surgery. As we will see, very few of the included trials were of that nature.

Data that isn’t relevant

Most trials (68/132) came from two Cochrane reviews, namely:

  • Interventions to promote informed consent for patients undergoing surgical and other invasive healthcare procedures (CD009445)
  • Decision aids for people facing health treatment or screening decisions (CD001431)

Both are problematic because they don’t test the effectiveness of an intervention to treat a medical condition but the knowledge recall of participants who were given different types of information sheets.

Only 3 of the 68 trials from these reviews were rated as blinded, meaning that both comparisons depend on these 3 trials. A brief look at them immediately shows why they found a treatment effect in favor of the intervention. The intervention consisted of an information sheet that was more relevant to the outcome measure than the standard leaflet (the control). The authors of the only blinded trial from CD009445 noted, for example:

“The ‘standard’ leaflet had been constructed without reference to this study, by consensus between a large number of specialist anaesthetists. In contrast, the knowledge questionnaires and the ‘full’ information sheet were designed by the investigators to address information thought important. It turned out that the information in the ‘standard’ leaflet (unlike the ‘full’ leaflet) did not actually cover all the issues addressed by the knowledge questionnaire.”

It’s a bit like giving one group the answers to a test and the other group a general syllabus about the topic and then finding that the first group scored better.

The same is true for the two blinded trials from CD001431. In the study on colorectal cancer screening by Steckelberg et al., the control group received a standard information leaflet where “no quantitative information on individual risk or benefit is included, and harm is incompletely communicated.” These were precisely the things the knowledge questionnaire asked about, so it is no wonder that the intervention group, whose leaflet did contain this information, scored better. Similarly, in the study on the risks of cesarean section by Shorten et al., knowledge was assessed using a 15-item questionnaire that was developed and piloted for the study based on key risk and benefit information contained in the decision aid. The control group participants received no intervention, only routine pregnancy care.

In other words, there was an obvious explanation why these studies found that the intervention outperformed the control group but it had little relevance to clinical trials. Knowledge scores weren’t seen as the main outcome measure in these trials. They were merely a first step to check if patients had picked up on the information provided in the decision aids the authors designed. What the researchers wanted to know is whether the decision aids could change the participants’ medical decisions and behavior. They found that they didn’t. While the intervention group scored better on knowledge recall, this did not change the uptake of colorectal cancer screening or cesarean section. In my view, it is inappropriate to extract the data on knowledge recall from these studies to measure the effects of blinding in clinical trials. Both meta-analyses should have been excluded.

The same is true for the data from a third Cochrane review: “Strategies for partner notification for sexually transmitted infections” (CD002843). There were four trials in this comparison but only the one by Ostergaard et al. was blinded. I had a closer look at it and it appears that the data extracted – the number of sexual partners notified – wasn’t relevant to the intervention: it wasn’t an outcome measure.

Ostergaard and colleagues investigated two strategies – home sampling versus office sampling – to encourage sex partners of index patients (people with chlamydia) to get tested for chlamydia as well. With home sampling, the sex partners could send their sample directly to the lab, while office sampling required them to send it to a healthcare provider in an office. Given that index patients were blinded to the sampling method and simply had to pass on the test kit to their previous sex partners, the number of partners they notified did not depend on the intervention. In other words, it was not an outcome measure but information about the set-up of the study, much like a response rate. The authors of the Cochrane review calculated this data from flow diagrams and I suspect Moustgaard et al. extracted it automatically without noticing that it could not be used to assess the effect of blinding. In the study by Ostergaard et al., there was no difference in the number of partners notified between home sampling and office sampling. I think this data should also have been excluded from the MetaBLIND study as it has no relevance to the impact of blinding in clinical trials.

The blinding status: some trials seem to have been wrongly labeled

Another issue is the blinding status. Moustgaard et al. estimated whether participants of trials were blinded based on the trial reports and contact with the author team where possible. The blinding of trial participants was rated as ‘definitely yes’, ‘probably yes’, ‘unclear’, ‘probably no’, or ‘definitely no’. The first two categories were compared to the last three to estimate the impact of blinding. In other words, trials that were rated as ‘unclear’ were included in the unblinded group.

There are several meta-analyses where the ambiguous blinding status of one trial determined the whole comparison. Take for example the review on “Music for stress and anxiety reduction in coronary heart disease patients” (CD006577). This comparison depends on the study by Karin Schou as it was the only one rated as blinded. The study was an unpublished Ph.D. thesis with data from only 17 participants. A description of the intervention and control group is given below. Given the difference in contact time and interaction with therapists, it is highly unlikely that this trial was blinded.

“Group A (GRM): a music therapy treatment, consisting of a receptive music therapy method Guided Relaxation with Music, (Music therapy and medicine) in which the participants received personalised individual sessions of guided relaxation with music with a trained music therapy research team member (RTM) in the role of guiding a body relaxation based on the patient’s preferred style of music for relaxation. The music was selected by the participant when offered four different styles of music.

Group C (NM): a No Music control condition, consisting of scheduled rest without musical or verbal intervention. The participants received individual sessions of scheduled rest with no music, and rested on their own while the RTM stayed in another room in the unit. The nursing staff team monitored the patient from their workstation, and were able to respond to any electronic alarm made by the participant during this period. The participants were advised how to call for help if they needed it. During the course of the first year of data collection the participants rested on their own. Due to an increase of anxiety and discomfort in two cases, for the second year until the conclusion of the data collection the RTM stayed in the room in the NM group in a role similar to that of the RTM in the ML group.”

The Cochrane review seems to agree that patients in the trial by Schou were not blinded, as it comments on this study that “music therapist and participants could not be blinded given the interactive nature of the music therapy session.” Moustgaard et al. likely made a mistake in classing the blinding status of this study as ‘definitely yes’. Maybe they misread the dissertation, or perhaps they based their choice on the review’s risk of bias assessment. The Cochrane review confusingly rated the trial by Schou at low risk of performance bias while explaining that “since participants cannot be blinded in a music intervention trial, we did not downgrade studies for not blinding the participants.”

There are other examples where the blinding status of one trial determines the outcome of the whole comparison. In a meta-analysis on “valproate for the prophylaxis of episodic migraine in adults” (CD010611), three studies were rated as probably blinded while the trial by Kaniecki and colleagues was rated as ‘unclear’. Because trials with unclear blinding were grouped with unblinded trials, this comparison depends on whether participants in the trial by Kaniecki and colleagues were in fact unblinded. That is problematic because the report by Kaniecki and colleagues says that patients were blinded. The authors of the Cochrane review correctly note that blinding could have been broken by the knowledge of therapists and differences in the appearance or taste of the intervention versus the placebo, but this remains uncertain. The fact that all four studies in this comparison report that patients were blinded makes them a problematic choice for measuring the effect of blinding trial participants.

Something similar is true for the data from a Cochrane review on “Megestrol acetate for treatment of anorexia-cachexia syndrome” (CD004310). Three trials were rated as probably blinded while the blinding status of two studies by Schmoll and colleagues was rated as ‘unclear’. Although blinding is not explicitly described in these studies, the fact that they used a placebo-controlled design suggests that participants may have been blinded to treatment allocation. This comparison, therefore, lacks a trial where participants were clearly not blinded.

Sometimes the opposite happened: some comparisons lack studies that were clearly blinded. The review “Pharmacological interventions for pruritus in adult palliative care patients” (CD008320), for example, had two trials that were described and rated as blinded. There was however a high probability that blinding in both trials was broken because the intervention, the antibiotic rifampicin, gave the patient’s urine a red color. One of these trials noted that “as rifampin discolors urine, there was a potential for determining the treatment being given, and eliminating the double blind nature of this study.” The other trial also cautioned that “most patients taking rifampin develop a red-orange coloration of the urine, thus allowing them to identify the experimental period, and eliminating the double-blind nature of this study.” This means that these trials could be an inappropriate choice to measure the effects of blinding trial participants.

Many trials were too different for a meaningful comparison

A third major problem that shows up when reading these trials is that many are too different for a meaningful comparison. Often there are significant differences in the intervention and control arm, the length of treatment, the inclusion criteria, or the outcome measure used across trials. There are so many factors that could explain differences in treatment effects that it is nearly impossible to distill the impact of blinding.

Take, for example, the review “Antibiotics for preventing complications in children with measles” (CD001477). Only two trials were extracted and compared. The unblinded trial by Karelitz and colleagues was conducted on children in New York in the 1950s. The blinded trial by Garly included patients from a measles outbreak in Guinea in the 1990s. Such large differences make it rather difficult to isolate the effect of blinding.

Another example is the review on treating pruritus mentioned above (CD008320). In contrast to the two blinded trials, the unblinded trial by Bachs and colleagues did not compare rifampicin to a placebo in a cross-over design but to another drug called phenobarbitone in a parallel design. This might have reduced the efficacy of rifampicin because, at the time, phenobarbitone was a common treatment for pruritus. For a relevant comparison, the unblinded trial by Bachs and colleagues should have compared the intervention to a placebo as well.

Two studies were extracted from the review on “Antidepressants for smoking cessation” (CD000031) even though the patient samples differed greatly. The blinded trial by Planer and colleagues compared bupropion against placebo in smokers hospitalized with acute coronary syndrome, while the unblinded trial by Wittchen et al. compared cognitive behavior therapy (CBT) versus CBT + bupropion in regular smokers seen in primary care.

Another analysis used four studies from a review on “Anticonvulsants for alcohol dependence” (CD008544). The studies, however, used different anticonvulsants as the intervention: two used topiramate, one used pregabalin, and one oxcarbazepine. A similar problem appeared in a review on the treatment of reflux disease (CD002095). Of the four included trials, three used omeprazole as the intervention, and one used pantoprazole. All four trials used different drugs in the control group, namely nizatidine, ranitidine, famotidine, and cimetidine. Another analysis included three studies from a review on “Hormone therapy for sexual function in perimenopausal and postmenopausal women” (CD009672). Again, the intervention differed in each of the three included trials namely, estradiol and dienogest, oestrogen and medroxyprogesterone, or 17-β-estradiol and norethisterone acetate.

Too little data, too few trials

And so it goes on and on. Because data were extracted randomly, Moustgaard et al. may have hoped that these differences would balance out as long as one has enough trials. But as we have seen, most of the 132 trials came from two reviews whose outcomes were not relevant to measuring the effects of blinding participants in clinical trials. In 17 out of 18 comparisons, the number of trials in either the blinded or unblinded group was 2 or fewer. In other words, each of these comparisons depended on 1 or 2 trials: if these had some peculiarities or an ambiguous blinding status, this could have distorted the entire result. This fact alone indicates that the MetaBLIND study lacked the data to estimate the effect of blinding reliably.

Only the review “Antibiotics for sore throat” (CD000023) provided more than 2 trials on each side of the comparison: 6 trials were rated as unblinded or unclear while 9 trials were rated as blinded. Again, there were major differences between trials. While some recruited soldiers in the army, others focused on children or the general population presenting with a sore throat in general practice. Some trials included only patients who were positive for Group A streptococci, while others also included patients who might have had a viral infection (where antibiotics are of little use). Some trials tested intramuscular penicillin; others used oral sulphonamide, erythromycin, aureomycin, terramycin, chlortetracycline, or phenoxymethylpenicillin. 5 of the 6 studies in the unblinded/unclear group were published in the 1950s. In short, despite having more trials, this analysis may not provide a reliable comparison either.

It seems that the main idea behind the study design of MetaBLIND was that small differences between trials would be balanced out by a large dataset. Instead, the number of trials was small and the differences between them were large.

What is the intervention and what is the control?

There are also more subtle issues such as what one considers to be the intervention and what the control condition. An important reason why trials should be blinded is that expectancy effects might influence how patients rate their symptoms. If they know they are in the ‘intervention group’ they might be more optimistic about their health than when they know they are only receiving the ‘control condition’.

If an open-label trial however uses multiple treatment arms this might reduce expectancy effects for each of them. Therefore the results might not be directly comparable with a blinded trial that compared one drug against a control. This was, for example, the case in a comparison of 2 studies taken from a review on “Clonidine premedication for postoperative analgesia in children” (CD009633). The unblinded study by Schmidt and colleagues had a third arm with the alpha-2 agonist dexmedetomidine which might have reduced expectancy effects for clonidine.

Another problem is evident in a comparison of two trials from the review “Paracervical local anaesthesia for cervical dilatation and uterine intervention” (CD005056). The blinded study by Mankowski and colleagues was rated as high quality. It found no difference between paracervical block and intracervical block. In contrast, the smaller and low-quality trial by Yacazi et al. found an enormous difference (a standardized mean difference of 2.08) in favor of paracervical block. Moustgaard and colleagues viewed intracervical block as the intervention and paracervical block as the control condition. Therefore this comparison was given a ratio of odds ratios of 36.89, which speaks strongly against an effect of blinding. This is questionable because the trial by Yacazi et al. also included a placebo group. The study compared two forms of anesthesia with a placebo and a fourth group in which both interventions were combined. To participants, it might not have been clear that intracervical block was the intervention and paracervical block the control condition. The fact that the unblinded trial found a much stronger effect than the blinded trial could also be seen as supporting rather than contradicting the importance of blinding.
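The sensitivity of this result to the labeling of arms can be sketched with made-up numbers (these are hypothetical odds ratios, not the actual CD005056 data): swapping which arm counts as the intervention inverts each odds ratio, and therefore inverts the ratio of odds ratios as well.

```python
# Minimal sketch with hypothetical odds ratios (not the actual CD005056 data):
# swapping the 'intervention' and 'control' labels inverts each OR,
# and therefore inverts the ratio of odds ratios (ROR) too.

or_unblinded = 9.0  # hypothetical OR from the unblinded trial, arm A as intervention
or_blinded = 1.2    # hypothetical OR from the blinded trial, same labeling

ror = or_unblinded / or_blinded  # appears to speak against an effect of blinding
ror_swapped = (1 / or_unblinded) / (1 / or_blinded)  # same data, labels reversed

print(round(ror, 2), round(ror_swapped, 2))  # 7.5 0.13
# The swapped ROR is exactly the reciprocal of the original:
assert abs(ror_swapped - 1 / ror) < 1e-12
```

A large ROR under one labeling thus becomes a small one (here 1/7.5) under the other, flipping the apparent conclusion about blinding.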

In the pdf document attached, I have listed issues with each of the 18 comparisons. I would be interested in hearing your thoughts about whether they make sense.

I contacted the first author of the MetaBLIND study, Helene Moustgaard, more than a week before publishing this blog post, and she kindly sent me the following response (see pdf document below). Although interesting, I do not think it adequately addresses the points I’ve raised.

5 thoughts on “Problems with the MetaBLIND study”

  1. Simon McGrath says:

    Hi. This looks really impressive.

    I’ve looked at the reply from the authors, who say they have addressed all the issues you raise in their original paper and a subsequent one. I’ve no idea if this is right.

    I’d be interested to know if they address your point that half the studies relate to either informed consent or decisions that people face about health screening or treatment. That doesn’t seem to be addressed by their letter.

    The letter does say they address the:

    “risk of misclassification of trial blinding status, and the impact of differential vs. non-differential misclassification”

    I’d be interested to know how that is supposed to address your point. Certainly, classing cases where a trial’s blinding status is uncertain as “unblinded” seems to introduce clear bias (as opposed to, for instance, disregarding the studies with ambiguous status).

    1. ME/CFS skeptic says:

      Hi Simon,

      Thanks for your response.

      The authors referred to a previous paper they published on how to interpret findings of the MetaBLIND study. This paper addresses some limitations of the meta-epidemiological approach they used but I don’t think it addresses the main issues I’ve raised here.

      The paper is: Ten questions to consider when interpreting results of a meta-epidemiological study—the MetaBLIND study as a case.

      1. Simon McGrath says:

        Thanks. Yes, I saw that but unfortunately don’t have the energy to read either their original paper or the supplemental one they quoted.

        I am particularly interested in their justification for including apparently irrelevant studies that did not look at outcomes in medical trials. Could you summarise any points they made?


  2. eindt says:

    Thank you for looking more closely at metaBLIND. It just shows the importance of having the underlying data available on publication – the problems you’ve identified required information that was not available in the original paper.

    It is good that the requested data was shared by the authors but I find their response to the issues you raised a bit disappointing. They say:
    “we find that the issues you raise have been discussed in our paper1, and, at greater length, in the accompanying publication Ten questions to consider when interpreting results of a meta-epidemiological study: the MetaBLIND study as a case2.”

    The specific issues raised here are not discussed in those papers.

    They may note the potential problems with the approach they have taken but it would be difficult to argue that readers of their BMJ paper would not come away with a different view of their results if that paper had detailed the issues raised in this blog. Though I realise this is not generally how academia works, I would still hope that the authors would want to help ensure that readers of their paper would be more fully informed of the reasons to doubt the value of the results presented, especially given the signs that it is already being used to dismiss concerns about a lack of blinding in clinical trials by other researchers.

