Earlier this week, a new systematic review of interventions for Long Covid was published in the BMJ. It recommends cognitive behavioral therapy (CBT) and physical and mental health rehabilitation, claiming there is moderate certainty evidence that these two interventions improve symptoms of Long Covid. Unfortunately, the review has multiple issues that make its conclusion questionable. It wouldn't surprise us if a correction followed. In this blog post, we give an overview of the main problems.
No pooling of data
The review by Zeraatkar and colleagues summarizes data from 24 trials on adults with Long Covid. Eight trials tested physical activity or rehabilitation, and three focused on behavioral interventions such as CBT. The abstract states that CBT reduces fatigue and concentration problems, and that physical and mental health rehabilitation leads to more recoveries, less depression, and better quality of life.
The overview gives the impression that multiple trials found these effects, but that is not the case. All the effects mentioned are based on a single trial each: no pooling or synthesis of results took place, and for each outcome there was data from only one study. In the case of CBT, it was a Dutch trial called ReCOVer; for rehabilitation, it was the REGAIN study. The review is essentially restating the findings of these two trials. It's rather misleading that the abstract does not clarify this. Reviews normally mention the number of trials and participants for each outcome, which is what PRISMA (a guideline for reporting systematic reviews) recommends.
Recommending insignificant effects
Even though every outcome is based on a single trial, the reviewers recalculated treatment effects using summary statistics: a meta-analysis with only one trial, so to speak. This leads to estimates that differ from those in the trial publications, because the trial authors had access to the full dataset (rather than just the means and standard deviations) and could control for covariates, stratification variables, and baseline values.
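As a rough sketch of what such a recalculation involves (the function and numbers below are ours for illustration; the review's actual software and formulas may differ):

```python
import math

def unadjusted_mean_difference(m1, sd1, n1, m0, sd0, n0):
    """Mean difference and 95% CI computed from group means, standard
    deviations, and sample sizes alone. Unlike the trial's own analysis,
    this cannot adjust for baseline values or other covariates."""
    md = m1 - m0
    se = math.sqrt(sd1 ** 2 / n1 + sd0 ** 2 / n0)
    return md, (md - 1.96 * se, md + 1.96 * se)

# purely hypothetical group summaries for illustration
print(unadjusted_mean_difference(20.0, 8.0, 55, 23.5, 8.5, 55))
```

Because the standard error is built from group-level summaries only, any precision gained by covariate adjustment in the original analysis is lost.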
The review claims, for example, that REGAIN rehabilitation “probably reduces symptoms of depression” based on a mean difference on the HADS scale of -1.5 points (95% confidence interval: -2.41 to -0.59). However, the REGAIN study itself reported a smaller estimate of -0.952 (95% CI: -1.675 to -0.229). This is below the minimal important difference (MID) of 1.5 points, meaning the effect was too small to be clinically meaningful. Another example: the review claims that REGAIN “probably improves quality of life” based on an estimate of 0.04 points (95% CI: 0.00 to 0.08) on the PROMIS 29+2 Profile v2.1 questionnaire. The REGAIN trial, however, reported a difference of only 0.03 points (95% CI: 0.01 to 0.06), lower than the MID of 0.04. In other words, the review makes (rather strong) recommendations based on differences that were not clinically significant in the original trial report.
Imprecision
The examples above illustrate the absurdity of the recommendations, but the main problem of the review lies in how imprecision was handled. What matters are not the point estimates but the confidence intervals, the numbers in brackets. These indicate the range of results we might expect if the trial were repeated many times. The bigger the sample size, the narrower the confidence interval and the more precise our estimate. If a confidence interval includes values below the MID, we are less certain that the intervention is useful. In that case, GRADE guidelines recommend downgrading the certainty of evidence by one or more levels.
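A minimal sketch of this check, for outcomes where benefit is a reduction (so the clinically relevant threshold is minus the MID); this simplifies the full GRADE guidance considerably:

```python
def ci_crosses_mid(ci_low, ci_high, mid):
    """True if the MID threshold falls inside the 95% CI.
    Applies to outcomes where benefit is a reduction, so the
    threshold of interest is -MID."""
    return ci_low < -mid < ci_high

# REGAIN HADS-depression as reported by the review: -1.5 (-2.41 to -0.59), MID 1.5
print(ci_crosses_mid(-2.41, -0.59, 1.5))  # True: a candidate for downgrading
```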
The BMJ review used this GRADE approach when rating the outcomes of hyperbaric oxygen therapy and transcranial direct current stimulation. These effects were at low risk of bias, but because they were not very precise, they were downgraded for imprecision by two levels: from high certainty to low certainty of evidence.
| Intervention | Outcome (scale range, time point) | Estimate (95% CI) | MID | Does CI cross MID? | Downgraded for imprecision? |
|---|---|---|---|---|---|
| Hyperbaric oxygen therapy | BSI-18 mental health (0-72), 10 weeks | -7.1 (-12.23 to -1.97) | 6.2 | Yes | Two levels |
| Transcranial stimulation | MFIS fatigue (0-84), 5 weeks | -12.4 (-17.33 to -7.47) | 7.48 | Yes | Two levels |
| ReCOVer CBT | CIS concentration (5-35), 24 weeks | -5.2 (-7.97 to -2.43) | 3.4 | Yes | No |
| REGAIN rehabilitation | HADS depression (0-21), 52 weeks | -1.5 (-2.41 to -0.59) | 1.5 | Yes | No |
| REGAIN rehabilitation | PROPr quality of life (0.022-1), 52 weeks | 0.04 (0.00 to 0.08) | 0.04 | Yes | No |
The strange thing is that the reviewers did not apply this rule to the results of CBT and REGAIN rehabilitation highlighted in the abstract. Their confidence intervals cross the MID, suggesting they should have been downgraded for imprecision as well. As the table above shows, it's not a close call: a slightly different choice of MID would not have changed much, because the effects are small and the confidence intervals wide. Still, Zeraatkar and colleagues did not downgrade these outcomes for imprecision at all.
For some estimates, the confidence interval lies entirely beyond the MID, on the beneficial side. In that case, GRADE guidelines recommend checking whether the sample size was large enough, because big effects in initial trials are often not replicated when more data is collected. GRADE suggests determining whether enough information was provided to have confidence in the results by calculating the 'optimal information size' (OIS): the sample size needed to detect a small effect in a single trial with 80% power. For continuous outcomes, it is approximately 800 participants (400 in each arm).
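That figure can be reproduced with a standard two-sample power calculation. A minimal sketch, assuming a 'small' standardized effect of 0.2, two-sided alpha of 0.05, and 80% power (the usual conventions, not numbers taken from the review):

```python
import math

Z_ALPHA = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84   # 80% power

def ois_continuous(smd=0.2):
    """Per-arm sample size needed to detect a standardized mean
    difference of `smd` in a two-arm trial."""
    return math.ceil(2 * (Z_ALPHA + Z_BETA) ** 2 / smd ** 2)

print(ois_continuous())  # 392 per arm, roughly 800 participants in total
```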
The BMJ review applied this reasoning to the quality of life outcome of transcranial direct current stimulation. The effect was large, and the confidence interval (8.86 to 20.74) lay fully above the MID of 6.66. However, because the trial included only 70 participants, it was downgraded by two levels for 'very serious imprecision.' The CBT trial had a similar estimate and, at 114 participants, a sample size that was not much bigger. Yet this trial was not downgraded for imprecision at all.
| Intervention | Outcome (scale range, time point) | Estimate (95% CI) | MID | Sample size | Sample size below OIS? | Downgraded for imprecision? |
|---|---|---|---|---|---|---|
| Transcranial stimulation | WHO quality of life questionnaire (0-100), 5 weeks | 14.8 (8.86 to 20.74) | 6.66 | 70 | Yes | Two levels |
| ReCOVer CBT | CIS fatigue (8-56), 24 weeks | -8.4 (-13.11 to -3.69) | 3 | 114 | Yes | No |
| REGAIN rehabilitation | Recovery/improvement (per 1000 participants), 52 weeks | 161 (61 to 292) | 50 | 442 | Yes | No |
The same is true for the recovery/improvement outcome of REGAIN rehabilitation. With a baseline risk of 0.9, the OIS for a small effect would be approximately 910 participants, far more than the 442 in the REGAIN trial. So it seems this outcome should have been downgraded for imprecision as well. That is hardly surprising: these results come from a single trial with a modest sample size, so, logically, we cannot be very confident in them.
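The analogous calculation for a binary outcome depends on the effect size one assumes; the six-percentage-point difference below is our illustrative choice, not a figure from the review, but any plausibly small effect puts the OIS far above 442:

```python
import math

Z_ALPHA = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84   # 80% power

def ois_binary(p_control, p_intervention):
    """Per-arm sample size for comparing two proportions
    (normal approximation)."""
    variance = p_control * (1 - p_control) + p_intervention * (1 - p_intervention)
    delta = p_control - p_intervention
    return math.ceil((Z_ALPHA + Z_BETA) ** 2 * variance / delta ** 2)

# baseline risk 0.9, hypothetical absolute difference of 6 percentage points
print(2 * ois_binary(0.90, 0.84))  # 978 participants in total
```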
High risk of bias
The inconsistency in how imprecision was handled is quite a big deal. Downgrading by two levels (or not) has a big impact, as there are only four levels in the GRADE system: outcomes of randomized trials start at high certainty and can be downgraded to moderate, low, or very low certainty evidence. Not downgrading the CBT and REGAIN results therefore makes all the difference.
The outcomes for CBT and REGAIN were rated as having a high risk of bias. This means that they are likely distorted by weaknesses in study design. Both trials were open-label and used subjective questionnaires. Patients knew who was getting the intervention, and those in the control group did not receive the same amount of care and attention. Therefore, the trial endpoints likely reflect reporting biases and placebo effects. Zeraatkar and colleagues incorporated this risk of bias in their review by downgrading the outcomes by only one level.
This explains the recommendations of the review. The results on hyperbaric oxygen therapy and transcranial stimulation were at low risk of bias, but because they were (correctly) downgraded for imprecision, they count as only low-certainty evidence. The CBT and REGAIN outcomes were at high risk of bias, but because they were not downgraded for imprecision, they were rated as moderate-certainty evidence. In other words, the mishandling of imprecision has produced a reversal: the review now recommends high-risk-of-bias outcomes while ignoring similar estimates that are at low risk of bias.
Making recommendations based on a single high-risk-of-bias trial is rather controversial and far from best practice. Some researchers have argued that studies with a high risk of bias should be excluded from reviews because they are too likely to provide wrong answers.
There are also reasons to believe that the CBT and REGAIN outcomes should have been downgraded by two levels for their risk of bias. In the case of CBT, the treatment actively encourages patients to view and report their symptoms differently, for example, by no longer focusing on fatigue or avoiding catastrophizing. This means that the risk of bias is exceptionally large if you use subjective outcomes such as a fatigue questionnaire.
In the REGAIN trial, there are other problems. There was, for example, a high drop-out rate, which was higher in the intervention group (27%) than in the control group (21%). The recovery/improvement rates that the review calculated simply ignore these participants. Normally, one would use an intention-to-treat (ITT) analysis that includes all participants in the group they were randomized to. The BMJ review, however, calculated recovery/improvement based on available cases only, which has the unfortunate consequence that recovery rates look better when more participants drop out.
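A small numerical sketch, with made-up numbers, shows why this matters: excluding dropouts inflates the apparent recovery rate, and the inflation is larger in the arm that loses more participants:

```python
def recovery_rate(recovered, completers, randomized, itt=False):
    """Available-case analysis divides by completers; a conservative
    ITT analysis divides by everyone randomized, counting dropouts
    as not recovered. All numbers below are hypothetical."""
    return recovered / (randomized if itt else completers)

# hypothetical arm: 100 randomized, 27 drop out, 40 of 73 completers recover
print(round(recovery_rate(40, 73, 100), 2))            # 0.55 (available cases)
print(round(recovery_rate(40, 73, 100, itt=True), 2))  # 0.40 (conservative ITT)
```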
The REGAIN study also recruited patients who had been discharged from hospital after severe COVID-19, a group that is not representative of the broader Long Covid population. In such cases, GRADE offers the option to downgrade for 'indirectness.' The review did not do this, arguing that “there is no evidence that currently suggests the effects of the intervention may be different based on severity of the acute COVID-19 infection.” In our view, knowing that your sample is unrepresentative of the target population is sufficient reason to downgrade the certainty of evidence.
Cherry picking
Another issue is that the review extracted 28 outcomes from the REGAIN trial and scanned them for significant effects. By pure chance, some of these estimates might exceed the MID. Trials usually specify primary outcomes and correct their analyses for the number of tests conducted to avoid false positives (Type I errors). The BMJ review, however, failed to take this into account.
Take the example of REGAIN and depression: the point estimate for the HADS subscale was equal to the MID of 1.5 points at the 52-week time point. But at 12 weeks, shortly after treatment ended, the estimate was only -0.7 (95% CI: -1.59 to 0.19), indicating no significant effect. The REGAIN trial also included a PROMIS depression scale, which suggested the intervention had no important effect on depression at either time point. From this, the review nonetheless concludes that REGAIN rehabilitation “probably reduces symptoms of depression.” This is not a good reflection of the data. Why would the effect show up on one depression scale but not the other? Why would the effect be absent right after treatment, only to appear many weeks later at long-term follow-up? A more likely explanation is that this is a false positive.
Systematic reviewers are advised to consider the risk of Type I error when interpreting findings based on multiple comparisons. Zeraatkar and colleagues seem to have done the opposite: they scanned all 28 outcomes for a significant effect and made recommendations for just about any outcome that crossed the MID threshold, even when other evidence contradicted it.
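The scale of the problem is easy to quantify: if 28 outcomes are each tested at the conventional 5% level and are, for simplicity, assumed independent, the chance of at least one false positive already exceeds 75%:

```python
alpha, n_outcomes = 0.05, 28
# probability of at least one false positive across independent tests
p_any = 1 - (1 - alpha) ** n_outcomes
print(round(p_any, 2))  # 0.76
```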
There is an additional problem with the recovery/improvement outcome of REGAIN. The review highlights this finding by stating that “an estimated 161 more patients per 1000 (95% CI 61 more to 292 more)” experienced meaningful improvement or recovery. The problem is that this outcome was not registered in the trial registration, and there are multiple ways in which it could have been analyzed. Highlighting an unregistered secondary outcome because it shows a bigger effect than the others is a questionable approach, akin to cherry-picking.
Ignoring objective outcomes
Lastly, there is the problem that the review only includes subjective outcomes such as symptom questionnaires. The authors justify this decision as follows:
“Our review relied on self-reported measures rather than observations by health professionals or biomarkers. This approach is justified since the symptoms of long covid, such as fatigue, are subjectively experienced, and no objective laboratory measures have been established to predict benefit in terms of how patients with long covid feel or function.”
This seems poorly argued. Symptoms such as fatigue are always experienced subjectively, and the lack of biomarkers for Long Covid does not mean that you can’t include objective outcomes such as actigraphy, employment, or fitness tests. It’s important to include these objective outcomes as they are more reliable than symptom questionnaires in trials where blinding is not feasible.
ME/CFS patients have previously asked for objective outcomes to be included in reviews. The best example is the Cochrane review on graded exercise therapy for ME/CFS, which ignored objective outcomes showing that patients failed to improve or get fitter after exercise therapy. That review has been the subject of much controversy, and an ongoing petition signed by 11,300 people and 76 ME/CFS charities has called for it to be withdrawn. It's important not to make the same mistake in Long Covid reviews. The inclusion of objective outcomes might not seem important now, while there are only a few trials, but it may well be in the future.
Conclusion
The review by Zeraatkar and colleagues is fraught with inconsistencies. Several rapid responses and blog articles (here and here) have been published, highlighting many of the issues we discussed in this article.
The review is a 'living review,' meaning it will be updated frequently as new information arises. Hopefully that offers an opportunity to make some corrections. The authors “anticipate that the living systematic review will become a trusted reference point for national and international professional associations and authoritative organizations that intend to produce guideline recommendations on the management of long COVID.” This review is thus likely to have a big impact on Long Covid patients worldwide, which makes it all the more important to get things right.