Approaches to interpreting and choosing the best treatments in network meta-analyses

When randomized trials have addressed multiple interventions for the same health problem, network meta-analyses (NMAs) permit researchers to statistically pool data from individual studies, including evidence from both direct and indirect comparisons. Grasping the significance of the results of NMAs may be very challenging. Authors may present the findings from such analyses in several numerical and graphical ways. In this paper, we discuss ranking strategies and visual depictions of rank, including the surface under the cumulative ranking (SUCRA) curve method. We present the merits and limitations of ranking approaches and provide an example of how to apply the results of an NMA to clinical practice.


Background
Systematic reviews of randomized clinical trials (RCTs) provide crucial information for determining the effect of interventions in clinical practice [1]. Typically, investigators statistically combine treatment effect estimates (effect sizes) from individual clinical trials [2]. Traditional meta-analyses compare a single intervention to a single alternative (direct pair-wise comparisons) [3].
In many clinical contexts, clinicians consider more than two alternative treatments, each of which may have been compared to standard care, a placebo, or an alternative intervention. Because some interventions have never been compared to a placebo, or lack head-to-head direct comparisons, choosing between a number of alternatives creates challenges for determining their relative merit [4].
A solution to the multiple alternative problem that uses an entire body of evidence with all available direct and indirect comparisons, termed network meta-analysis (NMA) or multiple treatment comparison meta-analysis, is seeing increasing use [5]. In addition to providing information on the relative merits of interventions that have never been directly compared, NMAs may also increase the precision of effect estimates by combining both direct and indirect evidence.
However, the results of NMAs may be complex and difficult for clinicians to interpret, especially when there are many alternative strategies and outcomes to consider [6,7]. Guidance on how to interpret findings from NMAs remains limited [8]. To address interpretation challenges, NMA authors can complement numerical data with graphical tools [9-11] and with rankings of interventions. Indeed, some form of ranking is reported in two-thirds of all published NMAs [7], and experts recommend ranking as a form of presentation [12].
Other discussions have addressed reporting options, including ranking approaches, often assuming that readers have a sophisticated knowledge of analytic methods [9,10]. Our objective here is not to be technical or comprehensive, but rather to discuss the merits and limitations of ranking methods with a specific focus on surface under the cumulative ranking (SUCRA) curve, a popular ranking method.

Ranking treatments
Clinicians wish to offer patients a choice among the most desirable treatment options. Though a treatment that is certain to be the best in terms of the most important benefit outcome (e.g. a reduction in risk of stroke) would be a strong candidate for the treatment of choice, it might also carry more harms than other options (e.g. greatest risk of bleeding, or greatest burden).
Moreover, results of studies are always associated with uncertainty and we will seldom, if ever, be sure a treatment is best. Rather, we can think of the likelihood that, for a particular outcome, a treatment is best, or near best. Of two treatments that are unlikely to be the best, the treatment with a higher likelihood of being second best would, all else being equal, be preferable to one with a lower likelihood of being second best. Ranks can be presented graphically and numerically. The graphical approaches involve examining curves that show the probability that each treatment occupies a specific rank, and the area under those curves. These graphs are daunting to compare, especially when many treatments and outcomes are examined.
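To make the idea of rank probabilities concrete, the following is a minimal sketch of how they could be derived from simulated draws of each treatment's effect. The function name `rank_probabilities`, the treatment labels A, B and C, and the simulated values are all hypothetical and are not taken from the NMA discussed in this paper; lower values are assumed to be better (e.g. lower mortality).

```python
import random


def rank_probabilities(draws, lower_is_better=True):
    """Estimate, for each treatment, the probability of holding each rank.

    draws maps a treatment name to a list of simulated outcome values
    (one value per simulation, all lists the same length).
    The result maps each treatment to a list p, where p[k] is the share
    of simulations in which that treatment held rank k + 1.
    """
    names = list(draws)
    n_sims = len(draws[names[0]])
    counts = {name: [0] * len(names) for name in names}
    for s in range(n_sims):
        # Order the treatments from best to worst within this simulation.
        ordered = sorted(names, key=lambda name: draws[name][s],
                         reverse=not lower_is_better)
        for rank, name in enumerate(ordered):
            counts[name][rank] += 1
    return {name: [c / n_sims for c in counts[name]] for name in names}


# Hypothetical simulated effects (e.g. log odds ratios versus a reference).
rng = random.Random(0)
draws = {
    "A": [rng.gauss(-0.30, 0.15) for _ in range(5000)],
    "B": [rng.gauss(-0.25, 0.15) for _ in range(5000)],
    "C": [rng.gauss(0.00, 0.15) for _ in range(5000)],
}
probs = rank_probabilities(draws)
for name, p in probs.items():
    print(name, [round(x, 2) for x in p])
```

Each row of the printed output corresponds to one curve in a graphical ranking display: the probability that the treatment is first, second, and so on.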
The surface under the cumulative ranking curve (SUCRA) is a numeric presentation of the overall ranking and presents a single number associated with each treatment. SUCRA values range from 0 to 100%. The higher the SUCRA value, and the closer to 100%, the higher the likelihood that a therapy is in the top rank or one of the top ranks; the closer to 0 the SUCRA value, the more likely that a therapy is in the bottom rank, or one of the bottom ranks.
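As a rough sketch of the arithmetic, SUCRA for one treatment can be computed as the average of its cumulative rank probabilities over the first a - 1 ranks, where a is the number of treatments. The function name `sucra` and the example probabilities below are illustrative assumptions, not values from the sepsis NMA; the result is on a 0 to 1 scale and can be multiplied by 100 to express it as a percentage.

```python
def sucra(rank_probs):
    """Return SUCRA (0 to 1) for a single treatment.

    rank_probs[k] is the probability that the treatment holds rank k + 1,
    with rank 1 being the best. The probabilities should sum to 1.
    """
    n = len(rank_probs)
    if n < 2:
        raise ValueError("SUCRA needs at least two treatments")
    cumulative = 0.0
    total = 0.0
    # Sum the cumulative probabilities for ranks 1 .. n-1; the last
    # cumulative value is always 1 and carries no information.
    for p in rank_probs[:-1]:
        cumulative += p
        total += cumulative
    return total / (n - 1)


print(sucra([1.0, 0.0, 0.0]))        # certain to be best
print(sucra([0.0, 0.0, 1.0]))        # certain to be worst
print(round(sucra([0.5, 0.3, 0.2]), 2))  # hypothetical intermediate case
```

A treatment certain to be best scores 1 (100%), one certain to be worst scores 0, and a treatment equally likely to occupy any rank scores 0.5.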
Applying these methods to a real-life example
An NMA studied the impact of alternative resuscitative fluids on mortality in adult patients with sepsis [13]. We present here the results from an analysis that divided the intervention into six categories: albumin, balanced crystalloid, saline, gelatin, heavy starch and light starch. Figure 1 depicts the rankings of these six treatments.
From Fig. 1, we can see that balanced crystalloids have the highest likelihood of being ranked first, followed by albumin, gelatin and heavy starch; the results suggest no possibility that light starch and saline lead to the lowest mortality. For the second rank, balanced crystalloids and albumin still appear most likely and light starch and saline least likely, but heavy starch now has a higher likelihood than gelatin. Gelatin, the two starches, and saline are more likely to be among the lower ranks (3 to 6), and albumin and balanced crystalloid far less likely to be among the lower ranks. Looking across the figures, you could make an intuitive estimate of the rankings, and the gradient in effect across treatments. Table 1 presents the SUCRA results that emerge from these data. The SUCRA rankings confirm that balanced crystalloid and albumin are most likely to result in the lowest mortality (with quite similar SUCRA scores) while light starch appears appreciably less attractive than the other alternatives.

Five reasons why these rankings may mislead if not interpreted correctly
Taking these results at face value, clinicians should now be resuscitating all their septic patients with a balanced crystalloid solution. There are, however, several reasons why clinicians should not routinely choose the treatment with the highest SUCRA ranking. First, the evidence on which the SUCRA rankings are based may be of very low quality (synonyms: low certainty or confidence) and therefore untrustworthy. Second, there are typically several relevant outcomes. A treatment that is best in one outcome (say, a benefit outcome) may be the worst in another outcome (for example, a harm outcome). Third, issues such as cost and a clinician's familiarity with use of a particular treatment may also bear consideration. Fourth, in the process of calculation, SUCRA does not consider the magnitude of differences in effects between treatments (e.g. in a particular simulation the first ranked treatment may be only slightly, or a great deal, better than the second ranked treatment). Fifth, chance may explain any apparent difference between treatments, and SUCRA does not capture that possibility.
Fig. 1 Graphical ranking of resuscitation fluids in six-node analysis
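The fourth point can be illustrated with a small sketch. It relies on the identity SUCRA = (a - mean rank) / (a - 1), where a is the number of treatments. The function name `sucra_from_draws`, the treatment labels and the simulated values are hypothetical and chosen only so that the ordering is the same in every draw; they do not come from the sepsis NMA.

```python
def sucra_from_draws(draws):
    """Compute SUCRA (0 to 1) per treatment from simulated effects.

    draws maps a treatment name to a list of simulated effects, one per
    simulation, where lower values are better.
    Uses the identity SUCRA = (a - mean rank) / (a - 1).
    """
    names = list(draws)
    n = len(names)
    n_sims = len(draws[names[0]])
    rank_sum = {name: 0 for name in names}
    for s in range(n_sims):
        ordered = sorted(names, key=lambda name: draws[name][s])
        for rank, name in enumerate(ordered, start=1):
            rank_sum[name] += rank
    return {name: (n - rank_sum[name] / n_sims) / (n - 1) for name in names}


# Two hypothetical scenarios with the same ordering in every simulation
# but very different distances between the treatments.
small_gaps = {"A": [-0.02, -0.03, -0.02],
              "B": [-0.01, -0.02, -0.01],
              "C": [0.0, 0.0, 0.0]}
large_gaps = {"A": [-2.0, -3.0, -2.0],
              "B": [-1.0, -2.0, -1.0],
              "C": [0.0, 0.0, 0.0]}

print(sucra_from_draws(small_gaps))
print(sucra_from_draws(large_gaps))
```

Both scenarios yield identical SUCRA values, even though the differences between treatments are 100 times larger in the second: the calculation sees only the order of treatments within each simulation, not the size of the gaps.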
In this case, clinicians may easily misinterpret the apparently clear hierarchy in the efficacy of these fluids in reducing mortality. Table 2 presents a more detailed summary of the evidence, including the number of direct comparisons, the direct, indirect and network estimates and their associated credible intervals, and the certainty (quality, confidence) of the evidence.
This body of evidence demonstrates the most compelling reason to potentially mistrust rankings in general and SUCRA in particular: they may arise from evidence warranting low or very low certainty. A set of SUCRA ratings may arise from a large body of studies with few limitations and high certainty in the evidence. Exactly the same set of ratings may arise from a small body of studies with major limitations in risk of bias (unconcealed randomization, lack of blinding, large loss to follow-up), imprecision (wide confidence intervals or small number of events), inconsistency in results, indirectness (for instance, studies enrolling a sample of patients that differ from the population of interest, or measuring outcomes differently, such as with shorter follow-up), and publication bias, and thus warrant only low or very low certainty.
In this case, because of a high risk of bias, imprecision, inconsistency, and indirectness, of the 15 paired comparisons, 5 warrant only very low certainty, 5 low certainty, 5 moderate certainty, and none high certainty. Of the moderate certainty comparisons, only 1, balanced crystalloid versus light starch, showed a statistically significant (i.e. p < 0.05) difference between treatments; all the other moderate certainty ratings failed to show a statistically significant difference between treatments (indeed, none of the other 10 paired comparisons showed convincing differences either).
Because of the low or very low quality evidence underlying most comparisons, the SUCRA ratings will result in misleading inferences if taken at face value. For instance, we may reasonably infer from Table 2 that balanced crystalloids are very likely to result in lower mortality than light starch. We cannot be at all certain, however, that the differences between balanced crystalloid and albumin, or even balanced crystalloid and heavy starch, are real and important. Indeed, and perhaps wisely, reviewers of the NMA felt that the risk of misinterpretation of rankings in general and SUCRA in particular was in this case so great that they insisted on their omission from the published manuscript [13]. However, most clinicians are likely to find interpretation of Table 2 data challenging. Indeed, this is likely to be the case whenever an NMA includes more than three or four interventions. Therefore, despite their limitations, alternative presentation formats are likely to be helpful.
Table 2 footnotes: 1, rated down for imprecision; 2, rated down for indirectness; 3, rated down for inconsistency (I² = 80%, p = 0.03 for heterogeneity); 4, rated down 2 levels for imprecision

An alternative summary presentation
Given the risks of relying primarily on rankings, and the cognitive challenges of processing tabular presentations such as Table 2 (which has the benefit of capturing all the key evidence), there is another potentially helpful presentation format for NMAs. This format involves a visual representation of point estimates and certainty or confidence intervals comparing NMA estimates of each treatment against a constant comparator. In NMAs comparing alternative drug therapies, that common comparator may be a placebo or standard care. In this case, we have chosen the lowest ranked treatment, light starch (Fig. 2), as the common comparator. This visual representation facilitates appropriate inferences: (i) point estimates suggest that all treatments (with the exception of gelatin, with a point estimate of 1.0) are superior to light starch; (ii) any true differences between balanced crystalloid and albumin are likely to be small; (iii) differences between these two treatments and the other four may be considerably larger; and (iv) the extent of the overlapping confidence intervals considerably diminishes our certainty about inferences (i) to (iii).
While potentially helpful, and in particular at least to some extent avoiding the excessively strong inferences that the unwary clinician might make from SUCRA rankings, this presentation format also has limitations.
First, it deals only with a few of the comparisons in Table 2, and a full picture of the evidence requires a consideration of the other comparisons. Second, while capturing issues of precision, it tells us nothing about risk of bias, indirectness, and publication bias, and a limited amount about inconsistency (if the analysis is based on random-effects rather than fixed-effect models, inconsistency may contribute to widening of confidence intervals). Third, using a common comparator to which many interventions have not been compared may lead to wider confidence intervals, leading to less secure inferences than the data may warrant.
However, in our current example, all else being equal, the visual display of rankings, the SUCRA ratings, and the visual depiction of comparisons with light starch all suggest that choosing either balanced crystalloid or albumin as the initial resuscitation fluid may be advisable. At least one inference is very secure: light starch is a poor choice of resuscitation fluid.

Conclusions
We acknowledge some limitations in this work. Our descriptions are based on one example in which the differences between the effects of the resuscitation fluids are not very large, and therefore careful consideration is required in selecting the best option.
Appropriate interpretation of NMA results involves presentation of direct and indirect as well as the NMA estimates and their associated confidence/credible intervals for each paired comparison, as well as the associated certainty of estimates (as in Table 2). When the NMA involves more than three or four interventions, however, the cognitive challenge of optimally interpreting such evidence summaries is daunting. Visual displays of rankings (Fig. 1), the SUCRA statistic (Table 1), and visual displays of point estimates and confidence intervals of relative effects of interventions against a common comparator (Fig. 2) can all aid in interpretation when used together.
Clinicians using NMAs should bear in mind that the presentation approaches we have described all have their limitations and require cautious interpretation. If interpreted in the light of certainty (quality and confidence) in the evidence, clinicians can avoid misleading inferences. They can then use best evidence presentations from NMA to guide their clinical practice and offer patients optimal choices in managing their health issues.
Abbreviations
NMA: Network meta-analysis; SUCRA: Surface under the cumulative ranking