We identified 19 eligible articles, published between August 2017 and December 2019, describing 13 individual reviews. Ten reviews [17,18,19,20,21,22,23,24,25,26] were reported in peer-reviewed journals, of which five had previously also appeared as one or more conference abstracts [27,28,29,30,31,32]. Three further reviews [33,34,35] were described in conference proceedings only. A flow diagram is shown in Additional file 2, and eligible reviews are summarised in Additional file 3.
Description of relevant reviews
All trials included in eligible reviews investigated the addition of one or more treatments (such as abiraterone, celecoxib, docetaxel, and zoledronic acid, alone or in combination) to the standard of care of androgen deprivation therapy (ADT), compared with ADT alone [36, 37]. One large adaptive trial [38] compared multiple research treatments under the same protocol, such that data from 14 randomised comparisons, arising from nine trial protocols, were represented across the reviews. Each review used data from between three and twelve randomised comparisons (Fig. 1), comprising between 1773 and 7844 patients. A matrix of the trials and treatment comparisons from each review is shown in Fig. 1, and the theoretical network resulting from simultaneous analysis of all such data is shown in Fig. 2. The source data from each relevant trial are given in Additional files 4 and 5.
Sources of variation
We observed considerable variation between the included reviews in terms of review aims, eligibility criteria and included data, statistical methodology, reporting and inference.
Review aims and funding sources
All 13 eligible reviews either stated or implied an aim to identify optimal treatments for hormone-sensitive prostate cancer. Two reviews stated additional specific aims of including updated results [22] and/or improved methodology [21, 22]. Four aimed to evaluate treatment efficacy within pre-defined patient subgroups [20, 23,24,25], and four stated the aim of incorporating health economic considerations [25] or adverse effects [18, 23, 35].
Eight of the 13 reviews either did not report funding sources or declared no conflicts of interest; a further three declared links to industry but no direct conflict of interest with any included trial [25, 31, 33, 34]. Of the remaining two reviews, one [24, 27] was directly sponsored by the funders of an included trial [13], with the stated aim of placing that trial in the context of a specific patient subgroup. The other [22, 28] shared an academic institution with an included trial [36, 37, 39], although there were no common funding sources external to the institution. Multiple trial investigators were named as co-authors of this review due to the collaborative nature of the project.
Included trials
Seven of the 13 reviews described themselves as “systematic” in their title or abstract [20, 22,23,24, 33,34,35], and a further four [18, 21, 25, 26] described themselves as such at least once elsewhere in their reports. All but one [17] reported that a formal search strategy had been used, although only five [18, 21, 23, 24, 26] referenced the PRISMA guidelines [4] or presented a review flowchart. All reviews specified a disease setting of hormone-sensitive prostate cancer (HSPC), and only included randomised controlled trials (RCTs). Nine reviews [17, 18, 20,21,22,23, 25, 33, 34] specified that trials must include a control arm of ADT alone. Of the remainder, only one review [24] included the direct comparison of abiraterone vs docetaxel from the STAMPEDE platform trial, first available as a conference abstract in September 2017 [40] and therefore potentially also eligible for other reviews (see Additional file 3). Eight reviews [17, 19, 21, 22, 26, 33,34,35] aimed to include trials in metastatic disease (M1). Two further reviews [24, 25] narrowed their target to metastatic high-volume disease (M1 HVD), of which one [24] additionally restricted to newly diagnosed (that is, untreated) M1 HVD but presented sensitivity analyses including data from other clinically-relevant trials with broader inclusion criteria. By contrast, three other reviews explicitly broadened their criteria to include trials in the high-risk [20] or locally advanced [18, 23] non-metastatic setting, although one [18] ultimately limited its analysis to M1 trials due to lack of data.
Included treatments
The set of included treatments varied depending upon the aims of the review. Eight reviews [17, 18, 23,24,25, 33,34,35] only included data comparing docetaxel or abiraterone plus ADT to ADT alone, reflecting the focus of clinical interest. Four others included at least one additional treatment combination from the STAMPEDE platform trial [36, 37]. Two such reviews [19, 20] included the zoledronic acid plus docetaxel combination, treating this simply as additional docetaxel data. The remaining two [21, 22] included all published results from STAMPEDE where a treatment combination was compared to ADT alone, plus other trials with data on similar comparisons (Fig. 1); and performed network analysis (see “Statistical methods” section, below). One such review [22] gave an explicit justification for the exclusion of one particular treatment (sodium chlodronate), referring to earlier work [7] where the treatment was considered separately due to “differences in mechanisms of action” and because it “is not commonly used in practice”. By contrast, two other treatments rarely used in recent times (estramustin phosphate and flutamide [41, 42]) were included, without explicit justification, in a different review [26].
Included participants
Patient inclusions were necessarily governed by the reported data from eligible trials. The vast majority of included trials (see Additional files 4 and 5) conformed to the intention-to-treat principle [43], the exceptions being two small, older trials [26, 41, 42] in which small numbers of patients were not analysed due to protocol deviation or non-eligibility.
Some reviews applied additional inclusion criteria within the HSPC setting, most commonly restricting to metastatic disease (M1; see “Included trials” section above). One of the largest relevant trials (STAMPEDE [36, 37, 39]) randomised men with both M1 and high-risk non-metastatic (M0) disease, but many results were reported within patient subgroups such that M1 men could be included in M1-only reviews. However, it was not always clear that review authors extracted or analysed these data consistently. For example, one review [17] specified that only M1 men were eligible, but the reported data suggested that the STAMPEDE result for all randomised patients (that is, M0 and M1 combined) had been extracted.
Only two reviews [20, 23] investigated patient subgroups other than M0/M1 or HVD: looking at age, performance status, Gleason score and presence of visceral metastases. Neither used the “deft” approach to testing for subgroup interactions in the meta-analytic context recommended by Fisher et al. [44].
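The “deft” approach referred to above estimates the subgroup interaction within each trial first, and only then pools those interaction estimates across trials, avoiding confounding by trial-level differences. A minimal sketch of the idea is below; all subgroup estimates are hypothetical illustrative numbers, not taken from any included trial, and a fixed-effect inverse-variance pool is assumed for simplicity.

```python
import math

# Hypothetical per-trial subgroup results on the log hazard ratio scale:
# (logHR in subgroup A, its SE, logHR in subgroup B, its SE).
# Illustrative numbers only -- not data from any of the included trials.
trials = [
    (-0.30, 0.12, -0.10, 0.15),
    (-0.25, 0.10, -0.05, 0.14),
    (-0.35, 0.20, -0.20, 0.22),
]

# "Deft" approach: form the subgroup interaction *within* each trial
# (difference in log HRs, with SEs combined assuming independence),
# then pool the interactions across trials by inverse-variance weighting.
interactions = [(a - b, math.sqrt(se_a**2 + se_b**2))
                for a, se_a, b, se_b in trials]
weights = [1 / se**2 for _, se in interactions]
pooled = sum(w * est for (est, _), w in zip(interactions, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"Pooled interaction (log HR scale): {pooled:.3f} (SE {pooled_se:.3f})")
```

A pooled interaction near zero with a narrow interval would suggest the treatment effect is consistent across the subgroups; the key design choice is that between-trial differences never enter the interaction estimate.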
Included outcomes
All reviews focussed on time-to-event outcomes reported on the relative (hazard ratio) scale. Eleven of the 13 reviews reported results for overall survival (OS; Additional file 3), generally thought to be the most clinically relevant outcome in this setting [45] and for which there was a consistent definition across trials and meta-analyses. Ten reviews reported results on intermediate (secondary) outcomes based on the time to disease progression (Additional file 3), but there were notable differences between reviews in how such data were handled. Precise outcome definitions varied between trials, and some trials reported effect sizes for multiple intermediate outcomes. Because of this, one review [20] considered that such data were “not reported consistently enough between trials to allow for pooling”. Three reviews [21, 24, 26] imposed a specific definition of the intermediate outcome, with the aim of maximising consistency but at the risk of trial exclusions and loss of information. By contrast, two reviews [22, 23] argued that intermediate outcome definitions were sufficiently similar as to allow clinical interpretation of the pooled result, selecting the most prominent estimate from trials where more than one definition was used. One such review [22] explicitly reported their observations regarding heterogeneity of definitions, and included a discussion of the potential impact on review conclusions (see “Comparison of primary results and of reviewers’ interpretations” section). The remaining reviews did not provide sufficient information to determine how intermediate outcome data were handled. Additional outcomes were considered by some reviews in accordance with specific review aims (see “Review Aims and Funding Sources” section), but are not within the scope of this case study.
Included results
Two of the included trials (see "Included trials" section) reported “long-term” results subsequent to their primary analysis reports, to allow secondary outcomes sufficient time to mature [12, 46,47,48]. Particularly in a time-to-event context, updated results can increase power and precision by capturing additional events [49]. Although three reviews explicitly stated that data from the most recent available trial report would be used [18, 20, 22], many others were inconsistent or unclear. For example, one review [19] referenced updated results for an included trial [47] but appeared to use an older set of results [46] in their analysis. Updated OS results from another trial were reported in a conference abstract [48], with intermediate outcome results presented at the conference itself. However, only a single review [22] incorporated these results in place of older published results for that trial [12].
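The gain in precision from updated results can be illustrated with the rough large-sample rule that, for a 1:1 randomised time-to-event comparison, the variance of the log hazard ratio is approximately 4 divided by the total number of events. The sketch below uses that approximation with hypothetical event counts (not taken from any included trial) to show how a confidence interval narrows as events accrue.

```python
import math

def approx_ci(hr, events, z=1.96):
    """Approximate 95% CI for a hazard ratio under 1:1 allocation,
    using the rough rule var(log HR) ~ 4 / (total events)."""
    se = math.sqrt(4 / events)
    return (math.exp(math.log(hr) - z * se),
            math.exp(math.log(hr) + z * se))

# Hypothetical scenario: the same observed HR of 0.75, first reported
# at 300 events, then updated at 500 events.
old_lo, old_hi = approx_ci(0.75, 300)
new_lo, new_hi = approx_ci(0.75, 500)
print(f"300 events: {old_lo:.2f} to {old_hi:.2f}")
print(f"500 events: {new_lo:.2f} to {new_hi:.2f}")
```

Under these assumptions the interval narrows noticeably with the extra events, which is why a review extracting from an older report can reach a different significance conclusion than one using the update, even before any change in the point estimate.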
Statistical methods
A wide range of statistical methods were used. Three reviews [17, 33, 34] simply carried out pairwise meta-analyses of included treatments versus standard-of-care, with inference for indirect comparisons based upon a test of subgroup difference [50]. A more common approach, used in five reviews [18,19,20, 23, 25], was the “Bucher method” [51], which is applicable to three-treatment triangular networks but has been criticised for estimating a separate heterogeneity variance for each comparison [50]. Two reviews [19, 20] accommodated the “docetaxel plus zoledronic acid” comparison from STAMPEDE within such a framework by treating it as an additional docetaxel comparison, reflecting a similar approach sometimes used in pairwise meta-analysis [52]. Four others analysed networks of four or more treatments using mixed treatment comparison (MTC) methods, using either frequentist multivariate analysis [22] or a Bayesian framework [21, 24, 26]. Such methods allow indirect evidence to contribute to effect estimation, which can increase precision [53]. Overall, of the nine frequentist reviews, six used random-effects modelling; one [18] used common-effect modelling; one [19] used a hybrid method (see Additional file 3); and one [25] was unclear. Only one review [22] reported network inconsistency or heterogeneity statistics. No reviews adjusted for any trial-level factors.
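The Bucher adjusted indirect comparison mentioned above contrasts two treatments via their respective pairwise estimates against a common comparator (here, ADT alone): the log hazard ratios are subtracted and, assuming the two estimates are independent, their variances are added. A minimal sketch follows; the pooled estimates used are illustrative placeholders, not figures from the included reviews.

```python
import math

def bucher_indirect(log_hr_a, se_a, log_hr_b, se_b, z=1.96):
    """Bucher adjusted indirect comparison of treatment A vs treatment B
    via a common comparator: subtract the log HRs and sum the variances,
    assuming independence of the two direct estimates."""
    log_hr_ab = log_hr_a - log_hr_b
    se_ab = math.sqrt(se_a**2 + se_b**2)
    ci = (math.exp(log_hr_ab - z * se_ab), math.exp(log_hr_ab + z * se_ab))
    return math.exp(log_hr_ab), ci

# Illustrative direct estimates (hypothetical numbers):
# treatment A vs ADT alone: HR 0.62 (SE of log HR 0.08)
# treatment B vs ADT alone: HR 0.77 (SE of log HR 0.07)
hr, (lo, hi) = bucher_indirect(math.log(0.62), 0.08, math.log(0.77), 0.07)
print(f"Indirect HR, A vs B: {hr:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Note that the indirect interval is necessarily wider than either direct interval, since the two variances are summed; this is one reason indirect estimates of abiraterone versus docetaxel in these reviews hovered around borderline significance.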
Due to its adaptive multi-arm design [38], multiple treatment comparisons from the STAMPEDE trial may be correlated. If a review includes such comparisons as though they were independent trials, double-counting of control arm observations may lead to spuriously precise estimates. However, only three reviews [21, 22, 24] explicitly discussed this issue, despite it being highlighted in the PRISMA-NMA statement [4]. One such review [21] stated that “treatment comparisons … from the same study were modelled … with a [Bayesian] correlation prior distributed uniformly on 0–0.95”. Another [22] sought to estimate the correlations themselves using event counts by treatment arm. Both also included zoledronic acid combination arms separately from docetaxel and celecoxib alone, which added strength to the docetaxel network comparison. The remaining review [24] was unique in including direct comparison data from STAMPEDE of abiraterone vs docetaxel [40, 54]. Despite correctly noting “differences in the period of enrolment” between the direct comparison and the original comparisons against ADT, and “uncertainty in the extent of overlap of populations for each of the comparisons” [24], they did not attempt to formally account for this, choosing instead to perform sensitivity analyses.
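One way to approximate the correlation induced by a shared control arm, in the spirit of the event-count approach described above, uses the rough large-sample rules that var(log HR) is about 1/d_treatment + 1/d_control and that two log HRs sharing the same control arm have covariance of about 1/d_control. The sketch below applies these approximations to hypothetical event counts; the numbers are purely illustrative and are not STAMPEDE data, and the method shown is a generic approximation rather than the exact procedure used by any included review.

```python
import math

# Hypothetical event counts for two research comparisons that share
# a control arm (illustrative only -- not data from STAMPEDE):
d_ctrl = 200   # events in the shared control arm
d_trt1 = 150   # events in research arm 1
d_trt2 = 160   # events in research arm 2

# Rough large-sample approximations for log hazard ratios:
#   var(log HR_i)            ~ 1/d_trt_i + 1/d_ctrl
#   cov(log HR_1, log HR_2)  ~ 1/d_ctrl   (shared control arm)
var1 = 1 / d_trt1 + 1 / d_ctrl
var2 = 1 / d_trt2 + 1 / d_ctrl
cov = 1 / d_ctrl
corr = cov / math.sqrt(var1 * var2)
print(f"Approximate correlation between the two comparisons: {corr:.2f}")
```

A correlation of this magnitude, if ignored, misstates the precision of any combined or indirect estimate, which is why the PRISMA-NMA statement flags multi-arm trials for explicit handling.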
Reporting
Three reviews were reported in conference proceedings only [33,34,35], and a further two [17, 26] took the form of “letters to the editor” rather than full research articles; understandably, these all conformed poorly to PRISMA-NMA guidelines [4]. Although the eight fully peer-reviewed articles generally conformed better (see Additional file 6), risk-of-bias assessments and handling of multi-arm trials were common omissions, and only two reviews [22, 23] published their protocol in advance. There was also some evidence of outcome reporting bias: for example, one review [26] presented an indirect estimate for the intermediate outcome but not for overall survival, despite evidence that both outcomes were analysed. Reporting of source data and description of statistical methodology were often poor, making it difficult to recreate the reported indirect treatment comparisons. Inconsistencies in use of source data, and minor reporting errors such as inconsistent patient or event counts, further hindered attempts to make reasonable judgments as to how such analyses might be recreated.
Comparison of primary results and of reviewers’ interpretations
Twelve of the 13 reviews analysed overall survival (OS), of which nine explicitly reported an indirect estimate of abiraterone versus docetaxel. Despite the differences described above, results were fairly similar, with HRs of around 0.80 (range 0.79 to 0.88) and of borderline significance at the 5% level (Fig. 3a). The most obvious discriminating feature was a wider confidence interval from two reviews [24, 25] which reported specifically on the high-volume disease (HVD) sub-population. Overall, eight reviews [17,18,19, 23, 33, 34] (including three MTC-based reviews [21, 22, 24] and one of the HVD-only reviews [24]) drew tentative conclusions regarding an OS advantage for abiraterone over docetaxel. By contrast, three reviews [20, 25, 35] stated categorically that there was no difference in OS; the conclusions for the final review [26] were unclear. Notably, conclusions differed among three reviews including an identical set of trials: two [18, 20] stated explicitly that their analysis did not demonstrate statistical significance, whilst the third [19] stated that “despite several limitations stemming from the paucity of comparative evidence, our results favour [abiraterone] over [docetaxel]”. This would appear to be due to a notable difference in effect size between two reports of the same trial [46, 47] (see “Sources of variation” section), with one review [19] extracting from the earlier report.
Of the 10 reviews that analysed an intermediate outcome, seven reported indirect estimates [18, 21,22,23,24,25,26] and a further two, both conference abstracts [33, 34], reported sufficient details of methods and associated results for the relevant estimates to be accurately recreated. Due to the variations in intermediate outcome definition (see “Sources of variation” section), we took the results most prominently presented or described in each review (see Additional file 3). The estimates here were more varied, with HRs ranging from 0.48 to 0.84 (Fig. 3b). Much of this heterogeneity may be explained by the two HVD-only reviews [24, 25] which reported noticeably smaller effect estimates. Of these, one [24] concluded that a “positive trend” was seen both in overall survival and in the intermediate outcome, whilst the other [25] stated that “no statistically significant difference” was seen. The third non-significant result in Fig. 3b is taken from a review [26] for which descriptions of methodology and source data were particularly limited, and we were unable to recreate their analysis.
In the remaining six reviews [18, 21,22,23, 33, 34], the intermediate outcome results were all strongly significant at conventional levels, and this was reflected in the reviewers’ conclusions. However, effect size heterogeneity is still apparent in Fig. 3b, with HRs of around 0.60 reported by three reviews [18, 22, 23] and of around 0.50 by three others [21, 33, 34]. One review [22] carried out sensitivity analyses of the choice of intermediate outcome effect from specific trials, and saw results consistent with both effect sizes. Another [21] imposed restrictions on intermediate outcome definitions to improve consistency, excluding two trial results [12, 13] included elsewhere. This a priori decision was justified by the review authors, and its potential limitations acknowledged. The remaining observed review-level heterogeneity would appear to be due to one trial [13] reporting two intermediate outcome effect size estimates that differed noticeably from each other (see Additional file 5).