Survival simulation
Our simulation returned 1000 studies with a median of 8 animals per group (IQR 7–10, range 2–23; see Supplementary Material 2). During the simulated follow-up period (arbitrarily scaled to 50 days), 15,804 of the 16,000 simulated animals (98.8%) died, with 196 (1.2%) surviving the duration of the experiment. The overall median survival was 8.58 days.
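The sketch below illustrates, in Python, how a single experiment of this kind could be generated, assuming exponential survival times, a multiplicative treatment effect on the hazard coded so that HR > 1 reflects benefit (longer survival in the treated group), and censoring at day 50; the group-size distribution and baseline median are illustrative placeholders rather than the parameters of our simulation.

```python
# A minimal sketch of one simulated experiment, under the assumptions stated above.
import numpy as np

rng = np.random.default_rng(1)

def simulate_experiment(true_hr=1.5, baseline_median=8.0, follow_up=50.0):
    """Return survival times, event indicators and treatment labels for one study."""
    n = max(2, round(rng.normal(8, 2)))                  # roughly a median of 8 animals/group
    control_hazard = np.log(2) / baseline_median         # exponential: median = ln2 / hazard
    treated_hazard = control_hazard / true_hr            # benefit-coded hazard ratio
    hazards = np.r_[np.full(n, control_hazard), np.full(n, treated_hazard)]
    times = rng.exponential(1.0 / hazards)               # one survival time per animal
    event = (times <= follow_up).astype(int)             # 1 = died during follow-up
    times = np.minimum(times, follow_up)                 # censor survivors at day 50
    treat = np.r_[np.zeros(n, int), np.ones(n, int)]     # 0 = control, 1 = treated
    return times, event, treat
```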
Cox regression of the individual data (via the stcox command) suggested a significant treatment effect (HR 1.50 ± 0.024, Z = 24.8, p < 0.001), with a corresponding median survival ratio (MSR) of 1.27 (Fig. 1). Furthermore, the influence of all 7 variables intended to affect survival was demonstrable on multivariate Cox regression. Kaplan-Meier curves visualising survival stratified by each categorical variable group are shown in Fig. 1.
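As an illustration of this individual-level analysis, the hedged sketch below uses the Python lifelines package in place of Stata's stcox; the column names, and the coding of the treatment indicator (which determines the sign of the log hazard ratio), are assumptions.

```python
# Fit a Cox model and compute the median survival ratio for one dataset (sketch).
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

def hr_and_msr(df):
    """Estimate the treatment log hazard ratio, its SE and the median survival ratio."""
    cph = CoxPHFitter()
    cph.fit(df[["time", "event", "treat"]], duration_col="time", event_col="event")
    ln_hr = cph.params_["treat"]                      # log hazard ratio for the treat covariate
    se_ln_hr = cph.standard_errors_["treat"]          # its standard error

    medians = {}
    for group, sub in df.groupby("treat"):            # 0 = control, 1 = treated
        km = KaplanMeierFitter().fit(sub["time"], sub["event"])
        medians[group] = km.median_survival_time_
    msr = medians[1] / medians[0]                     # treated vs control median survival
    return ln_hr, se_ln_hr, msr

# Example usage with the simulated experiment from the previous sketch (assumed):
# times, event, treat = simulate_experiment()
# df = pd.DataFrame({"time": times, "event": event, "treat": treat})
# ln_hr, se_ln_hr, msr = hr_and_msr(df)
```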
The stcox function successfully generated HR estimates for all 1000 experiments in the first simulation, with a median of 1.52 (IQR 1.02–2.62), although a small proportion returned extreme HRs favouring either treatment (11/1000, 1.1%) or control (1/1000, 0.1%) with correspondingly large standard errors (see Supplementary Material 3 for examples). These were not excluded, as their large standard errors would diminish their weighting at meta-analysis. The median MSR for the dataset was 1.25 (IQR 0.97–1.70; range 0.423–6.27), with no extreme outliers. After log-transforming each statistic prior to meta-analysis, there was a modest correlation between lnHR and lnMSR (r = 0.37, Fig. 2A). The two summary statistics suggested the same direction of treatment effect in 840/1000 (84%) instances and opposite directions in the remaining 160 (16%). In these discordant instances the efficacy estimates were typically modest for both measures, and the opposing directions could be accounted for by a non-significant treatment effect or crossing Kaplan-Meier curves (see Supplementary Material 4).
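A simple way to reproduce this comparison, assuming arrays ln_hr and ln_msr holding one log-transformed value per experiment, is sketched below.

```python
# Agreement between the two per-experiment summary statistics (sketch).
import numpy as np

def agreement(ln_hr, ln_msr):
    """Pearson correlation and proportion of experiments with the same effect direction."""
    ln_hr, ln_msr = np.asarray(ln_hr, float), np.asarray(ln_msr, float)
    r = np.corrcoef(ln_hr, ln_msr)[0, 1]
    same_direction = np.mean(np.sign(ln_hr) == np.sign(ln_msr))
    return r, same_direction
```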
The measure of error used for lnHR was the standard error generated by the stcox estimation. As calculation of the MSR does not inherently produce a measure of error, the number of animals in each experiment was used as a surrogate: the total number of individuals replaces the inverse variance as the meta-analysis weighting factor, so the SE for lnMSR was estimated as 1/√n. There was a linear correlation between se_lnHR and se_lnMSR, although the standard errors were larger and of greater range for the HR data (Fig. 2B). Correspondingly, the fixed-effects weights for the two approaches were correlated, with absolute weighting values much higher for MSR data than for HR (Fig. 2C). Consequently, the τ2 estimate for the MSR-based meta-analysis was 0.111, compared with 0.0626 for the HR approach, meaning that random-effects weighting was more consistent (more uniform across studies) in the MSR meta-analysis (Fig. 2D).
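A minimal sketch of this error and weighting scheme is given below; the 1/√n surrogate follows the text, while the DerSimonian-Laird moment estimator for τ2 is an assumption and may differ in detail from the random-effects estimator used in our Stata analysis.

```python
# SE surrogate for lnMSR, fixed-effect weights and a moment estimate of tau^2 (sketch).
import numpy as np

def se_lnmsr(n_total):
    """Surrogate SE of lnMSR, based on the total number of animals in the experiment."""
    return 1.0 / np.sqrt(n_total)

def dersimonian_laird_tau2(effects, se):
    """DerSimonian-Laird estimate of between-study variance tau^2 from lnHR or lnMSR values."""
    effects, se = np.asarray(effects, float), np.asarray(se, float)
    w = 1.0 / se ** 2                                  # fixed-effect (inverse-variance) weights
    pooled = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - pooled) ** 2)            # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)
```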
Meta-analysis of the HR data suggested a significant treatment effect, with a pooled HR of 1.50 (95% CI 1.44–1.56; t = 19.3, p < 0.001). Similarly, the pooled MSR was 1.29 (95% CI 1.26–1.32; t = 19.0, p < 0.001). The I2 value was high for the MSR data (63.7%) and low for the HR data (23.5%). On univariate meta-regression, a significant predictive effect was identified for 5 variables using HR (Var1, Var2, Var3, Bin1, Bin2) and for 6 using MSR (Var1, Var2, Var3, Bin1, Bin2, Cont1). Similarly, multivariate meta-regression of these 1000 studies revealed the predictive value of 5 variables for both HR and MSR (Var1, Var2, Var3, Bin1, Bin2).
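The pooling and heterogeneity summaries reported here can be sketched as follows, taking the τ2 estimate from the previous snippet as an input; a normal-approximation confidence interval is used for brevity, whereas our analysis reports t-based tests.

```python
# Random-effects pooling of lnHR or lnMSR values with I^2 (sketch).
import numpy as np

def random_effects_pool(effects, se, tau2):
    """Pool lnHR or lnMSR values; return the pooled ratio, its approximate 95% CI and I^2 (%)."""
    effects, se = np.asarray(effects, float), np.asarray(se, float)
    w = 1.0 / (se ** 2 + tau2)                         # random-effects weights
    pooled = np.sum(w * effects) / np.sum(w)
    se_pooled = np.sqrt(1.0 / np.sum(w))

    w_fixed = 1.0 / se ** 2                            # fixed-effect weights for Cochran's Q
    q = np.sum(w_fixed * (effects - np.sum(w_fixed * effects) / np.sum(w_fixed)) ** 2)
    i2 = 100.0 * max(0.0, (q - (len(effects) - 1)) / q) if q > 0 else 0.0

    lo, hi = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
    return np.exp(pooled), (np.exp(lo), np.exp(hi)), i2   # back-transform to the ratio scale
```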
To summarise, we have found that the HR and MSR summary statistics correlate with one another and perform comparably in a single large meta-analysis.
Meta-analysis and meta-regression power assessment
The simulation was repeated for 100,000 experiments containing 1.6 million individuals to allow the sensitivity of each approach to be assessed through repeated meta-analyses. This was done using the same parameters as the first simulation, except for the number of individuals and experiments. The median study group size was 8 (IQR 7–10). Death occurred in 98.7% of individuals during the experiment, and the overall median survival was 8.39 days.
Cox regression again suggested a significant influence of treatment, with HR 1.46 ± 0.0237, Z = 234 and p < 0.001. Median survival was 9.42 days in the treatment group and 7.55 days in the control group, giving an MSR of 1.25. Cox regression also suggested an influence of each predictive variable on survival comparable to that seen in the first dataset, and no influence of the control variables (see Supplementary Material 5).
HR estimation failed to converge for 34 experiments (0.034%), and these experiments were excluded from the remainder of the simulation; most had exceptionally small sample sizes (see Supplementary Material 6 for examples). Whenever a meta-analysis had a study excluded, it was still treated as if it were its original size (that is, as a meta-analysis of 20 experiments rather than 19, or 100 rather than 99). There were no instances in which a single meta-analysis had two experiments excluded.
HR- and MSR-based summary statistics performed similarly at random-effects meta-analysis. Their ability to detect a treatment effect was comparable, with sensitivity of around 70% for meta-analyses of 20 experiments for each summary statistic, and close to 100% for those including 50 or more experiments (Fig. 3A). I2 values were consistently low for HR-based meta-analyses, with few values above 25% (Fig. 3B), in keeping with the more conservative SE estimates discussed above. Conversely, I2 was consistently between 60% and 65% across the range of meta-analysis sizes for MSR data (Fig. 3C), and was higher for the MSR-based approach than for HR in every instance. Global efficacy estimates were fairly consistent across both datasets, with the variance of estimates slightly lower for the MSR meta-analyses (Fig. 3D, E).
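The sensitivity assessment can be sketched as repeated sampling of meta-analyses of a given size from the pool of per-experiment estimates, as below; the snippet builds on the pooling functions in the snippets above, and the detection rule (a 95% CI excluding 1) and repetition count are illustrative assumptions.

```python
# Power of repeated random-effects meta-analyses of a given size (sketch).
# Assumes dersimonian_laird_tau2 and random_effects_pool from the earlier snippets.
import numpy as np

rng = np.random.default_rng(0)

def meta_analysis_power(effects, se, meta_size, n_reps=1000):
    """Proportion of simulated meta-analyses of a given size that detect the treatment effect."""
    effects, se = np.asarray(effects, float), np.asarray(se, float)
    detected = 0
    for _ in range(n_reps):
        idx = rng.choice(len(effects), size=meta_size, replace=False)
        tau2 = dersimonian_laird_tau2(effects[idx], se[idx])
        _, (lo, hi), _ = random_effects_pool(effects[idx], se[idx], tau2)
        detected += (lo > 1.0) or (hi < 1.0)            # CI on the ratio scale excludes 1
    return detected / n_reps
```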
We compared the ability of meta-regression to detect the predictive value of covariates on treatment outcome for both MSR- and HR-based meta-analysis, at both univariate and multivariate stages. Throughout, alpha was set to 0.005 for univariate meta-regressions to account for multiple testing, and to 0.05 for multivariate meta-regressions. For univariate meta-regression, sensitivity was relatively low overall. The power to detect even the stronger associations (for example, with Var1 and Bin1) was below 50% in meta-analyses of 100 studies or fewer for each dataset. However, sensitivity increased to over 80% for strong associations at 200 studies and for moderate associations (e.g. Var2, Var3, Bin2) at 1000 studies. There was no major advantage of one summary statistic over the other in terms of sensitivity, although MSR-based meta-analysis slightly outperformed HR-based meta-analysis in every case. The type I error rate was maintained at around 0.05 for each of the control variables (Var5, Bin3, Cont2) throughout the range of meta-analysis sizes (Fig. 4A).
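As an approximation of the meta-regression step, the sketch below regresses the per-experiment effect estimates on a covariate using weighted least squares with random-effects weights; this approximates, rather than reproduces, the meta-regression model used in our analysis, and the variable names are illustrative.

```python
# Weighted univariate meta-regression of lnMSR (or lnHR) on a covariate (sketch).
import numpy as np
import statsmodels.api as sm

def univariate_metareg(effects, se, covariate, tau2, alpha=0.005):
    """Return the covariate's slope, its p-value and whether it is 'detected' at alpha."""
    X = sm.add_constant(np.asarray(covariate, float))
    w = 1.0 / (np.asarray(se, float) ** 2 + tau2)      # random-effects weights
    fit = sm.WLS(np.asarray(effects, float), X, weights=w).fit()
    slope, p = fit.params[1], fit.pvalues[1]
    return slope, p, p < alpha

# Multivariate meta-regression is analogous: stack all candidate covariates as
# columns of X and test each coefficient's p-value against alpha = 0.05.
```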
Finally, we undertook the same assessment using multivariate meta-regression. Again, sensitivity was relatively low but not dissimilar to that found at univariate meta-regression. Power was below 50% (α = 0.05) for even strong associations when the meta-analysis size was 100 or fewer, but reached 80% or more for strong and moderate associations in larger meta-analyses, in the same manner as univariate meta-regression (Fig. 4B). With α set as above, multivariate meta-regression appeared to confer a sensitivity advantage over univariate meta-regression (see Supplementary Material 7). Indeed, when α was increased to 0.05 for both strategies, there were no appreciable differences in sensitivity between the two techniques for MSR data (data not shown).
This section shows that, on repeated meta-analysis and meta-regression, the power of MSR to detect associations is equivalent, if not superior, to that of HR for small studies.
Publication bias inclusion
We recreated the large dataset (100,000 experiments, n = 1.6 million individuals) but introduced a file drawer effect to simulate publication bias of varying strength, randomly discarding 0%, 25%, 50%, 75% or 100% of experiments for which there was no apparent treatment effect on log-rank testing. Of the 100,000 experiments, only 17,076 (17.1%) returned a significant log-rank test statistic, so the available sample size decreased as the influence of the file drawer effect increased. Nevertheless, meta-analyses were only included in the analysis if their size was at least 90% of that intended (18 experiments for a meta-analysis size of 20, 900 for a size of 1000, and so on).
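The file drawer filter can be sketched as below, assuming experiments stored as (times, event, treat) tuples in the format of the earlier simulation sketch; the log-rank test here is taken from the lifelines package, and the 0.05 significance threshold mirrors the text.

```python
# Simulated file drawer effect: drop non-significant experiments with a given probability (sketch).
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(2)

def apply_file_drawer(experiments, drop_fraction):
    """Filter a list of (times, event, treat) tuples according to the file drawer effect."""
    kept = []
    for times, event, treat in experiments:
        res = logrank_test(times[treat == 0], times[treat == 1],
                           event_observed_A=event[treat == 0],
                           event_observed_B=event[treat == 1])
        significant = res.p_value < 0.05
        if significant or rng.random() >= drop_fraction:   # non-significant studies may be dropped
            kept.append((times, event, treat))
    return kept
```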
Unsurprisingly, the sensitivity of meta-analysis to detect the treatment effect increased with increasing file drawer effect (Fig. 5A), despite a dramatic reduction in sample size. Because global efficacy estimates appeared not to vary greatly between different meta-analysis sizes, we compared global efficacy estimates and I2 for meta-analyses of size 1000 only, as these returned the most precise estimates. The perceived global efficacy estimates increased dramatically as the influence of publication bias increased, with a median survival ratio of 2.01 (lnMSR = 0.702) when all non-significant experiments were discarded and 1.47 (lnMSR = 0.387) with a file drawer effect of 75% (Fig. 5B). There was no clear association between the file drawer effect and between-study heterogeneity (Fig. 5C).
We compared the performance of multivariate meta-regression of MSR data compromised by varying file drawer effects. While there was no major correlation between file drawer influence and meta-regression sensitivity, there was a trend for more biased datasets to suggest slightly higher power for detection of associations of any strength, without a clear increase in the type I error rate (Fig. 6).
This section provides an estimate of the true influence of publication bias, which cannot be measured accurately from real data because the problem is one of missing, irretrievable data.