### Survival simulation

Our simulation returned 1000 studies with a median of 8 animals per group (*IQR* 7–10, range 2–23; see Supplementary Material 2). During the simulated follow-up period (arbitrarily scaled to 50 days), 15,804 individuals (98.8%) died, and 196 (1.2%) survived to the end of the experiment. The overall median survival was 8.58 days.
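As an illustration only (the original simulation was run in Stata, and the rate parameter here is a hypothetical choice, not a parameter from the study), a single two-arm study with censoring at 50 days could be sketched as:

```python
import random

def simulate_study(n_per_group=8, control_rate=0.08, hr=1.5, follow_up=50.0, rng=random):
    """Simulate one two-arm study with exponentially distributed survival times.
    Returns (time, event, treated) rows; event = 0 marks censoring at follow-up.
    An HR > 1 here denotes a faster event rate in the treated group."""
    rows = []
    for treated in (0, 1):
        rate = control_rate * (hr if treated else 1.0)
        for _ in range(n_per_group):
            t = rng.expovariate(rate)
            rows.append((min(t, follow_up), int(t <= follow_up), treated))
    return rows

random.seed(1)
study = simulate_study()
deaths = sum(event for _, event, _ in study)  # with these rates, most animals die before day 50
```

With a mean survival of 12.5 days in the control arm, censoring at 50 days is rare, which mirrors the near-complete event rate reported above.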

Cox regression of the individual data (via the *stcox* command) suggested a significant treatment effect (*HR* 1.50 ± 0.024, *Z* = 24.8, *p* < 0.001), with a corresponding median survival ratio (*MSR*) of 1.27 (Fig. 1). Furthermore, multivariate Cox regression demonstrated the influence of all 7 variables designed to affect survival. Kaplan-Meier curves visualising survival stratified by each categorical variable are shown in Fig. 1.

The *stcox* function successfully generated *HR* estimates for all 1000 experiments in the first simulation, with a median of 1.52 (*IQR* 1.02–2.62), although a small percentage returned extreme *HR*s favouring either treatment (11/1000, 1.1%) or control (1/1000, 0.1%) with correspondingly large standard errors (see Supplementary Material 3 for examples). These were not excluded, as their weighting would be diminished at meta-analysis. The median *MSR* for the dataset was 1.25 (*IQR* 0.97–1.70), with no extreme outliers (range 0.423–6.27). After log-transforming each statistic prior to meta-analysis, there was a modest correlation between ln*HR* and ln*MSR* (*r* = 0.37, Fig. 2A). Both summary statistics suggested the same direction of treatment effect in 840/1000 (84%) instances; opposite directions were suggested in the remaining 160 (16%). In these instances, the efficacy estimates were typically modest for both measures, and the differing polarities could be accounted for by a non-significant treatment effect or crossing Kaplan-Meier curves (see Supplementary Material 4).

The measure of error used for ln*HR* was the standard error generated by the *stcox* estimation. As the *MSR* does not inherently carry a measure of error, this was estimated using the number of animals in the experiment as a surrogate: the total number of individuals in the experiment was used in place of inverse variance as the meta-analysis weighting factor, so the SE for ln*MSR* was estimated as 1/√*n*. There was a linear correlation between se(ln*HR*) and se(ln*MSR*), although the standard errors were larger and of greater range for *HR* data (Fig. 2B). Correspondingly, the fixed-effects weightings were correlated, with absolute values much higher for *MSR* data than for *HR* (Fig. 2C). The *τ*^{2} estimate was 0.111 for the *MSR*-based meta-analysis, compared with 0.0626 for the *HR* approach, meaning that random-effects weighting was more consistent in the *MSR* meta-analysis (Fig. 2D).
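The weighting scheme just described can be written out in a few lines. This is a minimal Python sketch (the analysis itself used Stata, and the helper names are ours), using the second simulation's medians of 9.42 and 7.55 days for a 16-animal study:

```python
import math

def ln_msr(median_treatment, median_control):
    """Natural log of the median survival ratio."""
    return math.log(median_treatment / median_control)

def se_ln_msr(n_total):
    """Surrogate standard error for ln(MSR): 1/sqrt(n), n = total animals in the study."""
    return 1.0 / math.sqrt(n_total)

def fixed_effect_weight(se):
    """Inverse-variance weight used at fixed-effects meta-analysis."""
    return 1.0 / se ** 2

est = ln_msr(9.42, 7.55)      # ≈ 0.221
se = se_ln_msr(16)            # 0.25
w = fixed_effect_weight(se)   # 16.0 — the weight equals n, by construction
```

Note that squaring the surrogate SE makes the inverse-variance weight collapse back to *n* itself, which is exactly the intended weighting.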

Meta-analysis of the *HR* data suggested a significant treatment effect, with a pooled *HR* of 1.50 (*95% CI* 1.44–1.56; *t* = 19.3, *p* < 0.001). Similarly, the pooled *MSR* was 1.29 (*95% CI* 1.26–1.32; *t* = 19.0, *p* < 0.001). The *I*^{2} value was high for *MSR* data (63.7%) and low for *HR* data (23.5%). On univariate meta-regression, a significant predictive effect of 5 variables was identified using *HR* (Var1, Var2, Var3, Bin1, Bin2) and of 6 using *MSR* (Var1, Var2, Var3, Bin1, Bin2, Cont1). Similarly, multivariate meta-regression of these 1000 studies revealed the predictive value of 5 variables for both *HR* and *MSR* (Var1, Var2, Var3, Bin1, Bin2).
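The random-effects pooling and heterogeneity statistics reported here can be sketched with the standard DerSimonian-Laird estimator (an illustrative Python implementation with made-up inputs, not the software or data from the study):

```python
def dersimonian_laird(estimates, ses):
    """Pool log-scale effect estimates under a random-effects model.
    Returns (pooled estimate, tau-squared, I-squared as a percentage)."""
    w = [1.0 / s ** 2 for s in ses]                     # fixed-effects (inverse-variance) weights
    sw = sum(w)
    fe = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fe) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    df = len(estimates) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = [1.0 / (s ** 2 + tau2) for s in ses]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    return pooled, tau2, i2

# Two discrepant studies produce high tau2 and I2; pooling sits between them.
pooled, tau2, i2 = dersimonian_laird([0.0, 1.0], [0.1, 0.1])
```

A larger *τ*^{2} inflates every study's variance term equally, pulling the random-effects weights towards uniformity, which is why the higher-*τ*^{2} *MSR* meta-analysis has the more consistent weighting noted above.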

To summarise, we found clear correlations between the *HR* and *MSR* summary statistics, as well as comparable performance in a single large meta-analysis.

### Meta-analysis and meta-regression power assessment

The simulation was repeated for 100,000 experiments containing 1.6 million individuals, to allow the sensitivity of each approach to be assessed by repeated meta-analyses. The same parameters were used as in the first simulation, except for the numbers of individuals and experiments. The median study group size was 8 (*IQR* 7–10), 98.7% of individuals died during the experiment, and the overall median survival was 8.39 days.

Cox regression again suggested a significant influence of treatment (*HR* 1.46 ± 0.0237, *Z* = 234, *p* < 0.001). Median survival was 9.42 days in the treatment group and 7.55 days in the control group, giving an *MSR* of 1.25. Cox regression also suggested an influence of each predictive variable on survival comparable to that observed in the first dataset, and no influence of the control variables (see Supplementary Material 5).

The iteration failed to converge in 34 instances (0.034%) of *HR* estimation, and these experiments were excluded from the remainder of the simulation; they mostly had exceptionally small sample sizes (see Supplementary Material 6 for examples). Whenever a meta-analysis had a study excluded, it was still treated as if it were its original size (that is, as comprising 20 experiments rather than 19, or 100 rather than 99). There were no instances in which a single meta-analysis had two experiments excluded.

*HR*- and *MSR*-based summary statistics performed similarly at random-effects meta-analysis. Their ability to detect a treatment effect was comparable, with sensitivity around 70% for meta-analyses of 20 experiments for each summary statistic and close to 100% for those including 50 or more experiments (Fig. 3A). *I*^{2} values were consistently low for *HR*-based meta-analyses, with few values over 25% (Fig. 3B), in keeping with the more conservative SE estimations discussed above. Conversely, *I*^{2} was consistently between 60% and 65% throughout the range of meta-analysis sizes for *MSR* data (Fig. 3C); in every instance, *I*^{2} was higher for the *MSR*-based approach than for *HR*. The global efficacy estimates were fairly consistent across both datasets, with slightly lower variance for the *MSR* meta-analyses (Fig. 3D, E).

We compared the ability of meta-regression to detect the predictive value of covariates on treatment outcome for both *MSR*- and *HR*-based meta-analysis, at both the univariate and multivariate stages. Throughout, alpha was set to 0.005 for univariate meta-regressions to account for multiple testing, and to 0.05 for multivariate meta-regressions. For univariate meta-regression, sensitivity was relatively low overall: the power to detect even the stronger associations (for example, with Var1 and Bin1) was below 50% in meta-analyses of 100 studies or fewer for each dataset. However, sensitivity increased to over 80% for strong associations at 200 studies, and for moderate associations (e.g. Var2, Var3, Bin2) at 1000 studies. Neither summary statistic held a major sensitivity advantage, although *MSR*-based meta-analysis slightly outperformed *HR*-based meta-analysis in every case. The type I error rate remained around 0.05 for each of the control variables (Var5, Bin3, Cont2) throughout the range of meta-analysis sizes (Fig. 4A).
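At the univariate stage, each covariate is regressed one at a time against the study-level effect estimates using the meta-analytic weights. A bare-bones weighted least-squares slope, shown here as our own illustrative helper rather than the routine actually used in the study, captures the core computation:

```python
def weighted_metareg_slope(x, y, w):
    """Weighted least-squares slope of study effect y on covariate x,
    with meta-analytic weights w (one covariate at a time, as at the
    univariate meta-regression stage)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Three equally weighted studies whose effects rise by 2 per unit of covariate:
slope = weighted_metareg_slope([0, 1, 2], [1, 3, 5], [1, 1, 1])  # → 2.0
```

Inference on such a slope (and hence the power figures quoted above) additionally requires its standard error and a test against the chosen alpha, which full meta-regression software provides.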

Finally, we undertook the same assessment using multivariate meta-regression. Again, sensitivity was relatively low but not dissimilar to that found at univariate meta-regression. Power remained below 50% (*α* = 0.05) even for strong associations when the meta-analysis size was 100 or fewer, but reached 80% or more for strong and moderate associations in larger meta-analyses, in the same manner as the univariate meta-regression (Fig. 4B). With *α* set as above, multivariate meta-regression appeared to confer a sensitivity advantage over univariate meta-regression (see Supplementary Material 7). Indeed, when *α* was increased to 0.05 for both strategies, there were no appreciable differences in sensitivity between the two techniques for *MSR* data (data not shown).

This section shows that, on repeated meta-analysis and meta-regression, the power of *MSR* to detect associations is equivalent, if not superior, to that of *HR* for small studies.

### Publication bias inclusion

We recreated the large dataset (*n* = 1.6 million individuals, 100,000 experiments) but introduced a file drawer effect, simulating publication biases of varying strengths by randomly discarding 0%, 25%, 50%, 75% or 100% of experiments for which there was no apparent treatment effect on log-rank testing. Of the 100,000 experiments, only 17,076 (17.1%) returned a significant log-rank test statistic, so the sample size for each dataset fell as the influence of the file drawer effect increased. Nevertheless, meta-analyses were only included in the analysis if their size was at least 90% of that intended (at least 18 experiments for an intended size of 20, 900 for an intended size of 1000, and so on).
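The file drawer mechanism itself is simple to express. A hedged sketch (our own formulation of the rule described above, operating on hypothetical `(estimate, logrank_p)` pairs):

```python
import random

def apply_file_drawer(experiments, drop_fraction, rng=random):
    """Discard each non-significant experiment (log-rank p >= 0.05) with
    probability drop_fraction; significant experiments are always retained.
    `experiments` is a list of (effect_estimate, logrank_p) pairs."""
    kept = []
    for est, p in experiments:
        if p < 0.05 or rng.random() >= drop_fraction:
            kept.append((est, p))
    return kept

# drop_fraction = 1.0 keeps only the significant experiments;
# drop_fraction = 0.0 keeps everything.
sample = [(0.1, 0.2), (0.5, 0.01), (0.3, 0.6)]
biased = apply_file_drawer(sample, 1.0)
```

At `drop_fraction = 1.0` only the 17.1% of experiments with a significant log-rank test would survive, which is the most extreme scenario examined here.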

Unsurprisingly, the sensitivity of meta-analysis to detect the treatment effect increased with increasing file drawer effect (Fig. 5A), despite the dramatic reduction in sample size. Because global efficacy estimates appeared not to vary greatly between meta-analysis sizes, we compared global efficacy estimates and *I*^{2} for meta-analyses of size 1000 only, as these returned the most precise estimates. The perceived global efficacy estimate increased dramatically as the influence of publication bias increased, with median survival ratios of 2.01 (ln*MSR* = 0.702) when all non-significant experiments were discarded and 1.47 (ln*MSR* = 0.387) with a file drawer effect of 75% (Fig. 5B). There was no clear association between the file drawer effect and between-study heterogeneity (Fig. 5C).

We also compared the performance of multivariate meta-regression of *MSR* data compromised by varying file drawer effects. While there was no major correlation between file drawer influence and meta-regression sensitivity, more biased datasets tended to suggest slightly higher power to detect associations of any strength, without a clear increase in the type I error rate (Fig. 6).

This section provides an estimate of the true influence of publication bias, which cannot be measured accurately from real data, because the problem is by its nature one of missing and irretrievable data.