Approximately 50% of systematic reviews use statistical techniques to combine study results, and most of these assess consistency across the included studies [17]. Several studies report that tests for the presence of heterogeneity are frequently performed in meta-analyses, that they are often statistically significant, and that a variety of methods are used to explore heterogeneity, including subgroup analyses and meta-regression [18–21]. Meta-regression may produce spurious findings when performed on a small number of studies or when investigating multiple covariates [4].

Our study found that backwards stepwise meta-regression using large-sample methods and using permutation-based resampling resulted in identical final models. To the best of our knowledge, this is the first empirical test of permutation in meta-regression modeling using an elimination procedure. The results are surprising given that we had only 5 to 6.25 trials per covariate in all of the initial models, well below the recommended 10 trials per variable [9–12]. Because this rule of thumb was derived from simulations of logistic and Cox modeling, it may not reliably apply to meta-regression on continuous outcome variables. For example, a simulation study of linear and logistic regression led to the proposed rule of thumb that, to avoid overfitting with forward stepwise selection procedures, more than four events per variable are required [22–25]. If we extend this observation to linear meta-regression with a backwards stepwise selection procedure, our results make empirical sense. We therefore recommend that further simulations be performed to test the ten-trials-per-variable rule of thumb.

We also found that the *P* values obtained using permutation tests are more conservative, that is larger, than *P* values obtained using standard meta-regression methods. Specifically, compared with standard large-sample methods of obtaining *P* values, the *P* values for significant covariates obtained with stepwise meta-regression were larger under permutation 78% of the time and identical 22% of the time. This finding is particularly important when *P* values are near a set level of statistical significance (for example, 0.05 or 0.01), since even a small increase in the *P* value may render the result non-significant. It provides empirical evidence that permutation tests increase *P* values and that permutation can be used in meta-analyses, especially when examining the effects of multiple covariates, when faced with a relatively small number of included primary studies, or when a large amount of statistical heterogeneity is present. However, these results do not indicate that permutation-based resampling protects against associations arising from ecological bias: some associations found at the group level, in this case the study level, may not apply at the individual level.
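To make the comparison concrete, the following is a minimal sketch of a permutation test for a single study-level covariate, assuming a fixed-effect (inverse-variance weighted) meta-regression and hypothetical data; the study itself used stepwise random-effects models, for which implementations such as Stata's `metareg` provide a permutation option.

```python
import numpy as np

def perm_test_metareg(y, se, x, n_perm=10000, seed=0):
    """Two-sided permutation test for one study-level covariate in an
    inverse-variance weighted (fixed-effect) meta-regression.

    y  : study effect estimates
    se : their standard errors
    x  : study-level covariate values
    Returns (observed t statistic, permutation P value).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    w = 1.0 / np.asarray(se, float) ** 2  # inverse-variance weights

    def t_stat(xv):
        # Weighted least-squares fit of y on an intercept and xv.
        X = np.column_stack([np.ones_like(xv), xv])
        XtWX = X.T @ (w[:, None] * X)
        beta = np.linalg.solve(XtWX, X.T @ (w * y))
        cov = np.linalg.inv(XtWX)
        return beta[1] / np.sqrt(cov[1, 1])

    x = np.asarray(x, float)
    t_obs = t_stat(x)
    # Shuffle the covariate across studies, refit, and count permutations
    # at least as extreme as the observed statistic; the +1 correction
    # keeps the estimated P value away from exactly zero.
    hits = sum(abs(t_stat(rng.permutation(x))) >= abs(t_obs)
               for _ in range(n_perm))
    return t_obs, (hits + 1) / (n_perm + 1)
```

When several covariates are examined, the same resampling can be extended to adjust for multiplicity by comparing each observed statistic with the per-permutation maximum across all covariates, which is why permutation *P* values tend to be larger than their large-sample counterparts.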

Our findings are similar to those of Higgins and Thompson [4], who found that when examining multiple covariates in a random effects analysis, permutation resulted in generally larger *P* values, though not in all instances. For example, they found that for the covariate 'intention to treat analysis' the *P* value obtained by meta-regression was larger than that obtained with permutation. We found that, of nine significant covariates, the *P* values for seven were larger in the permutation tests; in the other two instances the *P* values were identical. The relatively low heterogeneity in the latter final model (45.8%), or the relatively normal distribution of effects in that model, may account for this finding. It is also possible that the level of statistical heterogeneity plays a larger role than the number of trials in meta-regression. This hypothesis remains to be tested.

Meta-regression may result in spurious findings when examining multiple covariates, when the number of primary studies is small [4, 25–28], or when there is a large amount of statistical heterogeneity. In these circumstances permutation tests temper *P* values when exploring heterogeneity and lead to more conservative probability estimates; in some cases the *P* value may cross over to non-significance. This is important in meta-analytic research, since meta-analyses often include a small number of primary studies [4, 15], included studies often fail to report information important for these analyses [19], and meta-analyses often exhibit significant statistical heterogeneity [28–30]. The ability to accurately explore the reasons for heterogeneity and the influence of specific covariates could yield increasingly specific and clinically relevant findings in meta-analyses and lead to valid hypothesis generation for future clinical trials, potentially saving resources including time, healthcare expenditure, and funding allocation. In agreement with previous recommendations [4], we advise that permutation tests be used in all meta-analyses that include a small number of clinical trials per covariate (roughly five or fewer to ten) or that have considerable or substantial heterogeneity.

This study has several strengths. First, we included a large number of primary studies and cross-checked the data extractions. By including a large number of studies we performed an empirical test of permutation, which goes beyond previous simulations [4]. Next, we used a set of covariates that have empirical and theoretical relationships with our outcome variables. For example, it is well known that adequate randomization sequence generation and allocation concealment are important predictors of effect size [2, 3, 29, 30]. Also, it follows from basic pharmacokinetics that differing doses of the active herbal medicine or its active constituent predict clinical responses [10]. An additional strength is that we ran 10,000 iterations in the permutation tests. Because permutation tests are random processes, a larger number of permutations yields more stable estimates, so repeated runs on the same variable produce very similar *P* values. Our permutation test *P* values are therefore likely robust. Finally, this study extends the statistical techniques that can be used in meta-analyses, giving reviewers an empirically validated method for analyzing the influence of covariates.
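The stability gained from 10,000 iterations can be quantified with the binomial Monte Carlo standard error of an estimated permutation *P* value; the figures below are a general back-of-the-envelope check, not numbers from the study.

```python
import math

def perm_p_mc_se(p, n_perm):
    """Binomial Monte Carlo standard error of a permutation P value
    estimated from n_perm random permutations."""
    return math.sqrt(p * (1 - p) / n_perm)

# A P value near 0.05 estimated from 10,000 permutations varies across
# reruns by only a couple of thousandths (one standard error):
se_10k = perm_p_mc_se(0.05, 10_000)   # ~0.0022
# The same P value from only 1,000 permutations is about three times noisier:
se_1k = perm_p_mc_se(0.05, 1_000)     # ~0.0069
```

This is why two runs of a 10,000-iteration permutation test on the same covariate return nearly identical *P* values.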

A drawback of this study is that it may not include all RCTs examining the herbal medicine-outcome pairings chosen. That is, the treatment effect estimates presented in Table 2 may not be 'true' estimates of the effect across all published trials. We did not set out to perform a comprehensive meta-analysis of each pairing or to describe the actual extent of bias or influence of certain covariates on summary effect estimates. Our objective was to explore the difference in *P* values obtained from standard meta-regression and permutation tests on a sample of trials for several covariates; this project was, in effect, an empirical exercise comparing two statistical techniques for arriving at *P* values. Another potential drawback is that at no point did the permutation test change a significant *P* value to a non-significant one. Although we did not observe this, such a change would be expected in principle, since in most instances the permutation *P* value exceeded the meta-regression value, and it is conceivable that such increases would cause a marginally significant *P* value obtained with meta-regression to become non-significant with permutation [4]. Also, although *P* values are often used to determine the existence of covariate-based effect modification, one should complement them with the actual differences in effect estimates to determine whether changes in the effect are clinically significant. Further research could build upon the intersection of statistical and clinical significance in effect modification. Finally, data from eight trials for the SJW depression pairing were extracted from Cochrane reviews [31, 32]; any errors of extraction in those reviews will therefore carry over into our extractions and the resulting data analyses we present.