Quantifying heterogeneity in individual participant data meta-analysis with binary outcomes

Background In meta-analyses (MA), effect estimates that are pooled together will often be heterogeneous. Determining how substantial heterogeneity is is an important aspect of MA. Method We consider how best to quantify heterogeneity in the context of individual participant data meta-analysis (IPD-MA) of binary data. Both two- and one-stage approaches are evaluated via simulation study. We consider conventional I 2 and R 2 statistics estimated via a two-stage approach and R 2 estimated via a one-stage approach. We propose a simulation-based intraclass correlation coefficient (ICC) adapted from Goldstein et al. to estimate the I 2, from the one-stage approach. Results Results show that when there is no effect modification, the estimated I 2 from the two-stage model is underestimated, while in the one-stage model, it is overestimated. In the presence of effect modification, the estimated I 2 from the one-stage model has better performance than that from the two-stage model when the prevalence of the outcome is high. The I 2 from the two-stage model is less sensitive to the strength of effect modification when the number of studies is large and prevalence is low. Conclusions The simulation-based I 2 based on a one-stage approach has better performance than the conventional I 2 based on a two-stage approach when there is strong effect modification with high prevalence. Electronic supplementary material The online version of this article (doi:10.1186/s13643-017-0630-4) contains supplementary material, which is available to authorized users.

Heterogeneity of effect estimates is an important consideration in both AD-MA and IPD-MA. Heterogeneity exists if the true effects vary across studies more than would be expected by chance alone. The estimated interstudy variance (τ 2 ) of the parameter of interest is the most direct measure of heterogeneity, but interpretation, particularly deciding what might be a problematic level of heterogeneity, is difficult, despite some practical suggestions [4]. The I 2 , originally proposed by Higgins and Thompson, meets three important criteria for any measure of heterogeneity: it monotonically increases with between-study variance; it is not varied by changing the scale; and it is not affected by the number of studies [5]. Importantly, despite some limitations [4], the I 2 remains the most often reported measure of heterogeneity and is easily interpretable, appealing to clinicians.
There are two approaches to analyze the data from IPD-MA: the two-stage approach and the one-stage approach.
In the two-stage approach, each study is analyzed separately, then standard meta-analytic techniques are applied, and heterogeneity may be quantified by usual methods. Alternatively, in the one-stage approach, a mixed model is fit and the data is analyzed altogether, accounting for the correlation that may exist between subjects in the same study and allowing the estimated effect to vary across studies. A review of statistical methods used in IPD-MA of binary outcomes found that most do not report any measure of heterogeneity [6]. While some measures of heterogeneity are easily obtained from a one-stage model, the I 2 is not. Our objective in this work was to consider various approaches to quantifying heterogeneity in IPD-MA of binary outcomes analyzed via the one-stage approach. We propose a method to obtain an I 2 from a one-stage model and evaluate it and other possible measures via simulation study.

Metrics of heterogeneity in IPD-MA with binary outcomes
In this section, we describe various measures of heterogeneity that may be used. Here, we consider that the primary analysis is a one-stage analysis of IPD-MA of dichotomous outcome data. Below, we describe four possible measures of heterogeneity: (1) the conventional I 2 from the corresponding two-stage analysis; (2) the R from the corresponding two-stage analysis; (3) a new metric: the I 2 from the one-stage approach; and (4) the R from the one-stage approach.

Between-study variance (τ 2 )
The between-study variance, τ 2 , quantifies the heterogeneity in IPD-MA directly. A large value ofτ 2 indicates that heterogeneity exists among the studies. However, the τ 2 is not ideal, since interpretation is difficult: there is no standard criteria to determine the level of heterogeneity (low, moderate, substantial), because the range is from 0 to ∞ [5,7]. All other approaches to quantify heterogeneity rely on τ 2 .
We might estimate the τ 2 via the two-stage or the one-stage approach. For the two-stage approach, we estimate τ 2 two−stage via the method described by DerSimonian, Laird, and Whitehead [7][8][9]: where N is the number studies,ω i is the reciprocal of estimated within-study variance, and Q represents Cochran's heterogeneity statistic [5,10,11]. A two-stage analysis proceeds as follows. Consider a MA of a binary outcome in N studies. In the first stage, we fit the logistic regressions in each of the N studies: where p j is the true response probability for the positive result of the jth individual in this study, β 0 represents the intercept, and x j indicates their treatment status. This model could be expanded to include effect modifiers.
In the first stage, we obtainβ 1i the estimated log odds ratio in study i for i = 1, 2, .., N [12], and the variance of the estimated log odds ratio (var(β 1i )) for each one of the N studies.
In the second stage, we consider: where τ 2 β 1 (τ 2 two−stage ) represents the respective degree of heterogeneity between studies [12]. Here, we assume the covariance between the parameter estimates (β 0i and β 1i ) are equal to 0, which means that we pool the treatmentoutcome associations (β 1i ) together [12]. This is similar to the classic DerSimonian and Laird random-effects model [8,13] and allows us to obtain an estimate of the betweenstudy variance τ 2 two−stage [12], as in Eq. 1. For the one-stage approach, with binary data, we may estimate the τ 2 one−stage from a generalized linear mixed model (GLMM) [14][15][16]. Under the one-stage randomeffects model, for each study, a study-specific intercept and treatment effect may be estimated. The study-specific intercept and treatment effects are assumed to come from a bivariate normal distribution [3,17,18]. Considering again a MA of a binary outcome in N studies: where p ij represents the true response probability for the positive result of the jth individual in the ith study and x ij indicates their treatment status. β 1 is the parameter of interest, which represents the pooled log odds ratio, and τ 2 1 is the between-study variance (τ 2 one−stage ).

I 2 statistic
Using a two-stage approach, consider a MA of N studies for the parameter of interest, called θ. Under the assumption that the estimated sampling variances are known and equal for all studies (σ 2 1 = σ 2 2 = σ 2 3 = ... = σ 2 N = σ 2 = 1/ω i ), Higgins and Thompson [5] defined a measurement function I 2 for quantifying the unexplained heterogene- They proposed to estimate I 2 two−stage as [5]: wherê While the I 2 is usually presented as a percentage varying from 0 to 100%, we present it as a proportion varying from 0 to 1.
In clustered data analyses, the I 2 is very similar to the intraclass correlation coefficient (ICC) [5,19]. The ICC is the ratio of the between-cluster variance to the total variance in the outcome [20]. It provides a quantitative measure of the amount of heterogeneity across clusters [14]. With binary data, estimating the ICC from a GLMM is possible, though more complicated [14][15][16]. Several measures have been proposed as estimators of the ICC for binary data [14,21,22], though none have been evaluated as measures of inter-study heterogeneity for IPD-MA. Goldstein et al. proposed a simulation-based approach that relies on partitioning the variation in the multilevel model to estimate an ICC for binary outcomes [22]. We propose to adapt this ICC estimator to estimate the I 2 in a one-stage IPD-MA. The algorithm is as follows: Step 1 Fit a random effects model to the data by using a GLMM, and adjust for possible effect modifiers if desired.
Step 2 Simulate a large number (e.g. m = 5000) of values from a normal distribution (e.g. Eq. 5), using the estimated covariance matrix from the multilevel random effect logistic regression fitted in Step 1. We denote these as μ 0,ij , μ 1,ij .
Step 3 Using the fitted model from Step 1, we estimate the log odds ratio for each subject (ν 1,ij =β 1 + μ 1,ij ) in the dataset, and then estimate the variance as ν 1 = V (ν 1,ij ).
Step 4 Estimate p ij by using the fitted model (from Step 1) and simulated random effect values (from Step 2).
• replacing (μ 0i , μ 1i ) by (μ 0,ij , μ 1,ij ) for each subject • plugging in the fixed effect estimator from the fitted model and the covariates from the dataset • taking the inverse logit, we obtain thep ij for each individual Step 5 Using the results from Step 4, it is easy to deduce the variance of the estimated log odds ratio via the Delta method, ν 2,ij = 1 np ij (1−p ij ) , where n is the average number of subjects among all of the studies. Finally, we obtain ν 2 = E(ν 2,ij ).
Step 6 The I 2 one-stage is now estimated aŝ

R statistic
The R statistic is the square root of the ratio of the variance of the summary statistic from the random-effects model divided by the variance of the summary statistic from the fixed-effects model. It quantifies the inflation of the confidence interval in the presence of inter-study heterogeneity [5]. If the estimated value of R is close to 1, then inference from the random-and fixed-effects models are similar [5]. However, unexplained between-study heterogeneity may exist when the estimate of R is greater than 1. Interpretation of R is difficult, for the same reasons as for τ 2 .
For the two-stage approach, we may estimate R two−stage asR where se(β 1F ) is the standard error of the estimated pooled log odds ratio in the fixed-effects model and se(β 1R ) is the standard error of the estimated pooled log odds ratio in the random-effects model (Eq. 3).
For the one-stage approach, we may fit a GLMM with random intercept and fixed slope. The standard error of the estimated β 1 (pooled log odds ratio) from that model with fixed intercept may be denoted as se(β 1F ). We denote the standard error of the estimated β 1 from model 4 as se(β 1R ). The estimated R one−stage from a one-stage model may then be estimated using Eq. 9.

Methods
We use simulations to investigate the performance of (i) conventional I 2 and R 2 based on a two-stage approach and (ii) simulation-based I 2 and R 2 based on a one-stage approach. We generated datasets that consisted of three variables: the binary treatment status, a binary effect modifier, and a binary outcome. Each combination of data generation parameters was used to generate 1000 datasets. We considered 84 distinct data generation scenarios (see Table 1).

Data generation details
The treatment variable is the covariate of primary interest in the IPD-MA; the effect modifier changes the effect of treatment on the outcome when present.

The number of studies and subjects
The number of studies in each dataset was given by N and was set to 15 or 30. The number of the subjects within each study (n i , i = 1, ..., N) followed a log-normal distribution, LN(σ 2 lg = 1.5 2 , κ = 10), that was truncated at 20 and 2000 (to avoid very small and large studies), and rounded to the nearest integer value. True between-study variance (τ 2 ) 0.5, 1, 1.5 With strong effect modification (β w , β xw ) (2, 5)

Treatment status
The prevalence of treatment for each study was generated from a uniform distribution, p x i ∼ U(θ lower = 0.4, θ upper = 0.6). Using these study specific treatment prevalences, we generated n i random variables from a Bernoulli distribution for each subject, as x ij ∼ Bernoulli(p x i ). These x ij represented the treatment status of subject j in study i.

Effect modifier
We generated a binary effect modifier. First, N studyspecific effect modifier prevalences were generated as p w i ∼ Uniform(θ w lower = 0.1, θ w upper = 0.9). Then, using these probabilities, we obtained the effect modifiers from the Bernoulli distributions, w ij ∼ Bernoulli(p w i ).

Outcomes
We generated the outcome y ij based on the generated values of the treatment and effect modifier, as well as the regression coefficients that described the association of each of these with the binary outcome, using the following equation: β 0 , the fixed intercept, was set based on the given value of prevalence p pre , where β 0 = log p pre 1−p pre . The prevalence (p pre ) was set at 0.3 or 0.7. The random intercepts for individuals within each study were μ 0,i ∼ Normal(0, σ 2 μ 0 ), where σ 2 μ 0 was given. The true pooled treatment effect was β 1 . Furthermore, μ 1,i was the study-specific random effect for the slope, which followed a normal distribution with zero mean and variance τ 2 . β w and (β w + β xw ) were the log odds ratio of the effect modifier in untreated and treated individuals, respectively. The parameter value used to generate the random intercepts (σ 2 μ 0 ) was given by 1 and the fixed interested β 1 was given by log(1.3). The parameter used to generate the random slopes (τ 2 ) was set to 0.5, 1, or 1.5.
Using Eq. 10, we obtained p ij . Participant level probabilities of outcome were calculated as π ij = e p ij 1−e p ij , then y ij was generated from a Bernoulli(π ij ) distribution.

Datasets
We contemplated two scenarios, including (i) no effect modification and (ii) effect modification, by varying the data generation parameters (β w , β xw ).
Our rationale was to evaluate each measure of heterogeneity according to the following: (i) Did the measures of heterogeneity increase with increasing τ 2 in datasets that were generated such that there was no effect modification? (ii) Did the measures of heterogeneity decrease when the effect modifier and an interaction term between treatment and the effect modifier were included in the model when effect modification was present?
Furthermore, we investigate whether the simulationbased I 2 satisfied the criteria proposed by Higgins et al.: (i) monotonically increasing with increasing between-study variance; (ii) not varied by changing scale; and (iii) not affected by the number of studies [5].

IPD-MA with no effect modification
To generate datasets with no effect modification, we set β w and β xw to zero.

IDP-MA with effect modification
We varied β w and β xw to generate datasets with weak or strong effect modification, as presented in Table 1.

Data analysis
For each generated dataset, we considered both two-stage and one-stage approaches to quantifying heterogeneity.

Two-stage approach
In this approach, each study is analyzed separately then pooled together using methods described in "Betweenstudy variance (τ 2 )" section [12].
In the first stage, we considered two logistic regression models for each study in the dataset: (i) a crude model (logit(p j ) = β 0 + β 1 x j ) and (ii) an effect modification model ( logit(p j ) = β 0 +β 1 x j +β 2 w j +β 3 x j w j ), where p j was the true response probability for the positive result of the jth individual in this study, β 0 represented the intercept, x j indicated the treatment status, and w j was the effect modifier for the jth individual in this study.
When the IPD were generated without effect modification, we fitted the crude model to estimate the pooled treatment effect. When the IPD were generated with effect modification, we considered a crude model and a model that included the effect modifier, the treatment, and an interaction term between the effect modifier and the treatment to estimate the pooled treatment effect.
In the first stage, we estimated the log odds ratioβ 1i (i = 1, ..., N) from each study. In the second stage, we pooled these together via the DerSimonian and Laird method and, estimated the between-study variance (τ 2 ) and the pooled treatment effect. We also applied the methods described in "I 2 statistic" and "R statistic" sections to estimate the I 2 two−stage and R 2 two−stage for quantifying the heterogeneity in a two-stage IPD-MA.

One-stage approach
For each generated dataset, we fitted a logistic regression with random intercept and slope for studies, estimated via adaptive Gauss-Hermite quadrature. Similar to the twostage approach, we considered the following models: In all models, p ij was the true response probability of disease for the jth individual in the ith study, x ij indicated treatment status, and w ij represented the effect modifier. The random intercept and slope were μ 0i , μ 1i , such that: whereτ 1 was the estimated between-study variance of the treatment effect. We estimated I 2 one−stage via the simulation-based method, and R 2 one−stage was computed by the ratio of the estimated variance of β 1 under a random slopes model and a fixed slope model with random intercepts, as described in the "I 2 statistic" and "R statistic" sections.

Metrics and performance
We collected I 2 , R 2 , and τ 2 as estimated from both twostage and one-stage approaches in each generated dataset for all combinations of data generation parameters. We estimated the median and interquartile range (IQR) from 1000 datasets. If the dataset was generated with effect modification, then the median and IQR of the ratio of the I 2 as estimated from a crude model to that estimated from a model that included the effect modifier and the interaction between the effect modifier and treatment status I 2 emod I 2 crude was collected. Similar measures were reported for R 2 and τ 2 . We collected the ratios because we wanted to investigate the differences inÎ 2 ,R 2 , andτ 2 before and after taking effect modification into account. All statistical analysis were carried out in R, version 3.2.3 [23]. Figure 1 shows the estimated between-study varianceτ 2 from both the two-stage (dashed line) and the one-stage (dotted line) approaches versus the true between-study variance, τ 2 (solid line). As the true τ 2 increased, the estimatedτ 2 from both approaches also increased. Compared with the estimatedτ 2 from a two-stage model, theτ 2 from the one-stage model increased more rapidly. The twostage model always underestimatedτ 2 . On the other hand, the one-stage approach very slightly underestimatedτ 2 when the true τ 2 was small, and it overestimatedτ 2 when the true τ 2 was larger than 1.3. Figure 2 shows the conventional I 2 from the twostage model (dashed line) and the simulation-based I 2 from one-stage model (dotted line) versus the true between-study variance τ 2 . Both measures increased, then leveled off as the true between-study variance increased. Table 2 presents the median value and IQR of theÎ 2 andR 2 across 1000 datasets from the two-stage and onestage models for different combinations of data generation parameter values. The median estimated I 2 and R 2 from   a Please note that I 2 is presented here as a proportion varying from 0 to 1, rather than as a percentage prevalence from 30 to 70% did not affect the estimates of I 2 and R 2 via two-and one-stage models. Furthermore,τ 2 from the two-and one-stage approaches were similar for different prevalence and number of studies (Additional file 1: Table S1). Table 3 presents the median value and IQR of the ratio of I 2 and R 2 from a model that ignored the effect modifier to one that included the effect modifier and an interaction term between it and the treatment status across 1000 datasets from the two-stage and one-stage approaches with prevalence = 30%. Any measure of heterogeneity should be sensitive to changes in heterogeneity. If we did not account for effect modification when it existed, then heterogeneity might arise due to this effect modification [24]. Hence, if the ratio estimators reported in the Table 3 are less than 1, they indicate good sensitivity of the measure to changing heterogeneity.

With effect modification
When the strength of effect modification was weak, the ratio estimators for I 2 two−stage were well below 1, while the ratio estimators for I 2 one−stage were close to 1. When the strength of effect modification was moderate or strong, we found the ratio estimators for I 2 one−stage were below 1, suggesting the estimated I 2 one−stage was sensitive to changing heterogeneity. When the prevalence increased from 30 to 70%, in the two-stage model, almost all values of the ratio estimator for τ 2 were equal to 1 (Additional file 1: Table S3), as were most values of the ratio estimator for I 2 two−stage (Additional file 1: Table S2). The estimated R 2 from the two-stage model had similar performance. Conversely, most of the ratio estimators for I 2 one−stage were less than 1 when we fixed the prevalence to be 70% and varied other parameter values. However, the convergence rate

Median (IQR) was presented
We present the ratios of the measure estimated from a model that ignored the effect modifier to one that included the effect modifier and an interaction term between it and the treatment status a Effect modification was classified as weak when β w = 1, β xw = 1, as moderate when β w = 1, β xw = 3, and as strong when β w = 2, β xw = 5 for one-stage approach decreased as the strength of effect modification became stronger (data not shown). When the number of studies and prevalence were 30 and 30%, most of ratio estimators for I 2 two−stage were equal to 0.01. This occurred because the estimated τ 2 two−stage from the effect modification model was close to zero (Additional file 1: Table S3).
Furthermore, in Additional file 1: Table S3, the ratio estimators for τ 2 in the two-stage model were all less than or equal to 1. However, most of ratio estimators for τ 2 in the one-stage model were larger than 1.

Discussion
IPD-MA are the gold standard of meta-analytic approaches. While the primary objective of most IPD-MA is to estimate pooled treatment effects, quantifying inter-study heterogeneity of those effects is also an important goal. Most statisticians agree that a one-stage approach is the best and most flexible approach to use when analyzing data from IPD-MA. However, how best to quantify inter-study heterogeneity in that case is unclear [3,5,12], and most IPD-MA of binary outcomes do not report any measure of heterogeneity [6].
In this work, we considered using usual measures of heterogeneity based on two-stage approaches, as well as novel approaches based on a one-stage model. We evaluated both two-stage and one-stage approaches via simulation studies. In the two-stage approach, we used the usual I 2 and R 2 statistics proposed by Higgins et al. to measure heterogeneity [5]. In the one-stage approach, we adapted a simulation-based ICC proposed by Goldstein et al. to estimate the I 2 , as well as considering the R 2 based on the one-stage model.
Our results demonstrated that when there was no effect modification, the estimated τ 2 from the two-stage model was always underestimated. When using a onestage approach, the estimated τ 2 was underestimated when the true τ 2 was small, but overestimated when the true τ 2 was large. Correspondingly, we may assume that the estimated I 2 from the two-stage model was underestimated, whereas the simulation-based I 2 in the one-stage model was underestimated when inter-study heterogeneity was small and overestimated when it was large. Both the two-stage I 2 and one-stage I 2 increased as the true τ 2 increased.
Including a variable and the interaction of that variable and the treatment of interest when effect modification is present should decrease the estimated between-study heterogeneity. In the presence of weak effect modification, the estimated I 2 from the two-stage model that accounted for the effect modification was less than that from a model that did not. Nevertheless, the estimated I 2 from the one-stage approach that accounted for effect modification quantified heterogeneity well when the strength of effect modification was moderate or strong. The I 2 from the two-stage model was less sensitive to reflect the strength of effect modification when the number of studies was large and prevalence was low. Overall, this suggests that using the simulation-based I 2 based on one-stage model is preferable.
Differences between measures of heterogeneity in the two-stage and one-stage approaches might be due to real differences in the methods, or because slightly different models were used. In the one-stage approach, we only considered models that fit a random intercept and slope, while the two-stage approaches fit just a random slope. However, these are the approaches most commonly used [6].

Strengths of the work
We have proposed a simulation-based I 2 to use in onestage IPD-MA of binary outcomes. We have shown that this I 2 satisfies the conditions proposed by Higgins et al., for any measures quantifying heterogeneity, i.e., (i) the measurement function should monotonically increase with increasing between-study variance τ 2 and (ii) not be affected by the number of studies N [5]. Moreover, we have shown that the simnulation-based I 2 is sensitive to changes in heterogeneity.
When the outcome is binary, the within-study variance varies across the studies as between-study variance increases [7]. As a result, the assumption of equal estimated sampling variances across all studies, as in Higgins and Thompson's paper [5], does not hold, and Higgins's I 2 may be biased. For that reason, we would expect the simulation-based I 2 based on the one-stage approach to have better performance than the conventional I 2 based on the two-stage approach.
Using a heterogeneity measure based on the onestage model is also advantageous, because the one-stage approach allows investigation of patient-and study-level covariates, and the treatment effect can be adjusted for covariates at the participant-or study-level [18]. Moreover, the one-stage model allows MA of dose-response curves (e.g., allowing non-linearity), improves power for interactions [25,26], and allows better control of effect modification by patient-and study-level covariates than the two-stage approach [3,17,27].
While we investigated its performance for binary outcomes, using the ICC as an I 2 for continuous outcomes in the context of a mixed model would be possible, though to our knowledge has not been used like this.

Limitations
There are several limitations in this work. We only considered the ICC estimator proposed by Goldstein to estimate the I 2 in one-stage IPD-MA of binary outcomes. However, there are several other measures that have been proposed as estimators of the ICC for binary data [14,21,22]. Wu et al. discussed five different methods to estimate the ICC with binary outcomes: an analysis of variance (ANOVA) estimator, the Fleiss-Cuzick estimator, the Pearson estimator, an estimator based on generalized estimating equations (GEE), and an estimator from the random intercept logistic model [20]. These could be adapted to estimate I 2 in one-stage IPD-MA. On the other hand, the measure we have proposed is easy to estimate.
Moreover, GLMMs estimated via adaptive quadrature sometimes do not converge in the one-stage model [19]. Indeed, we observed a sometimes important rate of nonconvergence when the strength of effect modification was strong and the prevalence was high. Other estimation approaches such as penalized quasi-likelihood (PQL) or Bayesian multilevel models might be interesting to explore [28,29]. While convergence of PQL is more likely, estimates can be biased with few subjects per cluster, low event rates, or high inter-cluster variability [7,29,30].
For the two-stage approaches, we estimated τ 2 via the method of moments estimator of DerSimonian and Laird, despite more recent evidence suggesting that the Paule and Mandel estimator is preferred [31].
Finally, we invsestigated a finite number of data generation parameters. In particular, we considered datasets of 15 or 30 studies, whereas it may have been interesting to consider fewer (e.g., 5).

Conclusion
When a one-stage approach for IDP-MA of binary outcomes is preferred, heterogeneity should be quantified via the model estimated. In that case, we have proposed a simulation-based I 2 that performs as well or better than the conventional I 2 based on a two-stage approach. The R 2 based on the one-stage model also performed adequately but is more difficult to interpret.

Additional file
Additional file 1: Table S1. The Median (IQR) of the estimated τ 2 when no effect modification was present. Table S2. Sensitivity of heterogeneity measures to accounting for effect modification when prevalence of the outcome was 70%. Median (IQR) was presented.