Table 8 Framework for assessing credibility of subgroup analyses

From: An approach to addressing subpopulation considerations in systematic reviews: the experience of reviewers supporting the U.S. Preventive Services Task Force

  Questions to consider for credibility assessment
Likelihood that subgroup effects are SPURIOUS
MAIN DOMAIN: Was a statistical test for interaction performed and did it indicate effect modification? [24, 53]
The statistical test of subgroup-intervention effect interaction assesses whether the effect differs significantly between subgroups, rather than only assessing the significance of the intervention effect in one subgroup or the other [54]. If the p value for the interaction test is <0.05 (or below a more stringent alpha), the effects are considered to differ between subgroups [54]. If there are multiple subgroup-treatment effect interactions, further statistical analyses are required to confirm whether the effects are independent [54].
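As an illustration of what such an interaction test can look like, the sketch below compares two subgroup effect estimates with a simple z-test on their difference. This is one common approach and is not necessarily the test used in any given trial; it assumes the two subgroups are independent and that the estimates are on an approximately normal scale (for example, log relative risks). The function name and the input values are hypothetical.

```python
import math


def interaction_test(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference between two subgroup estimates.

    Assumes independent subgroups and approximately normal estimates
    (e.g., log relative risks with their standard errors).
    """
    diff = est_a - est_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    # Two-sided p value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical inputs: relative risk 0.70 in subgroup A, 0.95 in subgroup B.
z, p = interaction_test(math.log(0.70), 0.10, math.log(0.95), 0.12)
print(f"z = {z:.2f}, interaction p = {p:.3f}")
```

The point of the sketch is that the test is applied to the difference between subgroup effects, not to each subgroup's effect separately.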
When was the subgroup-specific analysis specified?
Determine when the subgroup analyses were specified in the study [24, 54]. An a priori subgroup analysis is one that is planned and documented before examination of data, preferably in the study protocol, and ideally includes a hypothesized direction of effect. When reported, this information can often be found in the methods section of the article. Subgroup-treatment effect interactions identified post hoc must be interpreted with caution, because no statistical tests of significance are considered reliable in this scenario [54].
Was the total number of subgroup analyses limited to a small number of clinically important questions (i.e., ≤5)?
This is a study-specific factor, rather than a subgroup-specific one. Subgroup analyses should be limited to a small number of clinically important questions in each study, and ideally limited to the primary trial outcome [8, 54]. Sun et al. suggest there should be five or fewer subgroup hypotheses tested [24].
If conducting a large number of subgroup analyses, was the statistical significance threshold adjusted (e.g., using a lower p value than 0.05)?
This is a study-specific factor. Because the probability of a false positive result is high when a large number of subgroup analyses are conducted, studies can correct for the inflated false positive rate by adjusting the significance threshold for their interaction tests [55]. For example, if 10 tests are conducted, each one could use a threshold of 0.005; if 20 are conducted, each one could use a threshold of 0.0025. These thresholds are calculated as 0.05/K, where K is the number of independent tests conducted; this adjustment ensures that the overall chance of a false positive result is no greater than 5% [55].
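The 0.05/K adjustment described above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the overall false positive rate is computed under the assumptions that the K tests are independent and that no true subgroup effect exists.

```python
def adjusted_threshold(k, alpha=0.05):
    """Per-test significance threshold alpha/K for K tests."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return alpha / k


def overall_false_positive_rate(k, per_test_alpha):
    """Chance of at least one false positive across K independent tests,
    assuming no true subgroup effect exists in any of them."""
    return 1 - (1 - per_test_alpha) ** k


for k in (10, 20):
    threshold = adjusted_threshold(k)
    unadjusted = overall_false_positive_rate(k, 0.05)
    adjusted = overall_false_positive_rate(k, threshold)
    print(f"K={k}: threshold={threshold:.4f}, "
          f"overall rate unadjusted={unadjusted:.3f}, adjusted={adjusted:.3f}")
```

With the adjusted threshold, the overall rate stays just under 5% in both cases, whereas testing each of 10 hypotheses at 0.05 gives an overall rate of roughly 40%.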
Likelihood of CONFOUNDING of subgroup analysis
MAIN DOMAIN: Was the subgroup analysis potentially confounded by another study variable?
In subgroup analyses in RCTs, the primary intervention is randomized but the secondary factors defining subgroups usually are not [56]. Controlling for variables that confound the secondary factor defining a particular subgroup is important when investigators are interested in intervening on the subgroup factor to increase the intervention effect. This information may help reviewers judge how much concern to give to possible confounding.
Were the intervention arms comparable at baseline for the subgroup of interest?
For example, if the subgroup of interest is sex, the systematic reviewer should try to confirm that males in the intervention group were comparable to males in the control group. Similarly, females in the intervention group should be comparable to females in the control group. If the stratified intervention arms are not comparable at baseline, secondary factors affecting comparability could be confounding study variables [54].
Was the subgroup variable a characteristic specified at baseline (in contrast with after randomization)?
Defining subgroups by a baseline characteristic ensures that the benefits of randomization are maintained throughout the duration of the study and reduces the possibility of confounding [8]. The credibility of subgroup hypotheses based on post-randomization characteristics can be severely compromised, since any apparent difference in intervention effect could potentially be explained by the intervention itself or by different prognostic characteristics in subgroups that emerge after randomization [57]. Analyses based on characteristics that emerge during follow-up violate the principles of randomization and are less valid [26].
Was the subgroup variable a stratification factor at randomization?
Randomization stratified by a priori subpopulations ensures a comparable distribution of other characteristics, including potential confounding factors, between subgroups on this factor [24, 54]. With stratified randomization, a separate randomization procedure is carried out within each subset of participants.
Likelihood of inadequate POWER to detect subgroup differences
Was the trial powered to detect subgroup differences?
If important subgroup-intervention effect interactions are anticipated, trials should be powered to detect them reliably [18, 54]. If a trial is underpowered for the main outcomes of interest, it is almost never adequately powered for a subgroup analysis.
If a study did detect a difference in subgroup effect, inadequate power would be assessed as very unlikely for this domain. The power calculation rests on assumptions, such as an estimate of the difference that might exist, and these matter little once a significant difference has been found. If a study does not detect a difference, it is very relevant to assess whether the study was underpowered.
To inform judgments made about the evidence, the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) Working Group suggests that systematic reviewers consider the optimal information size (OIS) threshold as an additional criterion for adequate precision. The OIS is reached if the total number of patients included in a systematic review is equal to or greater than the number of patients generated by a conventional sample size calculation for a single adequately powered trial [58]. Another potential application of the OIS criterion could be to indicate potential power issues in important subgroup analyses.
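A rough sketch of how an OIS check might be carried out follows. The source does not say which sample size formula to use; this example assumes a dichotomous outcome and the standard normal-approximation formula for comparing two proportions, with a two-sided alpha of 0.05 and 80% power. The function names and all input values are hypothetical.

```python
import math
from statistics import NormalDist


def ois_total(control_risk, intervention_risk, alpha=0.05, power=0.80):
    """Total sample size (both arms) for a single two-arm trial with a
    dichotomous outcome, using the normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = (control_risk * (1 - control_risk)
                    + intervention_risk * (1 - intervention_risk))
    delta = control_risk - intervention_risk
    n_per_arm = (z_alpha + z_beta) ** 2 * variance_sum / delta ** 2
    return 2 * math.ceil(n_per_arm)


def meets_ois(total_patients, required_total):
    """True if the pooled number of patients reaches the OIS threshold."""
    return total_patients >= required_total


# Hypothetical scenario: 20% control risk, 25% relative risk reduction.
required = ois_total(0.20, 0.15)
print("OIS (total patients):", required)
print("Whole review, 2400 patients:", meets_ois(2400, required))
print("Subgroup only, 600 patients:", meets_ois(600, required))
```

Under these assumptions, a review can reach the OIS overall while an individual subgroup falls well short of it, which is the kind of power issue the criterion could flag in subgroup analyses.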