A comparison of two assessment tools used in overviews of systematic reviews: ROBIS versus AMSTAR-2

Perry, R.; Whitmarsh, A.; Leach, V.; Davies, P.

doi:10.1186/s13643-021-01819-x

Methodology
Open access
Published: 25 October 2021

R. Perry ORCID: orcid.org/0000-0001-5874-3016¹,
A. Whitmarsh¹,
V. Leach² &
…
P. Davies²,³

Systematic Reviews volume 10, Article number: 273 (2021)

Abstract

Background

AMSTAR-2 is a 16-item assessment tool to check the quality of a systematic review and establish whether the most important elements are reported. ROBIS is another assessment tool, designed to evaluate the level of bias present within a systematic review. Our objective was to compare and contrast the two tools, and to establish the inter-rater reliability and usability of each, as part of two overviews of systematic reviews. Strictly speaking, one tool assesses methodological quality (AMSTAR-2) and the other assesses risk of bias (ROBIS), but there is considerable overlap between the tools in terms of the signalling questions.

Methods

Three reviewers independently assessed 31 systematic reviews using both tools. The inter-rater reliability of all sub-sections using each instrument (AMSTAR-2 and ROBIS) was calculated using Gwet’s agreement coefficient (AC₁ for unweighted analysis and AC₂ for weighted analysis).

Results

Thirty-one systematic reviews were included. For AMSTAR-2, the median agreement for all questions was 0.61. Eight of the 16 AMSTAR-2 questions had substantial agreement or higher (> 0.61). For ROBIS, the median agreement for all questions was also 0.61. Eleven of the 24 ROBIS questions had substantial agreement or higher.

Conclusion

ROBIS is an effective tool for assessing risk of bias in systematic reviews and AMSTAR-2 is an effective tool for assessing quality. The median agreement between raters for both tools was identical (0.61). Reviews that included a meta-analysis were easier to rate with ROBIS; however, further developmental work could improve its use in reviews without a formal synthesis. AMSTAR-2 was more straightforward to use; however, more response options would be beneficial.

Background

Systematic reviews have become a fundamental part of evidence-based medicine; they are considered the highest form of evidence as they synthesise all available evidence on a given topic [1]. Many will also combine data to give an overall effect estimate using a meta-analysis. However, the quality and standard of reviews vary considerably. If this is not understood, or in some way established, the results of many reviews might be overstated. Quality assessment tools have been developed to assess such variation in standards.

One heavily cited earlier tool is the Assessment of Multiple Systematic Reviews (AMSTAR) scale [2], which has been widely used since its development in 2007. This scale was shown to be both reliable and valid [3]. However, its design attracted criticism. Burda et al. [4] argued that AMSTAR was lacking in some key constructs, in particular the confidence in the estimates of effect. It also lacked an item to assess subgroup and sensitivity analyses. Further criticisms included the treatment of foreign language papers as “grey literature” and the observation that items can often be met partially but not fully. In addition, the items were not weighted evenly and there was no overall score, which became problematic when trying to compare scores. Thus, an upgraded version (AMSTAR-2) was developed in 2017. The new version promised to simplify the response categories; align the definition of research questions with the PICO (population, intervention, control group, outcome) framework; seek justification for the review authors’ selection of different study designs (randomised and non-randomised) and included numerical rating scales for inclusion in systematic reviews; seek reasons for exclusion of studies from the review; and determine whether the review authors had made a sufficiently detailed assessment of risk of bias for the included studies and whether risk of bias was considered adequately during statistical pooling and when interpreting the results [5].

A second, novel assessment tool, Risk of Bias in Systematic Reviews (ROBIS) [6], underwent rigorous development and was published in 2016. It aims to provide a thorough and robust assessment of the level of bias within a systematic review.

Description of the assessment tools

Assessment of multiple systematic reviews (AMSTAR-2)

The main aim of AMSTAR-2 is to assess the methodological quality of a review. It is made up of 16 items in total and has simpler response categories than the original AMSTAR version. Some sections are considered by the authors to be critical domains, which can be used for determining an overall score (see Appendix, Table 12 for more information on the critical domains). AMSTAR-2 is intended for assessing reviews of effectiveness and can be applied to reviews of both randomised and non-randomised studies.

ROBIS tool

The main aim of the ROBIS tool is to evaluate the level of bias present within a systematic review. The tool is made up of three distinct phases. The first phase is optional and assesses the applicability of the review to the research question of interest. The second phase is made up of 20 items within four main domains: study eligibility criteria, identification and selection of studies, data collection and study appraisal, and synthesis and findings. This phase identifies concerns about the review conduct. Each domain has signalling questions and ends with a judgement of the level of concern for that domain (low, high or unclear). The third phase consists of three signalling questions that enable an overall risk of bias rating to be given. ROBIS has a wide application and is intended for assessing reviews of effectiveness, diagnostic test accuracy, prognosis and aetiology [6].

Previous research

Due to the novelty of both tools, there is limited available literature comparing them; however, some work has recently been published.

One review team [7, 8] compared all three tools (AMSTAR, AMSTAR-2 and ROBIS), applying them to reviews that reported both randomised and non-randomised trials. The inter-rater reliability (IRR) between four raters across 30 systematic reviews was analysed. Minor differences were found between AMSTAR-2 and ROBIS in the assessment of systematic reviews including a mix of study types. On average, the IRR was higher for AMSTAR-2 than for ROBIS. The team had assumed that scoring ROBIS would take more time in general, and it was always applied after AMSTAR-2; in fact, the mean time for scoring AMSTAR-2 was slightly higher than for ROBIS (18 vs. 16 min), with huge variation between the reviewers. They also reported that some signalling questions in ROBIS were judged to be very difficult to assess.

Aim

The overarching aim of our work is to add to the literature and make a further comparison of both assessment tools in two overviews of reviews. Our team had previously completed two overviews on complementary and alternative medicine (CAM) therapies for two hard-to-treat conditions. One overview evaluated systematic reviews of various CAM therapies for fibromyalgia (FM) [9], and the other evaluated systematic reviews of CAM therapies for infantile colic [10].

Objectives

Due to some of the challenges we had using both tools in our overview of reviews work, we planned a formal assessment of both tools by completing the following comparisons and evaluations:

1. To compare the content of the tools
2. To compare the percentage agreement (IRR)
3. To assess the usability/user experience of both tools.

Methods

Two overviews of reviews were conducted by our team [9, 10]. The first reviewed CAM for fibromyalgia and assessed the included reviews using both the original AMSTAR tool [2] and ROBIS [6]. This overview was published in 2016, prior to the development and publication of AMSTAR-2 [5]. It reported on 15 systematic reviews of CAM for fibromyalgia, published between 2003 and 2014, which assessed several CAM therapies. Eight of the reviews included a quantitative synthesis.

We subsequently completed a second overview of reviews of CAM for infantile colic, published in 2019 [10]. Here, we used the new AMSTAR-2 tool alongside ROBIS. We reported on 16 systematic reviews of CAM for colic, published between 2011 and 2018. The reviews investigated several CAM therapies, and 12 of them included a quantitative synthesis.

We later returned to the fibromyalgia review papers and reassessed them all using the AMSTAR-2 scale, for consistency. This resulted in a total of 31 reviews for comparison. The reviewers did not apply the two tools in a fixed order.

Assessment of methodological quality/bias of the included reviews

Three reviewers (RP, VL, PD) independently assessed each systematic review using both tools. Any reported meta-analyses were checked by a statistician experienced in meta-analyses (CP). The final score was agreed after discussion between the authors.

Data analysis

Gwet’s AC statistic was used to calculate inter-rater reliability (IRR) [11]. Gwet’s AC₂ is a weighted statistic that allows for “partial agreement” between ordinal categories. It was therefore used, with linear weights, to calculate IRR for the AMSTAR-2 questions with the response options “no”, “partial yes” and “yes” (questions 2, 4, 7, 8 and 9). Gwet’s AC₁ is an unweighted statistic that measures full agreement only; it was used for all other AMSTAR-2 questions.
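As a rough illustration of the statistics described above, the following Python sketch computes Gwet's agreement coefficient for several raters. It is not the Stata routine used in this study: the function names (`gwet_ac`, `identity_weights`, `linear_weights`) and the example ratings are hypothetical, and the formulas follow the standard multi-rater definition of the coefficient as we understand it. With identity weights the result corresponds to AC₁; with linear weights it corresponds to AC₂.

```python
from typing import List, Optional, Sequence


def identity_weights(q: int) -> List[List[float]]:
    """Weights for the unweighted coefficient: only exact agreement counts."""
    return [[1.0 if k == l else 0.0 for l in range(q)] for k in range(q)]


def linear_weights(q: int) -> List[List[float]]:
    """Linear weights 1 - |k - l| / (q - 1) for q ordered categories."""
    return [[1.0 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]


def gwet_ac(
    ratings: Sequence[Sequence[Optional[str]]],
    categories: Sequence[str],
    weights: Optional[List[List[float]]] = None,
) -> float:
    """Gwet's agreement coefficient for any number of raters.

    ratings: one row per review, one entry per rater (None = missing).
    categories: response options in ordinal order.
    weights: q x q agreement weights; defaults to identity weights.
    """
    q = len(categories)
    if q < 2:
        raise ValueError("need at least two response categories")
    w = weights if weights is not None else identity_weights(q)
    position = {c: k for k, c in enumerate(categories)}

    # counts[i][k] = number of raters who placed review i in category k.
    counts = []
    for row in ratings:
        c = [0] * q
        for answer in row:
            if answer is not None:
                c[position[answer]] += 1
        if sum(c) > 0:  # drop reviews that nobody rated
            counts.append(c)

    # Observed (weighted) agreement, averaged over reviews with >= 2 ratings.
    observed = []
    for c in counts:
        r = sum(c)
        if r < 2:
            continue
        weighted = [sum(w[k][l] * c[l] for l in range(q)) for k in range(q)]
        pairs = sum(c[k] * (weighted[k] - 1.0) for k in range(q))
        observed.append(pairs / (r * (r - 1)))
    if not observed:
        raise ValueError("need at least one review rated by two or more raters")
    p_a = sum(observed) / len(observed)

    # Chance agreement from the average share of ratings in each category.
    n = len(counts)
    pi = [sum(c[k] / sum(c) for c in counts) / n for k in range(q)]
    total_weight = sum(sum(row) for row in w)
    p_e = total_weight / (q * (q - 1)) * sum(p * (1.0 - p) for p in pi)

    return (p_a - p_e) / (1.0 - p_e)


# Hypothetical ratings from three raters on four reviews (not study data).
yes_no = [
    ["yes", "yes", "yes"],
    ["yes", "yes", "no"],
    ["no", "no", "no"],
    ["yes", "yes", "yes"],
]
ac1 = gwet_ac(yes_no, ["no", "yes"])

three_level = [
    ["yes", "yes", "partial yes"],
    ["no", "no", "no"],
    ["yes", "yes", "yes"],
    ["partial yes", "partial yes", "yes"],
]
ac2 = gwet_ac(three_level, ["no", "partial yes", "yes"], linear_weights(3))
```

In this toy example, the weighted coefficient for the three-level ratings (about 0.65) is higher than the unweighted coefficient computed on the same ratings (about 0.52), because disagreements between adjacent categories earn partial credit.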

All signalling questions for ROBIS were analysed using Gwet’s AC₂ with linear weights, with “no”, “probably no”, “probably yes” and “yes” recoded as 1–4. Ratings of “no information” were treated as missing. Gwet’s AC₁ was used for ROBIS domains. Agreement for AMSTAR-2 and ROBIS was classified as “poor” (≤ 0.00), “slight” (0.01–0.20), “fair” (0.21–0.40), “moderate” (0.41–0.60), “substantial” (0.61–0.80), and “almost perfect” (0.81–1.00), following accepted criteria [12]. All analyses were completed using Stata 16 (StataCorp. 2019; Stata Statistical Software).
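The recoding and classification steps above can be sketched in Python as follows. This is an illustration only, not the code used in the analysis; the names `recode_robis` and `classify_agreement` are hypothetical. Because the published bands are stated to two decimal places, the sketch assumes that a coefficient is rounded to two decimals before it is assigned a band.

```python
from typing import Optional

# Ordinal codes for ROBIS signalling-question answers, as described above.
ROBIS_CODES = {"no": 1, "probably no": 2, "probably yes": 3, "yes": 4}


def recode_robis(answer: str) -> Optional[int]:
    """Map a ROBIS answer to 1-4; "no information" becomes None (missing)."""
    key = answer.strip().lower()
    if key == "no information":
        return None
    return ROBIS_CODES[key]


# Upper bound of each agreement band, checked in ascending order.
AGREEMENT_BANDS = [
    (0.00, "poor"),
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def classify_agreement(coefficient: float) -> str:
    """Label a coefficient after rounding to two decimals (an assumption)."""
    value = round(coefficient, 2)
    if value > 1.0:
        raise ValueError("agreement coefficients cannot exceed 1")
    for upper, label in AGREEMENT_BANDS:
        if value <= upper:
            return label
    raise AssertionError("unreachable: all values <= 1.0 match a band")
```

For example, `classify_agreement(0.61)` returns `"substantial"`, which matches the label given to the median agreement reported for both tools.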

Results

Our first objective was to compare the content of the tools (see Table 1), identifying any overlaps and discrepancies between the two. Overall, we found considerable overlap in the signalling questions. However, ROBIS does not assess whether there is a comprehensive list of studies (both included and excluded) or whether any conflicts of interest were declared (both at the individual trial level and for the reviews), as these are considered issues of methodological quality rather than bias. AMSTAR-2 does assess possible conflicts of interest; ROBIS omits them despite their being a potential risk of bias. Conversely, the section on synthesis is given more in-depth consideration in the ROBIS tool.

Table 1 A comparison of the content of the two tools (AMSTAR-2 and ROBIS)
