The quality of clinical practice guidelines for preoperative care using the AGREE II instrument: a systematic review

Background Our aim was to summarize and compare relevant recommendations from evidence-based clinical practice guidelines (EB-CPGs).
Methods Systematic review of clinical practice guidelines (CPGs). Data sources: PubMed, EMBase, Cochrane Library, LILACS, Tripdatabase, and additional sources. In July 2017, we searched electronic databases for CPGs published in the last 10 years, without language restrictions, and also searched specific CPG sources and reference lists and consulted experts. Pairs of independent reviewers selected EB-CPGs and rated their methodological quality using the AGREE-II instrument. We summarized the recommendations, their supporting evidence, and the strength of recommendations according to the GRADE methodology.
Results We included 16 EB-CPGs out of 2262 references identified. Only nine of them had searches within the last 5 years and seven used GRADE. The median (25th–75th percentile) AGREE-II score for rigor of development was 49% (35–76%), and the domain “applicability” obtained the worst score, 16% (9–31%). We summarized 31 risk stratification recommendations, 21.6% of which were supported by high/moderate quality of evidence (41% of them were strong recommendations), and 16 therapeutic/preventive recommendations, 59% of which were supported by high/moderate quality of evidence (75.7% strong). We found inconsistencies in the ratings of evidence level. “Guidelines’ applicability” and “monitoring” were the most deficient domains. Only half of the EB-CPGs were updated in the past 5 years.
Conclusions We present many strong recommendations that are ready to be considered for implementation as well as others for practices that should be discontinued, and we reveal opportunities to improve guidelines’ quality.


Implication statement
We identified many strong risk stratification and therapeutic recommendations that can be implemented, as well as practices commonly followed by anesthesiologists in their daily work that should be discontinued.
Finally, we described opportunities to improve guidelines' quality.

Background
An estimated 313 million major surgical procedures are undertaken every year worldwide [1]. Low- and high-income countries show an estimated rate of major surgery of 295 and 11,110 procedures per 100,000 population per year, respectively [1], an enormous disparity relative to the recommended minimum threshold of 5000 operations per 100,000 people that is associated with desirable health outcomes. At current rates of surgical and population growth, 6.2 billion people (73% of the world's population) will be living in countries below the minimum recommended rate of surgical care in 2035 [2].
However, the crude number of patients who receive surgery is increasing, as are their mean age and the occurrence of comorbidities [3]. Because of the inherent risks of death and complications, surgical safety is a significant public-health concern. As examples, 2.4% (95% CI 2.1 to 2.6%) of patients undergoing surgery will suffer major cardiac complications [4], and 5% (95% CI 4.5 to 5.5%) will have a perioperative myocardial infarction [5]. In this context, providing adequate preoperative care is truly mandatory. The first routine preoperative tests started 50 years ago with only a handful of actions and have nowadays expanded to a large set of risk stratification or diagnostic tests to define the preoperative clinical risk categories, as well as many preventive interventions. Lately, efforts to standardize care have been made, especially through the implementation of clinical practice guidelines (CPGs) with recommendations useful both for health providers and patients [6]. These recommendations usually consider all risks and benefits of a risk stratification or therapeutic procedure to be undertaken, sometimes even including algorithm pathways. The potential benefits, like the safety of care and standardization of procedures, are only as good as the quality of the practice guidelines implemented. Unfortunately, CPGs not supported by the best evidence might promote inappropriate preoperative testing behaviors that are detrimental both to patients and health systems. For example, false positive results from inappropriate testing may delay or prevent surgery, thus creating unnecessary stress or harm to patients.
Multiple medical societies and organizations around the world have published preoperative evaluation CPGs; however, many of them are not based on solid scientific evidence. Additionally, not all of them use methods like the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach, which is one of the soundest systems for rating the quality of a body of evidence in systematic reviews and CPGs [7]. GRADE offers a transparent and structured process for developing and presenting evidence summaries and making recommendations [7].
A systematic review found no evidence from high quality studies to support routine preoperative tests in healthy adults undergoing non-cardiac surgery [8]. Risk stratification testing based on the problems identified during the preoperative assessment seems justified, but there is still little evidence supporting it [8]. In this way, the implementation of EB-CPGs may lead to a reduction in the number of unnecessary preoperative tests, without affecting patient safety [9][10][11][12][13]. The first health technology assessment (HTA) on the topic published in 1989 by the Swedish Council on Technology Assessment in Health Care (SBU) [14], showed healthcare quality improvements and cost savings using an evidence-based approach. The findings of this report have been confirmed by nine other subsequent studies from five countries, collected in another HTA document [15].
For this reason, through an overview of clinical practice guidelines, we aimed to identify and synthesize EB-CPGs on preoperative care that were published worldwide in the last 10 years, in order to help prioritization processes. We also rated the CPGs' quality and summarized their recommendations, describing the level of evidence and the strength of recommendations according to the GRADE approach [7].

Methods
Study design
We performed a systematic review (overview) of EB-CPGs following Cochrane methods [16] and the Argentinean Academy of Medicine's Guide for the adaptation of CPGs for searching and selecting CPGs [17]. For reporting, we followed the PRISMA statement [18] and a specific guideline for overviews of systematic reviews (Online supplemental material. Appendix 1. PRISMA checklist) [19]. The protocol is available in Spanish including a summary in English. We aimed to identify the most reliable CPGs; therefore, we used a previously reported definition of EB-CPGs [20].
The inclusion eligibility criteria (all criteria required) were as follows:
a) CPGs of perioperative care published in the last 10 years, including those recommendations potentially applicable to any kind of surgery, not site- or condition-specific.
b) Provides a list of the CPG development panel members including their expertise or qualifications.
c) Uses standard methods such as Cochrane methods, Equator Network-proposed checklists, or any sufficiently detailed method allowing reproducibility of the identification, data collection, and study risk of bias assessment.
d) Reports the level of evidence that supports each recommendation.
Exclusion criteria (any criterion required) were as follows:
a) Guidelines limited to single specific conditions such as obesity, renal disorders, or pheochromocytoma.
b) Guidelines limited to single specific body part surgeries such as neurosurgery or colorectal surgery.
c) Guidelines including some recommendations but whose entire focus was clearly not preoperative care.

Selection and data extraction
Pairs of reviewers independently selected the retrieved articles (by title and abstract first, and then by full text of eligible studies), using a software designed to facilitate the initial phases of systematic reviews called Early Review Organizing Software (EROS) [21].
One reviewer extracted the data while the other verified them in a previously piloted form (which included variables such as search date, objective, setting, target population, target professionals, recommendations, classification system of the quality of evidence and of the strength of the recommendation, quality of evidence by recommendation, and the strength of each recommendation) and preoperative clinical risk criteria and categories (see Online supplemental material 3. Preoperative clinical risk criteria and categories). Discrepancies were resolved by a consensus of the whole team.

Guideline quality appraisal and classification
Independent pairs of reviewers rated each EB-CPG using the AGREE-II tool, which consists of 23 key items organized in six domains (scope and purpose, stakeholder involvement, rigor of development, clarity of presentation, applicability, and editorial independence) and two overall evaluation items [22]. Each item was graded on a 7-point scale: from 1, meaning "Strongly disagree," to 7, meaning "Strongly agree." The total was presented as a percentage of the maximum possible score for that domain (from 0 to 100%). We present the AGREE-II domain scores expressed as a percentage across CPGs (Online Supplemental material. Appendix 5 with the explanation of each item of the AGREE-II domains). Discrepancies were resolved by a consensus of the whole team. We also categorized each EB-CPG according to the extent to which it successfully addressed AGREE-II criteria as follows [17]:
Strongly recommended (++), CPG whose standardized score exceeds 60% in ≥ 4 AGREE-II domains. The scores of the remaining domains must be ≥ 30%, and the score for the domain rigor of development must be > 60%.
Recommended (+), CPG whose standardized score ranges from 30 to 60% in ≥ 4 AGREE-II domains. The rigor of development score must be between 30 and 60%.
Not recommended (-), CPG whose standardized score is < 30% in ≥ 4 AGREE-II domains or whose rigor of development score is less than 30%.
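The scoring and categorization steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the review: the function names are hypothetical, the scaled score follows the min-max scaling described in the AGREE II user's manual (the text above only says scores were expressed as a percentage), and the label "unclassified" for score profiles that match none of the three categories is our assumption.

```python
from typing import Dict, List

RIGOR = "rigor of development"  # domain name as used in the text


def scaled_domain_score(ratings: List[List[int]]) -> float:
    """Scaled AGREE-II domain score in percent.

    ratings[a][i] is appraiser a's rating (1-7) of item i in one domain.
    Assumes the AGREE II manual scaling:
    (obtained - minimum possible) / (maximum possible - minimum possible).
    """
    n_appraisers = len(ratings)
    n_items = len(ratings[0])
    obtained = sum(sum(row) for row in ratings)
    minimum = 1 * n_items * n_appraisers
    maximum = 7 * n_items * n_appraisers
    return 100.0 * (obtained - minimum) / (maximum - minimum)


def classify_guideline(scores: Dict[str, float]) -> str:
    """Map the six domain scores (percent) to '++', '+', or '-'.

    Hypothetical helper; returns 'unclassified' (our assumption) when a
    profile fits none of the three categories as worded in the text.
    """
    rigor = scores[RIGOR]
    above_60 = sum(1 for s in scores.values() if s > 60)
    mid_range = sum(1 for s in scores.values() if 30 <= s <= 60)
    below_30 = sum(1 for s in scores.values() if s < 30)

    if below_30 >= 4 or rigor < 30:
        return "-"   # not recommended
    if above_60 >= 4 and rigor > 60 and all(s >= 30 for s in scores.values()):
        return "++"  # strongly recommended
    if mid_range >= 4 and 30 <= rigor <= 60:
        return "+"   # recommended
    return "unclassified"
```

For instance, two appraisers who both give 4 to each of two items yield a scaled score of 50%. Note that the three categories as worded do not cover every possible score profile, which is why the sketch needs an explicit fallback.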
To deal with discrepancies between the direction and strength of the CPG recommendations, we applied a rule to decide "doing or not doing the recommendation": Yes (Y)-no (N) to doing it: ≥ 2/3 recommendations in the same direction (for/against) and ≥ 2/3 strong recommendations.
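As an illustration, the two-thirds rule can be written as a small Python function. This is a sketch under stated assumptions rather than the procedure actually run in the review: the function name is hypothetical, both two-thirds thresholds are computed over all guideline recommendations on the same topic, and the value "undetermined" for cases that meet neither threshold is our own label.

```python
from fractions import Fraction
from typing import List, Tuple

TWO_THIRDS = Fraction(2, 3)


def decide(recommendations: List[Tuple[str, str]]) -> str:
    """Apply the 'doing or not doing' rule to one topic.

    Each entry is (direction, strength) from one guideline, with
    direction in {'for', 'against'} and strength in {'strong', 'weak'}.
    Returns 'Y' (do it), 'N' (do not do it), or 'undetermined'
    (hypothetical label for cases the rule does not settle).
    """
    n = len(recommendations)
    if n == 0:
        return "undetermined"
    for direction, strength in recommendations:
        if direction not in ("for", "against") or strength not in ("strong", "weak"):
            raise ValueError(f"unexpected entry: {(direction, strength)}")

    n_for = sum(1 for d, _ in recommendations if d == "for")
    n_strong = sum(1 for _, s in recommendations if s == "strong")

    # Exact fractions avoid floating-point issues at the 2/3 boundary.
    if Fraction(n_strong, n) < TWO_THIRDS:
        return "undetermined"
    if Fraction(n_for, n) >= TWO_THIRDS:
        return "Y"
    if Fraction(n - n_for, n) >= TWO_THIRDS:
        return "N"
    return "undetermined"
```

With three guidelines, two strong "for" recommendations and one weak "against" recommendation meet both thresholds exactly, so the sketch returns "Y".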

Synthesis of results
We conducted a tabular synthesis of the recommendations to describe their strength and the level of evidence supporting them according to the current GRADE methodology [7], transforming the original grading system when necessary to compare and integrate the results for each recommendation in a unified manner. Simply put, the GRADE quality of evidence can be HIGH, MODERATE, LOW, or VERY LOW. Randomized clinical trials (RCTs) start from HIGH quality of evidence, and non-randomized studies start from LOW quality of evidence. Five criteria can downgrade the quality by one or two levels: methodological quality (study limitations), inconsistency of results, indirectness, imprecision, and publication bias. In cases where there are no methodological limitations, three criteria can upgrade the quality by one or two levels: magnitude of effect, dose-response effect, and confounders underestimating the effect. To map the level of evidence to a common grading system (GRADE), we reassessed all evidence when the translation was not obvious. Pairs of reviewers independently extracted or reassessed the level of evidence, and discrepancies were resolved by a consensus of the whole team. Regarding the strength of a recommendation, which is defined as the extent to which one can be confident that the desirable consequences of an intervention outweigh its undesirable consequences, GRADE uses four simple categories. The categories combine "strong" or "weak" with "for" or "against" a certain risk stratification or therapeutic approach. We presented descriptive statistics as percentages or means with standard deviations.
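The GRADE logic summarized above can be sketched as follows. This is a deliberately simplified illustration, not the reassessment procedure used by the reviewers: the function and dictionary names are hypothetical, upgrades are applied only when no downgrade for study limitations was recorded (our reading of "no methodological limitations"), and the result is clamped to the four GRADE levels.

```python
from typing import Dict, Optional

LEVELS = ["VERY LOW", "LOW", "MODERATE", "HIGH"]
DOWNGRADE_CRITERIA = {
    "study limitations", "inconsistency", "indirectness",
    "imprecision", "publication bias",
}
UPGRADE_CRITERIA = {
    "magnitude of effect", "dose-response",
    "confounders underestimating effect",
}


def _total_steps(steps: Dict[str, int], allowed: set) -> int:
    """Sum the levels moved, checking criterion names and the 0-2 range."""
    total = 0
    for name, n in steps.items():
        if name not in allowed:
            raise ValueError(f"unknown criterion: {name}")
        if n not in (0, 1, 2):
            raise ValueError(f"{name}: expected 0, 1, or 2 levels, got {n}")
        total += n
    return total


def grade_quality(randomized: bool,
                  downgrades: Optional[Dict[str, int]] = None,
                  upgrades: Optional[Dict[str, int]] = None) -> str:
    """Return the GRADE quality of evidence for one body of evidence."""
    downgrades = downgrades or {}
    upgrades = upgrades or {}
    down = _total_steps(downgrades, DOWNGRADE_CRITERIA)
    up = _total_steps(upgrades, UPGRADE_CRITERIA)

    # RCTs start HIGH; non-randomized studies start LOW.
    level = LEVELS.index("HIGH" if randomized else "LOW") - down
    # Assumption: upgrading is considered only without study limitations.
    if downgrades.get("study limitations", 0) == 0:
        level += up
    # Clamp to the available levels.
    return LEVELS[max(0, min(level, len(LEVELS) - 1))]
```

For example, an RCT body of evidence downgraded one level for imprecision ends at MODERATE, while observational evidence with a very large effect and no study limitations can be upgraded from LOW by up to two levels.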

Search results
The search strategy identified 2262 references after the elimination of duplicates. After the selection process, we identified 23 references corresponding to 16 EB-CPGs published in the last 10 years (Fig. 1 flowchart). Two references were examined in depth and eventually excluded since they only transcribed pre-existing CPGs already included in our selection [23,24]. Table 1 provides a general description of the included EB-CPGs. Seven were developed in America (4 in the USA, 1 in Argentina, 1 in Brazil, and 1 in Canada), seven in Europe (2 continental, 2 from Italy, 1 in Belgium, 1 in Scotland, and 1 in the UK), and two were global collaborations. Only 8/16 (50%) of the EB-CPGs that reported their search date conducted their searches within the last 5 years. Out of the EB-CPGs, ten addressed multiple practices and five focused on a single practice (one on perioperative fasting and the remaining four on antimicrobial prophylaxis). Furthermore, four covered risk stratification recommendations, five covered therapeutic or preventive interventions, and six considered both aspects.

Guideline characteristics
Each guideline reports the levels of evidence and the recommendation grading systems used by its authors. The grading systems used were GRADE (7 EB-CPGs) and SIGN [41] (2 EB-CPGs), while the others used their own or modified systems (Online supplemental material Appendix 4).
We presented the scores as a percentage for each AGREE-II domain. The domains with the best median score (25th-75th percentile) were editorial independence 91% (81-100), clarity of presentation 85% (69-97), and scope and purpose 80% (65-89). Stakeholder involvement 53% (44-62) and rigor of development 49% (35-76) had an intermediate performance, while "applicability" was the most deficient, 16% (9-31). Regarding the guideline recommendation category, 6/16 (37%) were classified as strongly recommended and the rest as recommended (Online supplemental material Appendix 5). An overall AGREE-II score is also presented in Table 1.
Table 2 shows the risk stratification recommendations, presenting the level of evidence and recommendation strength of the EB-CPG with the highest overall and methodological rigor AGREE-II score. The 31 risk stratification recommendations included 102 specific recommendations. We found discrepancies among EB-CPGs in 10 out of 102 (10%) risk stratification recommendations. After applying the rule "doing or not doing the recommendation," 31 (60 specific) are "doing" and 31 (39 specific) are "not doing" diagnostic evaluations (see Online supplemental material Appendix 6. Table 1 GRADE level of evidence and strength of recommendations by CPG; Table 2 Recommended risk stratification evaluations only; and Table 3 Not recommended risk stratification evaluations, to facilitate finding relevant recommendations from different points of access).
Table 3 shows the therapeutic/preventive recommendations using the same presentation criterion as Table 2. The 16 therapeutic/preventive recommendations included 78 specific recommendations. The presented level of evidence and recommendation strength come from the EB-CPG with the highest overall and methodological rigor AGREE-II score.
The level of evidence and recommendation strength by EB-CPG are presented in the Online supplemental material 6.a. We found discrepancies among EB-CPGs in 3 out of 78 (4%) of the therapeutic/preventive care recommendations. After applying the direction and strength of recommendations rule to decide doing or not doing the recommendation, 15 (55 specific) therapeutic/preventive interventions were recommended and 10 (23 specific) were not recommended (see Online supplemental material Appendix 8, Online supplemental material Appendix 6 - Table 1 GRADE level of evidence and strength of recommendations by CPG; Table 2 Recommended therapeutic/preventive care only, and Table 3 Not recommended therapeutic/preventive care).

Discussion
To the best of our knowledge, the present study is the first overview of guidelines encompassing a broad spectrum of preoperative care recommendations.
We observed a higher level of evidence supporting therapeutic than risk stratification recommendations (high/moderate quality of evidence 59 vs 22%, respectively). This is not surprising because cross-sectional or cohort studies can provide high-quality evidence for test accuracy but only indirect evidence for patient-important outcomes. Furthermore, high levels of heterogeneity are almost the rule in risk stratification tests, further downgrading the level of evidence because of inconsistency [42][43][44].
The strength of a recommendation is defined as the extent to which one can be confident that the desirable effects of an intervention outweigh its undesirable ones. We found only 12/53 (23%) "strong" risk stratification recommendation statements (for and against) based on high/moderate level of evidence and 43/78 (55%) for therapeutic/preventive care recommendations. Although higher proportions of high-quality supporting evidence would be desirable, guideline panels must consider additional factors. In order to assess competing management alternatives, GRADE proposes to consider four domains: estimates of effect for desirable and undesirable outcomes, confidence in the estimates of effect, values and preferences, and resource use. Guideline panels must integrate these factors to make a strong or weak recommendation for or against an intervention [45]. After our search date, the updated guideline from the European Society of Anesthesiology (ESA) was published, using GRADE and searching until May 2016 [46]. This CPG addressed two main clinical questions in order to help anesthesiologists in their daily practice: (1) how should a pre-operative consultation clinic be organized and (2) how should pre-operative assessment of a patient be performed. As in our present work, this guideline covered specific conditions that might adversely interfere with anesthesia and surgery, including cardiovascular disease, respiratory disease, smoking, obstructive sleep apnea syndrome, renal disease, diabetes, obesity, coagulation disorders, anemia and pre-operative blood conservation strategies, the geriatric patient, alcohol and drug misuse and addiction, and currently also neuromuscular disease. We present preoperative clinical risk criteria and categories that were complemented with established risk factors for postoperative pulmonary complications (see Online supplemental material Appendix 3) [46].
The 2018 ESA guidelines also provided independent predictors for difficult mask ventilation, a topic not specifically addressed in previous CPGs [46].
As described, RCTs are still few, and therefore many preoperative interventions rely to a large extent on expert opinion, which in turn needs to be adapted to the reality of each nation's healthcare system. This large evidence gap should be addressed by researchers in the field in order to improve the certainty of evidence-based recommendations.
Studies on prognostic or diagnostic accuracy tests, including scoring of severity of illness, usually provide low quality of evidence, even when scores such as ASA-PS, RCRI, NSQIP-MICA, POSSUM, and others have been extensively validated [46].
Our updated overview of EB-CPGs, conducted under the rigorous Cochrane methods, may be a useful resource for the professionals involved in preoperative care to consult during decision-making. We present many strong recommendations with sufficient evidence to be routinely implemented in clinical practice. However, any decision should be taken considering local contextual factors.
In addition, cost reductions were identified at the clinical level as well as at the health system level in other studies [10][11][12][47]. Two guidelines also suggested strong cost benefits both for patients and society [48,49]. Another study showed that the application of EB-CPGs significantly improved the efficiency of the preoperative evaluation without negatively affecting the quality of care [50]. These findings were consistent across different settings, such as a hospital in Barbados where the introduction of guidelines reduced the burden of presurgical tests and costs without hampering patient safety [51]. In the same way, a recent study in a hospital in New Jersey, USA, found that approximately 25% of tests were not justifiable and could thus be eliminated by complying with NICE/ASA guidelines. The evaluation of applying these changes in practice showed significant savings without altering clinical outcomes [52].
Recommendations can be adopted, modified, or even not implemented, depending on institutional or national requirements and legislation and local availability of devices, drugs, and resources [53]. Decision-makers at the national and subnational levels should be provided with the information they need to apply the evidence and recommendations in their setting [54]. As a limitation, including only EB-CPGs could have resulted in omitting some information, but we prioritized summarizing the highest quality evidence. Our exclusion criteria for CPGs, which limited the scope, may represent an additional caveat. Because our inclusion/exclusion criteria focused on general recommendations, they provided a lower amount of evidence for certain practices than if we had also included recommendations for single conditions, specific prophylaxis, or single body part surgeries. Such an approach, however, would have compromised the feasibility of our systematic review due to the enormous number of such guidelines. Nonetheless, we provided detailed lists with numerous recommendations and reflected the guidelines' discrepancies, which suggests that this is unlikely to have been a major limitation.
Our study will be useful for future preoperative care guideline developers or adapters. Consistent with other overviews of clinical guidelines, the domain that received the lowest score was the "applicability" domain of the AGREE-II tool. Similarly, the heterogeneity of evidence and strength of recommendation grading systems in this overview echoes that of other clinical guideline overviews [55][56][57]. Low scores in the applicability domain result in inadequate adoption rates of guidelines, particularly for preoperative care where "defensive medicine" (i.e., prescribing more tests than necessary just to prevent litigation) is very common. We also found some discrepancies, mainly in the evidence level, in recommendations that did not always discriminate between universal interventions and those suitable only for special target groups or specific surgeries.
Guideline developers should ensure rigorous methodological processes and should also make recommendations that are formulated and disseminated in ways that facilitate understanding and application by end-users. For example, the DECIDE Collaboration conducted research and developed tools to improve implementation of evidence-based recommendations by different target audiences, including providers, policy makers, and the public [58]. In that sense, GRADE provides guideline developers with a comprehensive and transparent framework for grading quality of evidence and of strength of recommendations.
Our overview identified several controversies, evidence gaps, and issues regarding preoperative care guidelines that warrant future research and reveal opportunities to improve the guidelines quality.
For example, we found many discrepancies in risk stratification recommendations such as electrocardiography and chest X-ray, polysomnography, assessment of left ventricular function, stress testing, and coronary angiography in certain populations. We found fewer discrepancies for therapeutic/preventive care, mainly concerning antimicrobial prophylaxis and the use of beta-blockers (these discrepancies can be found in the Online supplemental material Appendix 6 and Appendix 8).
From the perspective of anesthesiology practice, many questions remain unanswered. For example, in the patient with significant medical, surgical, or obstetrical history, it would be useful to understand how early the pre-anesthetic evaluation should be performed, considering the time required to optimize the patient's status. There are also uncertainties regarding the recommendation of fasting for solids in adults and children since many factors can delay gastric emptying, and no fixed rules apply. Fasting should be individualized in some patients and depend on the characteristics of the fat intake. Regarding prokinetics and antacids, patients' comorbidities like esophageal pathology, bariatric surgery history, or obesity should be considered in the decision, but there is no formal recommendation. In the same way, suspending or not suspending aspirin should be evaluated according to the patient's history and the bleeding risk of the surgery, since bleeding could be catastrophic in neurosurgery, spinal surgery, or ophthalmologic surgery. It is also striking that informed consent only has a "weak for" recommendation from a single CPG, given the substantial history of litigation due to lack of consent.
We encourage guideline developers to adopt GRADE and AGREE-II tools to elaborate future sound preoperative care guidelines [7,22].
The huge amount of resources involved in preoperative care warrants high-quality nationwide EB-CPGs supported by all relevant stakeholders to improve the chances of a successful implementation. This probably includes the involvement of the Ministry of Health, scientific societies, and consumers working together through a formal process of implementation and monitoring [17,59].
Although standardization of preoperative care may be desirable, differences in recommendations could reflect differences in contextual factors such as organizational or financial arrangements, legal framework, varied values and preferences, and the acceptability and feasibility of using different interventions. Research exploring reasons for conflicting recommendations in different countries or settings could also drive overall improvements in guideline quality. The key findings are described in Table 4.
In conclusion, we found significant heterogeneity in guidelines' quality and rating systems, as well as deficiencies in several guideline quality domains. These reveal opportunities for quality improvement that deserve careful consideration by future guideline developers. Nevertheless, we present many strong recommendations that are ready to be considered for implementation or discontinuation.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s13643-020-01404-8.

Table 4 Key findings
• The included EB-CPGs showed significant heterogeneity in both evidence and recommendation grading systems; GRADE was the most commonly used.
• About half of the included EB-CPGs were updated in the last 5 years, and one third of them were rated as strongly recommended based on their high AGREE-II performance.
• They were generally deficient in applicability and in providing monitoring tools.
• We summarized 31 risk stratification and 16 therapeutic/preventive recommendations.
• We found 93 strong for and 46 strong against recommendations, all of which were ready to be considered for implementation or discontinuation, respectively.
• The level of evidence and strength of recommendation were higher for therapeutic/preventive recommendations than for risk stratification ones.
• We only found 12/53 (23%) strong risk stratification recommendations based on high/moderate level of evidence and 43/78 (55%) for therapeutic/preventive care recommendations.