A surveillance system to assess the need for updating systematic reviews

Background: Systematic reviews (SRs) can become outdated as new evidence emerges over time. Organizations that produce SRs need a surveillance method to determine when reviews are likely to require updating. This report describes the development and initial results of a surveillance system to assess SRs produced by the Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Center (EPC) Program.
Methods: Twenty-four SRs were assessed for the presence of signals triggering the need for updating, using existing methods that incorporate limited literature searches, expert opinion, and quantitative methods. The system was designed to begin surveillance six months after the release of the original review, and every six months thereafter for any review not classified as being a high priority for updating. The outcome of each round of surveillance was a classification of the SR as being low, medium, or high priority for updating.
Results: Twenty-four SRs underwent surveillance at least once, and ten underwent surveillance a second time during the 18 months of the program. Two SRs were classified as high, five as medium, and 17 as low priority for updating. The time lapse between the searches conducted for the original reports and the updated searches (search time lapse, STL) ranged from 11 months to 62 months. The STLs for the two high priority reports were 29 months and 54 months; those for medium priority reports ranged from 19 to 62 months; and those for low priority reports ranged from 11 to 33 months. Neither the STL nor the number of new relevant articles was perfectly associated with a signal for updating. Challenges of implementing the surveillance system included determining what constituted the actual conclusions of an SR that required assessing, and sometimes poor response rates of experts.
Conclusion: In this system of regular surveillance of 24 systematic reviews on a variety of clinical interventions produced by a leading organization, about 70% of reviews were determined to have a low priority for updating. The evidence suggests that a yearly surveillance interval may be more appropriate than the six-month interval used in this project.


Background
Systematic reviews (SRs) on the effectiveness and safety of various health interventions are the basis for clinical practice guidelines, public and corporate policy, and clinical and consumer decision-making. These SRs provide systematically searched, collected, evaluated, and synthesized scientific evidence to objectively compare the effectiveness, benefits, and safety of different health interventions. The production of SRs is based on standardized, structured, and explicit methodological guidance. The SRs endeavor to focus on patient-relevant outcomes (for example, mortality, pain, quality of life, functional status, myocardial infarction) in addition to relevant intermediate surrogate outcome measures (for example, cholesterol levels, serum glucose levels, red blood cell count) [1].
Systematic reviews may be conducted by independent groups of researchers or by researchers associated with large organizations such as the Cochrane Collaboration; the United States Agency for Healthcare Research and Quality (AHRQ), which administers a group of Evidence-based Practice Centers (EPC) throughout North America; and the National Institute for Health and Clinical Excellence (NICE) in the UK [2]. A primary responsibility of these organizations is the conduct of systematic reviews, the results of which are often posted on their websites.
The inevitable, and rapid, accumulation of new research findings has raised concern among these organizations about how best to identify which reviews may be out of date and whether to sponsor an update or simply remove the outdated review from their websites. To date, organizations and initiatives (for example, Cochrane Collaboration, Drug Effectiveness Review Project (DERP)) have relied on time-based (for example, annual, biennial) periodic updating policies that have proven to be problematic in terms of feasibility and efficiency [3][4][5]. However, several lines of evidence demonstrate that reviews become obsolete at different rates, suggesting that a system of regular surveillance might be a more effective way of identifying potentially out-of-date reviews. In 2006, the DERP implemented a strategy for assessing the need for updating systematic reviews of comparative effectiveness and safety of drug interventions evaluated in controlled clinical trials [6]. The DERP's stakeholders need to make coverage decisions for new drugs, and therefore the appearance of a new drug is a strong signal for an update. However, not all SR users (for example, guideline developers) might consider a new drug within an established class (such as a new statin or angiotensin receptor antagonist) as an indication of the need for an update. Furthermore, SRs may deal with non-pharmacologic interventions (for example, diagnostic screening) and include observational studies.
AHRQ supported a pilot study comparing different methods to assess signals for the need to update SRs and another study to assess, for the need to update, an initial set of SRs that were considered Comparative Effectiveness Reviews (CERs). CERs are systematic reviews that aim to compare the benefits and harms of a range of options rather than only answering a narrow question on the safety and effectiveness of a single therapy [2]. Based on these pilot studies, AHRQ supported the development of a surveillance system for regularly monitoring AHRQ's portfolio of SRs. This article presents the results of the surveillance system covering June 2011 to November 2012.

Methods
The surveillance system: summary overview
Two EPCs (RAND, University of Ottawa) participated in the development of the surveillance system; a third EPC (ECRI) assisted in obtaining safety alerts. The RAND and Ottawa EPCs had independently developed methods to assess SRs for the need to update [7,8]; a formal comparison of the two showed they produced similar results [9]. In developing and implementing the surveillance system, we operationalized a proposal made in our earlier CER surveillance report for what such a system would look like (see Figure 1). This article describes the surveillance assessment of 24 consecutive SRs conducted for the AHRQ Effective Health Care's Comparative Effectiveness Review program.
The surveillance system was designed to conduct an assessment of an SR six months after its release and every six months thereafter until the assessment identified signals sufficient to classify it as 'high priority' for an update. Briefly, six months after release of an SR, we conducted abbreviated literature searches, using the strategy employed in the original SR but limited, with a few exceptions, to five general medicine journals and approximately five specialty journals specific to the topic of the SR. Newly identified evidence relevant to the key questions and the original conclusions was abstracted, and pre-specified criteria were used to detect the presence of qualitative and/or quantitative signals for updating (EPC SRs are organized around a set of key questions, each of which might have multiple parts, resulting in the need for multiple conclusions) [7]. The method also incorporates expert opinion regarding the validity or currency of conclusions reached in the SR and government safety alerts relevant to the SR [8]. Based on a combination of the weight of the evidence, signals, and expert opinion, a determination was made regarding the need to update each conclusion for each key question, with the expectation that a change in conclusion may yield a change in clinical practice. That is, each key question (KQ)-specific conclusion within an SR was categorized as up-to-date, possibly out-of-date, probably out-of-date, or out-of-date [8]. Finally, based on 1) the proportion of key questions whose conclusions were determined to require updating or the urgency to update a particular set of conclusions, and 2) the extent of outdatedness, a global priority for updating the full report was assigned (high, medium, or low), and the results of the process were summarized in a brief report. SRs assigned a low or medium priority for updating were re-assessed six months later. Reports assigned a high priority for updating were not reassessed.
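The surveillance cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program's actual tooling: the function names `add_months` and `next_assessment`, the string labels for priority, and the end-of-month clamping are all hypothetical choices made for the example.

```python
import calendar
from datetime import date
from typing import Optional

# Interval used in this project; the Discussion suggests 12 may be more efficient.
SURVEILLANCE_INTERVAL_MONTHS = 6


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the end of the month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_assessment(last_event: date, last_priority: Optional[str]) -> Optional[date]:
    """Return the date of the next surveillance round.

    last_event is the release date (with last_priority=None) or the date the
    most recent assessment was completed. A 'high' priority ends surveillance,
    because the SR is referred for a decision on updating.
    """
    if last_priority == "high":
        return None
    return add_months(last_event, SURVEILLANCE_INTERVAL_MONTHS)


# First round: six months after a (hypothetical) release date.
first_round = next_assessment(date(2011, 6, 15), None)
# A low or medium priority schedules another round six months later.
second_round = next_assessment(first_round, "low")
```

Changing `SURVEILLANCE_INTERVAL_MONTHS` to 12 would model the yearly interval that the results of this project point toward.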
The decision to update or withdraw the report is made by AHRQ, which considers the availability of resources and other factors when making a final decision. Detailed methods of the surveillance process are presented in the following sections.

Abbreviated search, study selection, and data extraction
The ascertainment of updating signals relied on qualitative and/or quantitative criteria developed originally for the Ottawa method [7] and expert opinion as used in the RAND method [8]. For each SR, we conducted an abbreviated update search as described in previous publications [7,9,34]. We employed the strategies used in the original published SRs but limited the sources searched to five general medical journals (Annals of Internal Medicine, BMJ, JAMA, The Lancet, and New England Journal of Medicine) and approximately five topic-specific specialty journals (usually the journals that contributed the most evidence to the original report); if a particular specialty journal was not catalogued in PubMed, we searched the more relevant database as well. These searches were conducted for a time period starting six months prior to the last date covered by the searches for the original SR (to minimize the number of relevant studies missed due to delayed publication) up to the present. We also assessed the eligibility of studies referenced by content experts (further detail on the experts in the following sections). After removing duplicates from identified records, one reviewer used the inclusion/exclusion criteria specified in the original SR to screen titles and abstracts and then full texts of potentially relevant records. For each included new study, one reviewer extracted relevant data on study characteristics (for example, design, sample size, follow-up duration), demographic factors for study participants (for example, age, sex, condition), treatment (for example, type, frequency, dose), outcome characteristics, and results into an evidence table.

Figure 1 The process of surveillance assessment for a systematic review (SR). Figure 1 portrays the overall process of surveillance assessment for an SR, which mainly includes: 1) literature search, 2) contacting experts, and 3) obtaining safety alerts from various sources sent by ECRI (one of the AHRQ Evidence-based Practice Centers). The hits identified by the literature search were transferred to a Reference Manager database and then screened by: 1) title and abstract, and 2) full text. Data were extracted from the studies that were deemed eligible for inclusion. Next, the extracted data were assessed to identify qualitative and quantitative signals. Then, the findings from literature, expert opinion, and safety alerts were collated and assessed for updating priority status (high, medium, or low). If an SR was deemed 'high' priority for updating, it was referred to AHRQ for updating. If an SR was deemed 'medium' or 'low' priority for updating, it was re-assessed six months after the completion of the first assessment.

Ascertainment of updating signals
To identify signals/triggers for updating, we applied qualitative and/or quantitative criteria [7] to the abstracted evidence for each conclusion in the original SR. For each conclusion, we first documented the absence of new evidence (that is, no new evidence or new evidence showing the same or similar conclusion as the original SR) or the presence of new evidence meeting the pre-defined criteria of signal(s) indicating a need for updating (Table 1). We then assessed whether new evidence provided or contributed to a qualitative or quantitative signal. One example of a qualitative signal might include finding a newly published pivotal trial with results opposite to that of the original SR with respect to an efficacy outcome (for example, effective versus ineffective or vice versa) or a harm (for example, a newly identified risk of harm that outweighs the previously observed benefits). The original definition of a pivotal trial was one published in one of the top five general medical journals or a trial whose sample size was at least triple that of the largest trial in the original SR [7]. For this application we made some adaptations to account for key questions for which observational studies were the study design of choice; namely we did not require new large cohort studies to have at least three times the number of participants as existing large cohorts. Other examples of qualitative signals included a superior new treatment (for example, a new treatment significantly more effective than one assessed in the SR); or a new population subgroup (that is, the treatment assessed in the SR has subsequently been tested on a new population). In contrast, new evidence generates a quantitative signal if its incorporation into a SR's original meta-analysis changes a statistically non-significant pooled estimate into a statistically significant one or vice versa [7].
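The pivotal-trial definition and the quantitative signal lend themselves to a short sketch. The Python below is illustrative only and rests on assumptions not stated in the text: the "top five" journals are taken to be the five general medical journals searched in this project, effects are log-scale estimates with standard errors, pooling uses a fixed-effect inverse-variance model, and significance means a two-sided test at the 0.05 level. All function names are hypothetical.

```python
import math
from typing import Sequence

# Assumption: the "top five" journals are the five general journals searched here.
TOP_GENERAL_JOURNALS = {
    "Annals of Internal Medicine",
    "BMJ",
    "JAMA",
    "The Lancet",
    "New England Journal of Medicine",
}


def is_pivotal(journal: str, sample_size: int, largest_original_trial: int) -> bool:
    """Original definition: top-five journal OR at least triple the largest trial."""
    return journal in TOP_GENERAL_JOURNALS or sample_size >= 3 * largest_original_trial


def pooled_is_significant(
    effects: Sequence[float], std_errors: Sequence[float], z_crit: float = 1.96
) -> bool:
    """Fixed-effect inverse-variance pooling (an assumed model) of log-scale effects."""
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return abs(pooled / pooled_se) > z_crit


def quantitative_signal(
    orig_effects: Sequence[float],
    orig_ses: Sequence[float],
    new_effects: Sequence[float],
    new_ses: Sequence[float],
) -> bool:
    """True when adding the new studies flips the significance of the pooled estimate."""
    before = pooled_is_significant(orig_effects, orig_ses)
    after = pooled_is_significant(
        list(orig_effects) + list(new_effects), list(orig_ses) + list(new_ses)
    )
    return before != after
```

With made-up inputs, two small original trials (log effects -0.2 and -0.1, standard errors 0.15 and 0.2) give a non-significant pooled estimate; adding one precise new trial (log effect -0.3, standard error 0.08) makes it significant, so `quantitative_signal` returns `True`. The adaptation for observational key questions described above (dropping the triple-size requirement for cohort studies) is not modeled here.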

Clinical content experts
We identified and contacted two sets of clinical experts: a) those who had worked on the SR in question (for example, the project lead, clinical lead, members of the technical expert panel, and peer reviewers) and b) other clinical experts in the clinical content area who had not worked on the SR in question (for example, local or external subject matter experts). For each SR, we created a matrix that included each of the original key questions and a summary of each conclusion in the original report.
Respondents were asked to provide their opinions on whether or not each conclusion was still valid. They were also asked to provide reference citations for any new studies they were aware of that might invalidate or otherwise alter the conclusion(s) as well as studies that were pertinent to the topic but might not address a particular conclusion directly (for example, studies of newer treatments that may have rendered the original treatments out-of-date). The responding experts were offered a small honorarium; reminders were sent to experts who did not initially respond.

Safety alerts
We examined safety and adverse event alerts relevant to each SR. This information was collected from MedWatch, the US Food and Drug Administration's safety information and adverse event reporting system; the UK's Medicines and Healthcare products Regulatory Agency (MHRA); and Health Canada.

Determination of updating status for SRs
The information on updating signals, expert opinion, and safety alerts was collated, summarized, and tabulated. Taking into consideration the totality of evidence, we used a set of decision rules/guidance originally used in the pilot studies [8,9] to characterize any given KQ-related conclusion(s) as up-to-date, possibly out-of-date, probably out-of-date, or out-of-date. Based on the totality of these characterizations, each SR was assigned to high, medium, or low updating priority groups. The decision to assign a high priority was not based strictly on the proportion of conclusions determined to be probably or definitely out-of-date, but rather, was a global judgment informed by a set of guidelines; for example, one out-of-date conclusion that could result in harm or inferior treatment could give rise to a high priority for updating. The criteria for determining updating status are provided in Additional file 1.
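The conclusion-level categories can be illustrated with a small rule-based sketch in Python. It is a simplification under explicit assumptions: the amount of new evidence is reduced to three hypothetical levels ("none", "some", "substantial"), a "majority" of experts is taken to mean strictly more than half of those responding, and a conclusion with no flagged evidence and no dissenting experts is treated as up-to-date. The function name and inputs are invented for the example, and the global priority for the whole SR remains a judgment that this sketch does not attempt to encode.

```python
def classify_conclusion(
    new_evidence: str,
    experts_flagging: int,
    experts_responding: int,
    prima_facie: bool = False,
) -> str:
    """Assign one of the four currency categories to a single conclusion.

    new_evidence: "none", "some", or "substantial" (assumed coding).
    experts_flagging: responding experts who judged that new evidence might
        change the conclusion.
    prima_facie: True for events such as a market withdrawal or a black box
        warning, which a limited search can establish on its face.
    """
    if new_evidence not in {"none", "some", "substantial"}:
        raise ValueError(f"unknown evidence level: {new_evidence!r}")
    if prima_facie:
        return "out-of-date"
    share = experts_flagging / experts_responding if experts_responding else 0.0
    if new_evidence == "substantial" or share > 0.5:
        return "probably out-of-date"
    if new_evidence == "some" or share > 0:
        return "possibly out-of-date"
    return "up-to-date"


# One of four responding experts flags new evidence; the search found none.
example = classify_conclusion("none", experts_flagging=1, experts_responding=4)
```

Here `example` is `"possibly out-of-date"`, because a minority of responding experts flagged the conclusion. A real assessment would weigh these inputs together rather than apply them mechanically.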
For each of the SRs that underwent surveillance, we summarized our findings in a brief report. These reports are now posted on the AHRQ website along with the original SRs to which they refer.

Assessment of the findings across SRs
To gain a sense of how long it takes for SRs to go out-of-date, we assessed, among the SRs that went through the surveillance process at least once, the proportion that received a high or medium priority for updating as a function of the length of time since their publication and since the date of their latest searches.

Sampling of SRs for assessment
Between June 2011 and November 2012, we assessed 24 SRs at least once. When we implemented the surveillance system, a backlog of SRs had accumulated and needed to be assessed. In addition, there was a 3- to 17-month lag between the completion of the original or update search and the release of the reports. Thus, there was a time span of 11 to 62 months between the completion of the original or update searches and the surveillance search (Table 2).
The characteristics of the 24 SRs and the corresponding surveillance assessments are presented in Table 2. Briefly, the number of key questions (the questions that frame AHRQ SRs) across the 24 SRs ranged from three [10,17,26,33] to seven [12,13,27], although each key

Table 1 Criteria for determining that a conclusion is out-of-date

Ottawa method: qualitative criteria for potentially invalidating signals
A1 Opposing findings: a pivotal* trial or systematic review (or guidelines) including at least one new trial that characterized the treatment in terms opposite to those used earlier
A2 Substantial harm: a pivotal trial or systematic review (or guidelines) whose results called into question the use of the treatment based on evidence of harm or that did not proscribe use entirely but did potentially affect clinical decision-making
A3 A superior new treatment: a pivotal trial or systematic review (or guidelines) whose results identified another treatment as significantly superior to the one evaluated in the original review, based on efficacy or harm

RAND method
2 Original conclusion is possibly out-of-date and this portion of the original report may need updating. If we found some new evidence that might change the CER conclusion, and/or a minority of responding experts assessed the CER conclusion as having new evidence that might change the conclusion, we classified the CER conclusion as possibly out-of-date
3 Original conclusion is probably out-of-date and this portion of the original report may need updating. If we found substantial new evidence that might change the CER conclusion, and/or a majority of responding experts assessed the CER conclusion as having new evidence that might change the conclusion, we classified the CER conclusion as probably out-of-date
4 Original conclusion is out-of-date. If we found new evidence that rendered the CER conclusion out-of-date or no longer applicable, we classified the CER conclusion as out-of-date. Recognizing that our literature searches were limited, we reserved this category only for situations where a limited search would produce prima facie evidence that a conclusion was out-of-date, such as the withdrawal of a drug or surgical device from the market, a black box warning from FDA, and so on

Abbreviations: CER, comparative effectiveness review; FDA, Food and Drug Administration.
*A pivotal trial is defined as a trial that is published in one of the top five general medical journals or a trial whose sample size is at least triple that of the largest trial in the original systematic review.
Legend: Table 1 presents the criteria used to determine if a conclusion is out of date within an SR (here CER). Criteria A1 to B2 come from the Ottawa method and criteria 1 to 4 are based on the RAND method.

[30]. The response rates ranged from 20% (2/10) to 100% (6/6) with a median of 35%.
Ten SR topics underwent a second surveillance assessment. For those SRs, we contacted only those experts who had responded in the first round. Across these ten SRs, 39 experts were contacted, and 27 responded, with response rates ranging from 40% to 100%. The median response rate was 71%, double the 35% median response rate across all topics in the first round. Across these ten SRs, which underwent a second surveillance assessment about six months after the end of the prior assessment, there were 265 conclusions contained within 53 key questions. Of these, eight conclusions changed between the first and second surveillance: seven conclusions changed from 'up-to-date' to 'possibly out-of-date', and one conclusion changed from 'possibly out-of-date' to 'probably out-of-date'. One of the ten SRs changed priority for updating from 'low' to 'medium'.

Factors associated with priority decisions
We assessed whether the length of time that had elapsed between the search conducted for the original report and the update surveillance search (search time lapse, STL) was associated with priority status for updating. Seven SRs were released prior to January 2010 [10,11,17,18,20,22,26] (that is, more than 18 months before the start of the Surveillance Program); of these seven, two were the SRs judged as being 'high' priority for updating, three were judged as being 'medium' priority, and two were judged as being 'low' priority for updating. Of the remaining 17 SRs, released after January 2010, only two were judged as being 'medium' priority for updating and the rest were low priority. All SRs released within the year prior to the start of the Surveillance Program (between June 2010 and June 2011) were judged as being 'low' priority. Figure 2a and b present the updating priority decisions for the 24 SRs by the time elapsed since the search date in the original review (2a) and the number of new relevant articles identified during the surveillance process (2b). While more SRs were classified as medium or high priority for updating as both the STL and the number of new relevant articles increased, there was substantial overlap, and no threshold existed for either time or number of articles that could accurately predict classification of SRs into different categories.

The possible role of safety alerts
We identified applicable safety alerts for 9 of the 24 SRs assessed. FDA provided alerts for all nine of those SRs [12,13,15,17,20,27,28,31,33]; MHRA and Health Canada were the sources of alerts for only one SR [33]. None of the agents, devices, or procedures evaluated in the 24 SRs for which we performed the surveillance assessments had an FDA black box warning (the strongest FDA warning, indicating a significant risk of serious or even life-threatening adverse effect) issued during our assessment period. In only one case was the updating priority of a SR influenced by a safety alert [27].

Discussion
Our results indicate that a small proportion of AHRQ-supported SRs may need updating within one to two years of the date of their last search. Of the 24 SRs assessed between June 2011 and November 2012, 17 (71%) were classified as having low priority for updating, and five SRs (21%) had medium priority for updating. Only two SRs (8%) were deemed to have high priority. Greater elapsed time from the end date of the original search and a larger number of new relevant studies were both associated with a higher priority for updating, but no thresholds were identified that could perfectly classify SRs into priority categories. This finding suggests that expert opinion will be a necessary component of an efficient system of searching for signals for updating.
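The reported proportions follow directly from the counts. The short Python check below reproduces the arithmetic; the variable and function names are illustrative only.

```python
# Priority counts reported for the 24 SRs assessed at least once.
priority_counts = {"low": 17, "medium": 5, "high": 2}
total = sum(priority_counts.values())


def percent(count: int, total: int) -> int:
    """Share of the total, rounded to the nearest whole percent."""
    return round(100 * count / total)


shares = {priority: percent(n, total) for priority, n in priority_counts.items()}
```

The resulting shares are 71%, 21%, and 8% for low, medium, and high priority, matching the figures in the text.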
Several of the SRs were classified as low priority for updating despite having a large number of newly identified potentially relevant studies. One explanation for this finding is that, in general, many of these new studies had small sample sizes or few primary outcomes and their results were consistent with those of the original SRs, thus not justifying updating those existing SRs. Conversely, the presence of a single new study with many outcome events can be a sufficient signal of the need for a high priority update, as with the publication of the Prostate Cancer Intervention Versus Observation Trial (PIVOT) [52] and the SR on therapies for clinically localized prostate cancer [22]. A recent study that examined factors that predicted 69 decisions on whether to update 41 reviews of drug effectiveness found that the number of relevant new studies was a significant predictor of a decision to update a review (OR 1.06 for each new trial) [6]. This study, conducted for the Drug Effectiveness Review Project (DERP), was designed to examine the surveillance process implemented in 2006 to replace what had been a policy of mandatory annual updates. The DERP process is qualitatively similar to our surveillance method, in that it uses limited literature searches, information from FDA and Health Canada, and expert input. The study also found that identification of a new drug significantly increased the likelihood of an update (OR = 5.71) and that reviews of psychiatric drugs were always recommended for an update. The authors did not report whether there were thresholds of articles or time that perfectly predicted decisions for updating.
A major difference between that study and ours, aside from our broader focus on all types of clinical interventions, is that the decision to update a DERP report rests with a panel of participants comprising physicians and representatives of the state Medicaid agencies and the Canadian Agency for Drugs and Technologies in Health, for whom the appearance of a new drug requires a policy decision.

Table 3 Currency of individual conclusions within each key question of the 24 comparative effectiveness reviews (CERs) and their priority status for updating (high, medium, and low) based on the updating surveillance assessments

Lessons learned
The implementation of the surveillance assessment program to determine the currency of published AHRQ SRs has presented a number of challenges. These challenges included differences across reports in the ways conclusions were presented, the responsiveness of report staff and experts, and delays in the release of the original reports themselves combined with differences in the length of time between release and surveillance.

Inconsistency in presentation of conclusions
Not all SRs presented their KQs and the corresponding conclusions in the executive summary in a similar manner (that is, the degree of detail, format, or level of summarization may have varied). For example, in some SRs, conclusions were, by necessity, stratified by subpopulation, intervention, outcome, or other study characteristics, resulting in multiple conclusions for a single question. In some SRs, the executive summaries failed to present sufficient detail to enable reviewers to extract at least one specific, clearly formulated conclusion for each key question; therefore, the reviewers had to probe the entire text of the SR report. Conversely, some executive summaries simply reproduced the results from the report text without drawing any conclusions, leaving the experts to whom we sent the information to draw their own conclusions. Some conclusions were not readily amenable to updating, for example, conclusions regarding the prevalence of certain risk factors in specific populations.

Responsiveness of report staff and experts
Conducting the surveillance on schedule required that the project leads for the original reports, and the experts they recommended we contact, respond in a timely manner. However, project leads and experts varied widely in their responsiveness to our requests, and response rates were low in the first round of surveillance. It is unclear what this low response means, since the sample is not intended to be a random sample of some larger population. In the second round of surveillance, the response rate improved considerably, suggesting that over time, the surveillance process will become more efficient.

Delays in release of some reports
In several cases, surveillance was delayed because a report was not released on schedule. The primary impact of such delays was on our staff's ability to plan their work schedules, as they would have reserved time for these reports and would need to find other surveillance work or work on our own evidence reviews when a report expected for surveillance failed to materialize.

Limitations
One limitation of the surveillance system is that it requires subjective global judgments. The assessment of currency and validity of conclusions for each key question in a SR was based on the totality of information compiled through multiple sources such as the qualitative/quantitative signals, expert opinion, and safety alerts. Although we used operational and standardized definitions throughout the process to promote consistency in the assessments, the overall judgment must necessarily be subjective in characterizing individual conclusions. However, since neither the STL nor the number of new relevant studies can classify SRs perfectly as low, medium, or high priority status for updating, this subjective human assessment is going to be needed in an efficient surveillance system. Future work should seek to make these judgments as reliable as possible across raters. The strength of evidence should be investigated in future work.
A second limitation is that we present data for only 34 surveillance assessments on 24 SRs. However, only two published evaluations have included more assessments than ours. A study by Shojania and colleagues assessed 100 systematic reviews to determine how quickly they go out-of-date, but this study limited its sample to meta-analyses that produced a summary estimate of outcome, and then further limited the analysis to only one outcome per study [7]. The DERP study reported the results of surveillance on 41 of their reports [6], but these reports assessed only drugs, and the decisions about updating were made by stakeholders for whom the approval of a new drug was highly relevant to policy decision-making. Our study, by contrast, assessed a broad array of health care interventions, and considered changes in evidence that might lead to changes in practice as the criterion for a signal for updating.
In sum, we found that only a small proportion of AHRQ-sponsored systematic reviews triggered signals for updating within one or two years of the date of their last search, and that neither the elapsed time since the original search nor the number of new articles could perfectly predict which SRs may be in need of updating. Our experience also provided some insight into what might be the optimal time for a first assessment and subsequent surveillance assessments. Among the 24 SRs assessed within the first 18 months of surveillance, only two were classified as high priority, five were classified as medium, and the rest were classified as low priority for updating (and a number of these reports had been released up to four years prior to the start of the surveillance). Furthermore, there were few changes in conclusions about updating in a second round of surveillance timed to start six months after the completion of the first round. These results suggest to us that a one-year time period between the release of a report and its first and subsequent surveillance assessments may be more efficient than the six-month time frame chosen for this application.