Decoding semi-automated title-abstract screening: a retrospective exploration of the review, study, and publication characteristics associated with accurate relevance predictions

Background. We evaluated the benefits and risks of using the Abstrackr machine learning (ML) tool to semi-automate title-abstract screening, and explored whether Abstrackr’s predictions varied by review or study-level characteristics. Methods. For a convenience sample of 16 reviews for which adequate data were available to address our objectives (11 systematic reviews and 5 rapid reviews), we screened a 200-record training set in Abstrackr and downloaded the relevance (relevant or irrelevant) of the remaining records, as predicted by the tool. We retrospectively simulated the liberal-accelerated screening approach. We estimated the time savings and proportion missed compared with dual independent screening. For reviews with pairwise meta-analyses, we evaluated changes to the pooled effects after removing the missed studies. We explored whether the tool’s predictions varied by review and study-level characteristics. Results. Using the ML-assisted liberal-accelerated approach, we wrongly excluded 0 to 3 (0 to 14%) records that were included in the final reports, but saved a median (IQR) 26 (9, 42) hours of screening time. One missed study was included in eight pairwise meta-analyses in one systematic review. The pooled effect for just one of those meta-analyses changed considerably (from MD (95% CI) -1.53 (-2.92, -0.15) to -1.17 (-2.70, 0.36)). Of 802 records in the final reports, 87% were correctly predicted as relevant. The correctness of the predictions did not differ by review (systematic or rapid, P=0.37) or intervention type (simple or complex, P=0.47). The predictions were more often correct in reviews with multiple (89%) vs. single (83%) research questions (P=0.01), or that included only trials (95%) vs. multiple designs (86%) (P=0.003). At the study level, trials, mixed methods, and qualitative studies were more often correctly predicted as relevant compared with observational studies or reviews (P=0.0006).

concluded that ML tools could be used safely to prioritize relevant records, and cautiously to replace one of two human reviewers [8].
In spite of the clear need to create efficiencies in systematic review production [1,2] and the accrual of evidence highlighting the benefits and risks [8,9], and usability [10] of available ML tools, the adoption of ML-assisted methods has been slow [8,11,12]. In a 2019 commentary, O'Connor et al. summarized possible barriers to adoption, including distrust by review teams and end users of systematic reviews; setup challenges and incompatibility with traditional workflows; and inadequate awareness of available tools [13]. Most importantly, for widespread adoption to be achieved, review teams and other stakeholders need to feel confident that the application of ML-assisted title-abstract screening does not compromise the validity of the review (i.e., that important studies that could impact the results and conclusions are not erroneously omitted) [14].
Previously published studies undertaken at our evidence synthesis centre [10,15,16] have addressed the benefits (workload and estimated time savings) and risks (omitting relevant studies) of various ML-assisted screening approaches in systematic and rapid reviews. We have also explored the usability of some available ML tools [10]. Despite promising findings, in the absence of clear guidance or endorsement by evidence synthesis organizations, it remains unclear how ML-assisted methods should (or could) be incorporated into practice [13]. There is also little research documenting the conditions under which ML-assisted screening approaches may be most successfully applied. The extent to which ML-assisted methods could compromise the validity of systematic reviews' findings is a vitally important question, but few studies have reported on this outcome [17]. In this study, we aimed to address these knowledge gaps.
For a sample of 16 reviews, we:
1. evaluated the benefits (workload savings, estimated time savings) and risks (proportion missed) of using an ML tool's predictions in the context of the liberal-accelerated approach to screening [18,19] in systematic reviews, and assessed the impact of missed studies on the results of the systematic reviews that included pairwise meta-analyses; and
2. explored whether there were differences in the studies correctly predicted to be relevant by the ML tool and those incorrectly predicted to be irrelevant based on review, study, and publication characteristics.

Study conduct
We undertook this study in accordance with an a priori protocol, available at https://doi.org/10.7939/DVN/S0UTUF. We have reported the study in adherence to recommended standards [20].

Sample of reviews
Senior research staff (AG, MG, SAE, JP, and LH) selected a convenience sample of 16 reviews (11 systematic reviews and 5 rapid reviews) either completed or underway at our center. We selected the reviews based on the availability of adequate screening and/or study characteristics data to contribute to our objectives, in a format efficiently amenable to analysis. Table 1 shows the review-level characteristics for each, including the review type (systematic or rapid), research question type (single or multiple), intervention or exposure (simple vs. complex), and included study designs (single vs. multiple). We considered complex interventions to be those that could include multiple components as opposed to a single treatment (e.g., drug, diagnostic test); typically, these were behavioural interventions. Of the reviews, 11 (69%) were systematic reviews, 10 (63%) investigated a single research question, nine (56%) investigated simple interventions or exposures, and four (25%) included only single study designs. The sources searched for each review are in Additional file 1. All systematic reviews used comprehensive searches of electronic databases and grey literature sources, supplemented with reference list scanning. In the rapid reviews, only electronic databases were searched.
Although many modifications to standard systematic review methods may be applied in the completion of rapid reviews [21], for the purpose of this study we considered only the screening method. For the systematic reviews, title-abstract screening was completed by two independent reviewers who came to consensus on the studies included in the review. The review team typically included a senior reviewer (the reviewer who oversaw all aspects of the review and who had the most methodological and/or clinical expertise) and a second reviewer (who was involved in screening and often other review processes, like data extraction). For the rapid reviews, the screening was completed by one highly experienced reviewer (the senior reviewer). This approach is considered acceptable when evidence is needed for pressing policy and health system decisions [22].

Machine learning tool: Abstrackr
We used Abstrackr (http://abstrackr.cebm.brown.edu) [23], an online ML tool for title-abstract screening, for this study. Among the many available tools, we chose Abstrackr because it is freely available and testing at our centre found it to be more reliable and user friendly than other available tools [10]. Experienced reviewers at our centre (n = 8) completed standard review tasks in Abstrackr and rated it, on average, 79/100 on the System Usability Scale [10] (a standard survey commonly used to subjectively appraise the usability of a product or service) [24]. In our analysis of qualitative comments, reviewers described the tool as easy to use and trustworthy, and appreciated the simple and uncluttered user interface [10]. When used to assist the second reviewer in a pair (a semi-automated approach to screening), across three systematic reviews on average only 1% (range, 0 to 2%) of relevant studies (i.e., those included in the final reviews) were missed [10].
To screen in Abstrackr, all records retrieved by the searches must first be uploaded to the system. Once the records are uploaded, titles and abstracts appear one at a time on the user interface, and the reviewer is prompted to label each as 'relevant', 'irrelevant', or 'borderline'. While screening, Abstrackr learns from the reviewer's labels and other data via active learning and dual supervision [23]. In active learning, the reviewer must first screen a 'training set' of records from which the model learns to distinguish between those that are relevant or irrelevant based on common features (i.e., words or combinations of words) [23]. In dual supervision, the reviewers can communicate their knowledge of the review task to the model by tagging terms that are indicative of relevance or irrelevance (e.g., the term 'trial' may be imparted as relevant in systematic reviews that seek to include only trials) [23]. After screening a training set, the review team can view and download Abstrackr's relevance predictions for records that have not yet been screened. The predictions are presented to reviewers in two ways: a numeric value representing the probability of relevance (0 to 1) and a binary 'hard' screening prediction (true or false, i.e., relevant or irrelevant).

Data collection
Screening simulation. For each review, we uploaded all records retrieved by the searches to Abstrackr for screening. We used the single-reviewer and random citation order settings, and screened a 200-record training set for each review by retrospectively replicating the senior reviewer's original screening decisions. Abstrackr's ability to learn and accurately predict the relevance of candidate records depends on the correct identification and labeling of relevant and irrelevant records in the training set. Replicating the senior reviewer's decisions optimized the probability of a good quality training set. Although the optimal training set size is not known [7], the developers of a similar tool recommend a training set that includes at least 40 excluded and 10 included records, up to a maximum of 300 records [25].
For systematic reviews completed at our centre, any record marked as 'include' (i.e., relevant) or 'unsure' (i.e., borderline) by either of two independent reviewers at the title-abstract screening stage is eligible for scrutiny by full text. For this reason, our screening files typically include one of two screening decisions per record: 'include/unsure' (relevant) or 'exclude' (irrelevant). Because we could not ascertain retrospectively whether the decision for each record was 'include' or 'unsure', we entered all 'include/unsure' decisions as 'relevant' in Abstrackr. We did not use the 'borderline' decision.
After screening the training set, we downloaded the predicted relevance of the remaining records. Typically, these became available within 24 hours. In instances where the predictions did not become available within 48 hours, we continued to screen in batches of 100 records until they did. We used the hard screening predictions instead of applying custom thresholds based on the relevance probabilities for each remaining record. In the absence of guidance on the optimal threshold to apply, using the hard screening predictions likely reflects how review teams use the tool in practice.
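To make the distinction between the two kinds of prediction concrete, the following minimal Python sketch contrasts selecting records by the binary hard prediction with selecting them by a custom probability threshold. The `Prediction` structure, its field names, the example values, and the 0.3 threshold are all hypothetical; they do not represent Abstrackr's export format or its internal cut-off.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    record_id: str
    probability: float      # predicted probability of relevance, 0 to 1
    hard_prediction: bool   # the tool's binary relevant/irrelevant label

def relevant_by_hard_prediction(predictions):
    """Keep records that the tool itself labelled as relevant."""
    return [p.record_id for p in predictions if p.hard_prediction]

def relevant_by_threshold(predictions, threshold):
    """Keep records whose probability meets a reviewer-chosen threshold."""
    return [p.record_id for p in predictions if p.probability >= threshold]

# Hypothetical example data, not taken from any review in this study.
example = [
    Prediction("r1", 0.82, True),
    Prediction("r2", 0.41, False),
    Prediction("r3", 0.07, False),
]
```

With these made-up values, a lower custom threshold would send more records to the reviewer than the hard predictions do, which is the trade-off a review team would face if it chose its own threshold.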
Although potentially prone to bias, the liberal-accelerated screening approach [18,19] saves time in traditional systematic reviews even without the use of ML. In this approach, any record marked as 'include' or 'unsure' by either of two independent reviewers automatically moves forward to full text screening. Only records marked as 'exclude' by one reviewer are screened by a second reviewer to confirm or refute their exclusion. The time-consuming step of achieving consensus at the title-abstract level becomes irrelevant and is omitted.
Building on earlier findings from a similar sample of reviews [16], we devised a retrospective screening simulation to investigate the benefits and risks of using ML in combination with the liberal-accelerated screening approach, compared with traditional dual independent screening. In this simulation, after screening a training set of 200 records, the senior reviewer would download the predictions and continue screening only those that were predicted to be relevant. The second reviewer would screen only the records excluded either by the senior reviewer or predicted to be irrelevant by Abstrackr to confirm or refute their exclusion. This simulation was relevant only to the systematic reviews, for which dual independent screening had been undertaken. Since a single reviewer completed study selection for the rapid reviews, retrospectively simulating liberal-accelerated screening for these reviews was not possible.
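The routing logic of the simulation described above can be sketched as follows. This is an illustrative reading of the description, not the code used in the study: the `Record` structure and its field names are hypothetical, and the per-reviewer screening counts are simply tallied as a by-product rather than being the paper's workload calculation.

```python
from dataclasses import dataclass

@dataclass
class Record:
    record_id: str
    in_training_set: bool     # part of the senior reviewer's training set
    predicted_relevant: bool  # tool's hard prediction (unused for training records)
    senior_include: bool      # senior reviewer's original decision
    second_include: bool      # second reviewer's original decision
    in_final_report: bool     # included in the final review report

def simulate_liberal_accelerated(records):
    """Route each record through the ML-assisted liberal-accelerated approach."""
    forwarded, missed = [], []
    senior_screened = second_screened = 0
    for r in records:
        # The senior reviewer sees the training set and records predicted relevant.
        senior_sees = r.in_training_set or r.predicted_relevant
        if senior_sees:
            senior_screened += 1
            if r.senior_include:
                forwarded.append(r.record_id)  # moves straight to full text
                continue
        # Excluded by the senior reviewer or predicted irrelevant:
        # the second reviewer confirms or refutes the exclusion.
        second_screened += 1
        if r.second_include:
            forwarded.append(r.record_id)
        elif r.in_final_report:
            missed.append(r.record_id)  # relevant study wrongly excluded
    return {
        "forwarded": forwarded,
        "missed": missed,
        "senior_screened": senior_screened,
        "second_screened": second_screened,
    }

# Hypothetical records for illustration only.
demo = [
    Record("a", True, False, True, False, True),
    Record("b", False, True, False, True, True),
    Record("c", False, False, True, False, True),
    Record("d", False, False, False, False, False),
]
result = simulate_liberal_accelerated(demo)
```

In this toy example, record "c" is missed because the tool predicted it to be irrelevant, so the senior reviewer (who would have included it) never sees it, and the second reviewer excludes it.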
Differences in review results. To investigate differences in the results of systematic reviews when relevant studies are omitted, for systematic reviews with pairwise meta-analyses we re-ran the analyses for the primary outcomes of effectiveness omitting the studies that would have been erroneously excluded from the final reports via the semi-automated liberal-accelerated simulation. We investigated differences in the results only of systematic reviews with pairwise meta-analyses because the appraisal of this outcome among reviews with qualitative or quantitative narrative syntheses was not feasible within available time and resources. When the primary outcomes were not explicitly reported, we considered any outcome for which certainty of evidence appraisals were reported to be primary outcomes. Otherwise, we considered the first reported outcome to be the primary outcome.
Characteristics of missed studies. We pooled the data for the studies included in the final reports for all reviews to explore which characteristics might be associated with correctly or incorrectly labeling relevant studies. From the final report for each review, we extracted the risk of bias (low, unclear, or high) and design (trial, observational, mixed methods, qualitative, or review) of each included study. For reviews that included study designs other than randomized trials, we considered methodological quality as inverse to risk of bias. We categorized the risk of bias based on the retrospective quality scores derived from various appraisal tools (Additional file 2). We also documented the year of publication and the impact factor of the journal in which each included study was published based on 2018 data reported on the Journal Citation Reports website (Clarivate Analytics, Philadelphia, Pennsylvania). A second investigator verified all extracted data prior to analysis.

Data analysis
We exported the data to SPSS Statistics (v.25, IBM Corporation, Armonk, New York) or StatXact (v.12, Cytel Inc., Cambridge, Massachusetts) for analysis. To evaluate the benefits and risks of using Abstrackr's predictions in the context of liberal-accelerated screening in systematic reviews we used data from 2x2 cross-tabulations to calculate standard metrics [8], as follows:
Proportion missed (error): of the studies included in the final report, the proportion erroneously excluded during title and abstract screening.
Workload savings (absolute screening reduction): of the records that need to be screened at the title and abstract stage, the proportion that would not need to be screened manually.
Estimated time savings: the estimated time saved by not screening records manually. We assumed a screening rate of 0.5 minutes per record and an 8-hour work day [26].
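The three metrics defined above reduce to simple arithmetic, as in the minimal Python sketch below. The function and parameter names are illustrative rather than taken from the study; the defaults of 0.5 minutes per record and an 8-hour work day are the assumptions stated in the text. The worked example uses 2409 records, the median workload savings reported in the Discussion for the semi-automated second screener approach.

```python
def proportion_missed(n_missed, n_in_final_report):
    """Share of studies in the final report that were erroneously excluded."""
    return n_missed / n_in_final_report

def workload_savings(n_not_screened_manually, n_records_to_screen):
    """Share of title-abstract records that would not need manual screening."""
    return n_not_screened_manually / n_records_to_screen

def estimated_time_savings(n_not_screened_manually,
                           minutes_per_record=0.5,
                           hours_per_day=8):
    """Return (hours saved, working days saved) under the stated assumptions."""
    hours = n_not_screened_manually * minutes_per_record / 60
    return hours, hours / hours_per_day

hours_saved, days_saved = estimated_time_savings(2409)
```

Under these assumptions, 2409 records correspond to roughly 20 hours, or about 3 working days when rounded to the nearest day, which matches the figures quoted in the Discussion.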
To determine the effect of missed studies on the results of systematic reviews with pairwise meta-analyses, we compared the pooled effect estimate, 95% confidence interval, and statistical significance when missed studies were removed from the meta-analyses to those from the original review. We did not appraise changes in clinical significance.
To explore which review, study, and publication characteristics might affect the correctness of Abstrackr's predictions, we first compared the proportion of studies incorrectly predicted as irrelevant by Abstrackr by review type (i.e., inclusion of only trials vs. multiple study designs; single vs. multiple research questions; systematic review vs. rapid review; complex vs. simple interventions) and by study characteristics (study design (trial, observational, mixed methods, qualitative, review) and risk of bias (low or unclear/high)) via Fisher exact tests. We compared the mean (SD) year of publication and impact factor of the journals in which studies were published among those that were correctly and incorrectly labeled via unpaired t-tests.
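For readers unfamiliar with the test, the sketch below gives a standard-library Python implementation of a two-sided Fisher exact test for a 2x2 table, summing the probabilities of all tables that are no more likely than the observed one. It is an illustration only: the study's analyses were run in SPSS and StatXact, and this sketch is not guaranteed to reproduce their reported P values. The example table uses the counts reported in the Results for review type (601 of 689 studies correctly predicted in systematic reviews, 95 of 113 in rapid reviews).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over tables at least as extreme (no more probable) than the observed
    # one; the small tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)

# Rows: systematic vs. rapid reviews; columns: correct vs. incorrect predictions.
p_review_type = fisher_exact_two_sided(601, 689 - 601, 95, 113 - 95)
```

The same function could be applied to the other review-level comparisons by substituting the corresponding counts of correct and incorrect predictions.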
Liberal-accelerated screening simulation
Table 3 shows the proportion missed, workload savings, and estimated time savings had the reviewers leveraged Abstrackr's predictions and the liberal-accelerated screening approach in each systematic review. Records missed are those that are included in the final report, but were excluded via the simulated approach at the title-abstract screening stage. To ascertain whether the simulated approach provided any advantage over screening by a single reviewer, we have also included a column showing the number and proportion of studies that the second reviewer would have missed had they screened the records in isolation.
Compared to dual independent screening, for five (50%) of the systematic reviews no studies were erroneously excluded via our simulated approach. In two (20%) systematic reviews, one record was erroneously excluded, equivalent to 1% of the included records in both reviews. In the remaining three (30%) reviews, three records were erroneously excluded, equivalent to 2 to 14% of the included studies. The simulated approach was advantageous (i.e., fewer records were missed) relative to screening by a single reviewer in six (60%) of the systematic reviews; in many cases, the difference was substantial (e.g., 11% vs. 43% in the Experiences of bronchiolitis review; 1% vs. 11% in the Activity and pregnancy review; 1% vs. 7% in the Treatments for bronchiolitis review; 14% vs. 24% for the VBAC review; 0% vs. 5% in the Brain injury review).

Impact of missed studies on the results
Among the five systematic reviews where studies were missed, three included pairwise meta-analyses (Activity and pregnancy, Antipsychotics, and Treatments for bronchiolitis) (Additional file 3). The single missed study for each of the Activity and pregnancy and Treatments for bronchiolitis reviews was not included in any of the meta-analyses. It is notable that the missed study in the Activity and pregnancy review was written in Chinese, although it did include an English abstract. Neither of the studies reported on the primary outcomes of their respective systematic reviews.

Association of study, review, and publication characteristics with predictions
The pooled dataset for the studies included in the 16 final reports contained 802 records for which Abstrackr had made a prediction (excludes those included in the training sets). Among these, Abstrackr correctly predicted that 696 (87%) were relevant, and incorrectly predicted that 106 (13%) were irrelevant after the 200-record training set.
Review characteristics. Table 4 shows the characteristics of the reviews, stratified by the correctness of Abstrackr's relevance predictions. Six-hundred-eighty-nine (86%) studies were included across the systematic reviews and 113 (14%) across the rapid reviews. There was no difference (P=0.37) in Abstrackr's ability to correctly predict the relevance of studies by review type (n = 601 (88%) of studies in systematic reviews and 95 (84%) of those in rapid reviews were correctly identified).
Two-hundred-ninety-seven (37%) studies were included in reviews that answered a single research question, and 505 (63%) were included in reviews that answered multiple questions. There was a statistically significant difference (P=0.01) in Abstrackr's ability to correctly predict the relevance of studies by research question type. Four-hundred-fifty (89%) studies in reviews with multiple research questions were correctly identified. The proportion of correctly identified studies was lower (n=246, 83%) in reviews with a single research question.
Four-hundred-three (50%) studies were included in reviews that tested a simple intervention/exposure, and 399 (50%) were included in reviews that tested complex interventions. There was no difference (P=0.47) in Abstrackr's ability to correctly predict the relevance of studies by intervention or exposure type (n=346 (86%) studies in reviews of simple interventions and 350 (88%) studies in reviews of complex interventions were correctly identified).
Two-hundred-one (25%) studies were included in reviews that included only one study design (trials or systematic reviews), while the remaining 601 (75%) were included in reviews that included multiple designs (including observational studies). There was a statistically significant difference (P=0.003) in Abstrackr's ability to correctly predict the relevance of studies by included study designs. Abstrackr correctly predicted the relevance of 122 (95%) studies in reviews that included only randomized trials as compared to 57 (79%) and 517 (86%) in reviews that included only systematic reviews, or multiple study designs, respectively.
Of the 620 studies for which we had risk of bias details, 120 (19%) were at low and 500 (81%) were at unclear or high overall risk of bias. There was a statistically significant difference (P=0.039) in Abstrackr's ability to correctly predict the relevance of included studies by risk of bias. Abstrackr correctly predicted the relevance of 438 (88%) of studies at unclear or high risk of bias as compared to 96 (80%) of those at low risk of bias.
Publication characteristics. Table 6 shows the characteristics of the publications, stratified by Abstrackr's relevance predictions. Across all studies, the mean (SD) publication year was 2008 (7

Discussion
Compared with dual independent screening, leveraging Abstrackr's predictions in combination with a liberal-accelerated screening approach resulted in few, if any, missed records (between 0 and 3 records per review, or 0 to 14 percent of those included in the final reports). The missed records would not have changed the conclusions for the main effectiveness outcomes in the impacted reviews; moreover, as we have previously shown, it is likely that in the context of a comprehensive search, missed studies would be identified by other means (e.g., reference list scans) [16]. The workload savings were substantial, and despite being not quite as efficient, considerably fewer studies were missed compared to screening by a single reviewer in many (60%) reviews. Included studies were correctly identified more frequently among reviews that included multiple research questions (vs. single) and those that included only randomized trials (vs. only reviews, or multiple study designs). Correctly identified studies were more likely to be randomized trials, mixed methods, and qualitative studies (vs. observational studies and systematic reviews).
As part of our previous work, we simulated four additional methods whereby we could leverage Abstrackr's predictions to expedite screening, including fully automated and semi-automated approaches [16]. The simulation that provided the best balance of reliability and workload savings was a semi-automated second screener approach, based on an algorithm first reported by Wallace et al. in 2010 [26]. In this approach, the senior reviewer would screen a 200-record training set and continue to screen only those that Abstrackr predicted to be relevant. The second reviewer would screen all records as per usual. The second reviewer's decisions and those of the senior reviewer and Abstrackr would be compared to determine which would be eligible for scrutiny by full text. Among the same sample of reviews, the records that were missed were identical to those in the liberal-accelerated simulation. The median (IQR) workload savings was 2409 (3616) records, equivalent to an estimated time savings of 20 (31) hours or 3 (4) working days. Thus, compared to the semi-automated second screener approach [26], the liberal-accelerated approach resulted in marginally greater workload and time savings without compromising reliability.
In exploring the screening tasks for which ML-assisted screening might be best suited, some of our findings were paradoxical. For example, studies were more often correctly identified as relevant in systematic reviews with multiple research questions (vs. a single question). There was no difference in the proportion of studies correctly identified as relevant among systematic reviews that investigated complex vs. simple interventions. There are likely a multitude of interacting factors that affect Abstrackr's predictions, including the size and composition of the training sets. More research is needed to inform a framework to assist review teams in deciding when or when not to use ML-assisted methods. Our findings are consistent with other studies which have suggested that ML may be particularly useful for expediting simpler review tasks (e.g., differentiating trials from studies of other designs) [27], leaving more complex decisions to human experts. Cochrane's RCT Classifier, which essentially automates the identification of trials, is one example of such an approach [27]. By automatically excluding 'obviously irrelevant' studies, human reviewers are left to screen only those where screening decisions are more ambiguous.
Our data suggest that combining Abstrackr's early predictions with the liberal-accelerated screening method may be an acceptable approach in reviews where the limited risk of missing a small number of records is acceptable (e.g., some rapid reviews), or the outcomes are not critical. This may be true for some scoping reviews, where the general purpose is to identify and map the available evidence [28], rather than synthesize data on the effect of an intervention on one or more outcomes. Even for systematic reviews, the recently updated Cochrane Handbook states that the selection of studies should be undertaken by two reviewers working independently, but that a single reviewer is acceptable for title-abstract screening [29]. Similarly, the AMSTAR 2 tool, used to judge the confidence in the results of systematic reviews [30], states that title-abstract screening by a single reviewer is acceptable if good agreement (at least 80%) between two reviewers was reached during pilot testing. The ML-assisted approach that we have proposed provides a good compromise between dual independent screening (most rigorous) and single-reviewer screening (less rigorous) for review teams looking to save time and resources while maintaining or exceeding the methodological rigour expected for high quality systematic reviews [29,30].
When conceptualizing the relative advantages of semi-automated title-abstract screening, it will be important to look beyond study selection to other tasks that may benefit from the associated gains in efficiency. For example, published systematic reviews frequently report limits to the searches (e.g., limited databases, published literature only) and eligibility criteria (e.g., trials only, English language only) [31], both of which can have implications for the conclusions of the review. If studies can be selected more efficiently, review teams may choose to broaden their searches or eligibility criteria, potentially missing fewer studies even if a small proportion are erroneously omitted through semi-automation.
Given the retrospective nature of most studies, the semi-automation of different review tasks has largely been studied as a set of isolated processes. Prospective studies are needed to bridge the gap between hypothetical opportunities and concrete demonstrations of the risks and benefits of various novel approaches. For example, recently a full systematic review was completed in two weeks by a research team in Australia using a series of semi-automated and manual processes [32]. The authors reported on the facilitators and barriers to their approaches [32]. To build trust, beyond replication of existing studies it will be important for review teams to be able to conceptualize, step-by-step, how ML can be integrated into their standard procedures [13] and under what circumstances the benefits of different approaches outweigh the inherent risks. As a starting point, prospective direct comparisons of systematic reviews completed with and without ML-assisted methods would be helpful to encourage adoption. There may be ways to incorporate such evaluations into traditional systematic review processes without substantially increasing reviewer burden.

Strengths and limitations
This is one of few studies to report on the potential impact of ML-assisted title-abstract screening on the results and conclusions of systematic reviews, and to explore the correctness of predictions by review, study, and publication-level characteristics. Our findings are potentially prone to selection bias, as we evaluated a convenience sample of reviews for which adequate data were available for analysis; however, we selected the reviews prior to knowing the results or how they would impact the findings of the present study, and no preferential criteria were applied. Although many tools and methods are available to semi-automate title-abstract screening, we used only Abstrackr and simulated a liberal-accelerated approach. The findings should not be generalized to other tools or approaches. Moreover, we used relatively small training sets in an attempt to maximize efficiency. It is possible that different training sets would have yielded more or less favourable results. The retrospective screening results for the rapid reviews are more prone to error and bias compared with those for the systematic reviews because a single reviewer completed study selection. Since a machine learning tool's predictions can only be as good as the data on which it was trained, it is possible that the study selection method used for the rapid reviews differentially impacted the accuracy of the predictions. Our findings related to the changes to a review's conclusions are applicable only to the reviews with pairwise meta-analyses, of which there were few. Further, because so few studies were missed, we were only able to assess for changes to eight meta-analyses in one systematic review. The findings should be interpreted cautiously and not extrapolated to reviews with other types of syntheses (e.g., narrative). Because our evaluation was retrospective, we estimated time savings based on a screening rate of two records per minute. Although ambitious, this rate allowed for more conservative estimates of time savings and for comparisons to previous studies that have used the same rate [10,15,16].

Conclusions
Our ML-assisted screening approach saved considerable time and may be suitable in contexts where the limited risk of missing relevant records is acceptable (e.g., some rapid reviews). ML-assisted screening may be most trustworthy for reviews that seek to include only trials; however, as several of our findings are paradoxical, further study is needed to understand the contexts in which ML-assisted screening is best suited. Prospective evaluations will be important to fully understand the implications of adopting ML-assisted systematic review methods, build confidence among systematic reviewers, and to gather reliable
Authors' contributions. AG contributed to the design of the study, data collection, verification, and analysis, and drafted the report. MG contributed to the design of the study, data collection and verification, and reviewed the report. DD contributed to data collection and analysis, and reviewed the report. SAE contributed to the design of the study and reviewed the report. JP contributed to the design of the study and reviewed the report. SR contributed to data verification and reviewed the report. BV contributed to data analysis and verification, and reviewed the report. LH contributed to the design of the study, oversaw all aspects of the research, and reviewed the report. All authors read and approved the final manuscript.
Acknowledgements. We thank the following investigators for allowing access to their review data: Aireen Wingert, Megan Nuspl, Dr. Margie Davenport, and Dr. Vickie Plourde. We thank Dr. Meghan Sebastianski and Samantha Guitard for their contributions to data entry.
b The training set was 200 records for all reviews except Diabetes and Visual Acuity, for which it was 300.
Table 3. Proportion missed, workload savings, and estimated time savings for each systematic review a