This study found that Abstrackr has the potential to reliably identify relevant citations and to reduce the screening workload by 9 to 80 %. In two datasets, all relevant citations were identified, and in each of the other two datasets, only one citation was missed. The false negative rate ranged from 2 to 21 %. Overall, precision was good, although it was affected by the complexity of the review.
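For reference, the two headline measures are used here in their conventional sense; the notation below is ours, with true positives (TP) and false positives (FP) counted at the title-and-abstract stage, and the workload saving taken as the proportion of retrieved citations that did not need manual screening:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Workload saving} = \frac{\text{citations not requiring manual screening}}{\text{total citations retrieved}}
```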
In the aHUS dataset, precision was only 16.8 %. This was due to the complexity of the study inclusion criteria, which admitted case series as well as other higher-quality study designs but not case reports, which were excluded during screening. Because of the lexical similarity between case reports and case series, excluding case reports introduced greater variance into the machine learning algorithm through apparently conflicting screening decisions and consequently reduced precision. The sensitivity analysis demonstrated that reducing this ‘noise’ could increase precision. The problem of ‘noise’ in machine learning is common [21], and one strategy to increase precision during the training phase is to include closely matching records [2] so that the active learning algorithm is not adversely affected, although this requires a degree of expertise to make decisions contrary to the PICOS (Participants, Interventions, Comparators, Outcomes and Study design) inclusion criteria. The ECHO sensitivity analysis had the worst precision (6.8 %) because, of its 15,920 citations, only about 0.9 % were relevant. Such imbalanced datasets are problematic for supervised machine learning models like Abstrackr because the predictions are biased towards the majority non-relevant class at the expense of the minority relevant class [22] and therefore include many incorrectly weighted predictions, i.e. irrelevant citations flagged as relevant. Nevertheless, this was offset by the considerable workload saving.
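The effect of this imbalance on precision can be shown with a rough back-of-the-envelope calculation (ours, not one reported in the study), combining the reported totals with an assumption of near-perfect recall: even if the algorithm correctly discards roughly seven in every eight irrelevant citations, the false positives still outnumber the true positives by more than ten to one.

```python
# Back-of-the-envelope illustration (not a calculation reported in the study)
# of why precision was so low in the ECHO sensitivity analysis. The totals and
# the 6.8 % precision are the reported figures; near-perfect recall is assumed.
total = 15_920
prevalence = 0.009                      # ~0.9 % of citations were relevant
relevant = round(total * prevalence)    # ~143 relevant citations
irrelevant = total - relevant           # ~15,777 irrelevant citations

true_pos = relevant - 2                 # two relevant citations were missed
precision = 0.068                       # reported precision
predicted_relevant = true_pos / precision
false_pos = predicted_relevant - true_pos

print(f"false positives ~ {false_pos:,.0f}")                  # ~1,900
print(f"false positive rate ~ {false_pos / irrelevant:.1%}")  # ~12 %
```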
The false negative rates ranged from 2 to 21.7 % and represent the percentage of citations that were relevant for further full text inspection but were predicted to be irrelevant by Abstrackr and were therefore ‘missed’. However, the actual percentages missed were in the range of 0 to 0.21 % and represent the true final proportion of citations missed by Abstrackr that were included in the review. The classification model was therefore almost completely reliable. The citations missed from the aHUS and ECHO datasets contained only a title and no abstract, which reduced the probability of their being predicted relevant. The aHUS sensitivity analysis missed two citations, both without an abstract. The ECHO sensitivity analysis also missed two citations, one without an abstract; the other did contain an abstract, and it is unclear why it was not identified as relevant. These problems could, however, be minimised by retaining citations without an abstract for manual inspection.
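The distinction between the two rates can be written explicitly (our notation; reading the reported 0 to 0.21 % as a proportion of all citations in each dataset is our interpretation):

```latex
\text{False negative rate} = \frac{FN}{TP + FN}, \qquad
\text{Proportion missed overall} = \frac{\text{included studies among the false negatives}}{\text{all citations in the dataset}}
```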
The complexity of the review PICOS criteria also affected the workload saving. The workload saving in the rituximab dataset was low (9 %) because the rituximab intervention involved multiple adjunctive chemotherapy treatments that overlapped with non-relevant studies. The good precision and perfect recall with the rituximab data were therefore offset by the minimal workload saving, suggesting that complex reviews may be less suited to semi-automated screening. Nevertheless, the average workload saving across the four datasets was 41 %, which is similar to the findings reported by the developers of Abstrackr, who achieved a 40 % saving in workload across two datasets [14].
Other data mining algorithms have achieved similar (40 %) workload savings [16], but recall (the identification of relevant records) was lower (90 to 95 %), partly because testing was performed on datasets without a specifically associated research question; this makes comparisons with the results of this study difficult. Another text mining algorithm [17] achieved workload savings ranging from 8.5 to 62 % across 15 test datasets, which is similar to our findings with Abstrackr (9 to 80 %), but those results were based on a threshold of a minimum 95 % recall of relevant studies, which is too low for systematic reviews. The developers of Abstrackr reported a recall of 100 % for relevant studies from three genetics-related datasets and 99 % for a fourth dataset, whilst the average specificity across the four datasets was 87 % [14]. Their results were based on training the algorithm with balanced datasets, which contained a similar number of relevant and irrelevant trials from the original systematic review, and then using the trained algorithm to automatically identify studies for the updates of the genetics-based systematic reviews. This approach is noteworthy because systematic reviews require update searches to be performed within 2 years of the first published version [23]; implementing this strategy by retaining the original classification model would therefore expedite the process of updating systematic reviews.
Strengths and weaknesses of the research
Our findings may be limited by the four datasets used; citations from other clinical specialities may yield different precision and workload savings and may miss more relevant studies for inclusion, especially if the title and abstract descriptions are inadequate or the study designs are more complex. Our datasets were from recently published systematic reviews that included trials published mostly from 1995 onwards, and they may therefore contain better descriptions than older trials published before the CONSORT reporting guidelines [24] were introduced in 1996. Nevertheless, our results for identifying relevant trials are similar to the high recall results of Wallace (2010 and 2012) and indicate that similar accuracy could be achieved with other datasets of medical citations. Previous text mining studies have mainly evaluated performance in terms of recall and specificity; however, we also analysed the precision of the predictive model, since this measures how accurately the algorithm selects studies for further full text inspection and mirrors the working steps of a systematic reviewer. Precision, however, is subjective and is influenced by the reviewer’s expertise, which can affect screening judgements. The ECHO sensitivity analysis demonstrated that the workload saving with semi-automated screening is more pronounced with large datasets, and greater savings could therefore have resulted had we screened larger reviews. Nonetheless, the results provide a reasonable estimate of the algorithm’s typical performance during semi-automated screening.
This study and others that have evaluated semi-automated screening with support vector models [14, 15], semantic vector models [16], and complement naïve Bayes models [17] indicate that considerable workload savings can be achieved. Abstrackr’s ability to identify all relevant citations was very high but imperfect. Such accuracy is nevertheless acceptable for a stand-alone tool used for scoping searches and non-systematic reviews, where not every published study needs to be included. It is noteworthy, however, that human citation screening is also imperfect, with relevant studies wrongly excluded [25]. Given that Abstrackr’s inaccuracy is similar to that of a human screener, it could be utilised as the second screener. Abstrackr’s classification prediction model uses a somewhat arbitrary cutoff point at which the proportion of citations screened triggers the algorithm’s predictions. This suggests that an adjustable stopping heuristic could be used to improve accuracy further, albeit with the trade-off that more citations are screened during the training phase.
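One form such an adjustable stopping heuristic could take is sketched below; this is a generic illustration under our own assumptions, not Abstrackr’s rule, and the window size is a hypothetical parameter that trades extra screening effort against the risk of stopping too early.

```python
def should_stop(screening_labels, window=200):
    """Illustrative stopping heuristic (not Abstrackr's implementation):
    stop manual screening once the last `window` citations screened, taken
    in the order suggested by the active learning model, contained no
    relevant records. A larger window means more manual screening but a
    lower risk of stopping before all relevant citations have surfaced."""
    recent = screening_labels[-window:]
    return len(recent) == window and not any(recent)

# Example: labels are True (relevant) / False (irrelevant) in screening order.
labels = [True, False, True] + [False] * 200
print(should_stop(labels, window=200))  # True -> stop and trust the predictions
```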
Future developments in semi-automated screening would benefit from retaining the classification model developed during the original review, so that future systematic review updates could be screened automatically without further reviewer input. Abstrackr is not currently a fully integrated tool: only the unscreened citations (the predictions) are exportable, and only the title bibliographic details are made available. Further development is needed to create a fully integrated tool that systematic reviewers and information specialists can use.
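Returning to the idea of retaining the original classification model, a minimal sketch of how this could work for an update search is shown below. It uses a generic scikit-learn pipeline with joblib persistence rather than Abstrackr’s own internals, and the data, file name, and variable names are hypothetical.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Titles/abstracts and screening decisions from the original review
# (hypothetical toy data standing in for the real screening record).
original_abstracts = [
    "Randomised trial of rituximab plus chemotherapy in follicular lymphoma",
    "Case report of an unrelated dermatological intervention",
]
original_labels = [1, 0]  # 1 = relevant at title/abstract stage, 0 = irrelevant

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(original_abstracts, original_labels)

# Retain the fitted model alongside the completed review ...
joblib.dump(model, "review_screening_model.joblib")

# ... then reload it when the update search is run, ranking the newly
# retrieved citations so the most probably relevant records surface first.
model = joblib.load("review_screening_model.joblib")
update_citations = ["Rituximab maintenance therapy in follicular lymphoma"]
relevance_scores = model.predict_proba(update_citations)[:, 1]
```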
Text mining algorithms have been proposed that would enable systematic reviewers to label keywords in order to bias the predictive classification model and further enhance performance [26], as sketched below. This approach could be further aided by citation enrichment; for example, keywords of high relevance, such as the PICOS details, should improve the recall of semi-automated screening algorithms (and of trial searching). Citation enrichment is already being used in the EMBASE project [27], in which citations are coded with the type of study design through crowdsourcing. Further research and innovation in this underexplored area are needed to advance current methods and eventually enable semi-automated screening to fully replace manual screening. Current text mining research [28] is focused on advancing screening retroactively and is constrained by the limitations of the available data. A more successful approach may require collaboration with biomedical database providers to ensure that citations are adequately labelled both prospectively and retrospectively, using strategies such as record linkage, crowdsourcing, or access to a central repository whereby PICOS details can be entered and linked to all bibliographic databases.
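One simple way such keyword biasing could be realised is to up-weight reviewer-labelled terms in the document-term matrix before a classifier is trained. This is an illustrative sketch under our own assumptions: the cited proposals [26] may work quite differently, and the PICOS terms and boost factor shown here are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Reviewer-labelled PICOS keywords to emphasise (hypothetical example terms).
picos_terms = {"rituximab", "lymphoma", "randomised"}
boost = 3.0  # up-weighting factor; the choice of value is an assumption

abstracts = [
    "randomised trial of rituximab for follicular lymphoma",
    "case report of an unrelated intervention",
]

vectoriser = TfidfVectorizer(stop_words="english")
X = vectoriser.fit_transform(abstracts).toarray()

# Multiply the columns corresponding to the PICOS keywords by the boost
# factor, biasing any downstream classifier towards these reviewer-chosen
# features; X would then be passed to the screening classifier as usual.
vocab = vectoriser.vocabulary_
for term in picos_terms & set(vocab):
    X[:, vocab[term]] *= boost
```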