We developed a binary ML classifier with the aim of reducing screening workload for the CCSR. Calibrated to achieve 99% recall, the classifier reduced screening workload by 24.1% in the evaluation data set. This finding was especially encouraging given that the proportion of eligible records in this data set was close to 50%, and that almost one in five of the records were ‘title-only’, offering relatively few text features for classification compared with records with accompanying abstracts. Title-only records in the context of the COVID-19 pandemic can be resource- and time-intensive to assess manually: for many, further information must be found before their eligibility can be judged. A classifier able to reliably reject ineligible title-only records is therefore valuable and will free up human resources to assess the more unclear title-only records.
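The calibration step described above can be sketched in code. The following is an illustrative example only, not the study's actual implementation: given classifier probabilities and eligibility labels for a calibration set, it selects the highest cut-point whose recall meets the target (99% in this study), and then measures workload reduction as the fraction of records falling below that cut-point. All function and variable names are hypothetical.

```python
# Hypothetical sketch of cut-point calibration for a recall target.
# Not the study's code; names and data are illustrative.

def calibrate_cutpoint(scores, labels, target_recall=0.99):
    """Return the highest cut-point whose recall on (scores, labels)
    meets target_recall; labels are 1 = eligible, 0 = ineligible."""
    n_pos = sum(labels)
    # Try candidate thresholds from high to low: a higher threshold
    # rejects more records (greater workload saving) but risks recall.
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        if tp / n_pos >= target_recall:
            return cut
    return min(scores)  # fall back: keep every record for screening

def workload_reduction(scores, cut):
    """Fraction of records scoring below the cut-point, i.e. records
    a human screener no longer needs to assess."""
    rejected = sum(1 for s in scores if s < cut)
    return rejected / len(scores)
```

In this framing, the reported 24.1% corresponds to `workload_reduction` computed on the evaluation set with the cut-point chosen on the calibration set.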
One of the main strengths of this study is the quality of the three data sets. We were able to use highly representative records for each stage, with a high level of confidence in their quality, derived as they were from the Cochrane Centralised Search Service team and Cochrane Crowd. In addition, the training data set was fairly large (n=59,513), comprising both the class of interest (‘included’) and non-eligible records (‘excluded’). Records within the class of interest encompassed all eligible study types (observational, interventional, qualitative and modelling studies) and designs, and had good coverage across the range of possible study aims.
A potential limitation is that most records in each of the three study data sets were sourced from PubMed (a large proportion of which are also likely to have been indexed in Embase). This is unlikely to be an issue when applying the classifier to bibliographic records of journal articles identified from other database sources, but caution would be needed when applying it to records with a different structure, such as trial registry records. Many trial registry records contain information similar to a standard bibliographic record that could, in principle, be parsed and added to the title-abstract records prior to classification. However, it is important to know which fields map well to each other across the different record types, and in some cases to exclude fields of information that might confuse the classifier, such as trial exclusion criteria. Further work would therefore be needed to evaluate the performance of this classifier when applied to records incorporating selected text from trial registry records. We could also investigate incorporating such records into the sets used to retrain and recalibrate periodically updated versions of this classifier.
In this paper, we have focused on reporting the deployment of a machine learning classifier in a real-world scenario over a short period of time. The method employed, using train, test and calibration data sets and easily interpretable probabilities from a logistic regression classifier, provides a robust basis for future work and has proved acceptable to Cochrane. A workload reduction of ~25% is substantial given the high recall that must be achieved. However, we do not rule out that more sophisticated machine learning classification algorithms could push the reported workload savings marginally higher.
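The interpretability noted above comes from the form of the model: a logistic regression score is a weighted sum of text features passed through a sigmoid, so the output can be read directly as a probability of eligibility and individual feature weights can be inspected. A minimal illustration of this, assuming a hypothetical sparse feature representation (not the study's implementation):

```python
# Minimal illustration (assumed, not the study's code) of why
# logistic-regression scores are easy to interpret: the sigmoid of a
# weighted feature sum is directly a probability of eligibility.
import math

def logistic_score(weights, features, bias=0.0):
    """Probability that a record is eligible, given per-feature weights
    (dict: token -> weight) and the record's feature values."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Because each token's contribution to `z` is additive, a screener or methodologist can see which terms drove a record's score, which is harder with more opaque model families.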
The scope, aims, topics and text features of COVID-19 research evolve over time, which suggests that prospectively developed ML classifiers such as this one are likely to need periodic retraining, recalibration and re-evaluation, to minimise the risk of missing new strands of relevant research, with previously unseen text features, that are likely to emerge as the pandemic continues to unfold. Periodically updated training, calibration and evaluation data sets should be prospectively assembled from records covering three consecutive time periods, as we have done in the current study. This approach is robust in terms of external validity, as it is consistent with the real-world scenario in which such classifiers are deployed, where we do not know in advance how the research literature will evolve following their (re-)deployment. (Re-)calibrating and (re-)evaluating the classifier using records from the consecutive time periods immediately following the one covered by the (re-)training data set therefore gives further confidence (alongside the size and breadth of our study data sets) that subtle shifts over time in the scope and text features of bibliographic records of published COVID-19 research are unlikely to adversely affect the performance of the deployed classifier in the short term.
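The prospective assembly of the three data sets from consecutive time periods can be sketched as follows. This is an illustrative partitioning by date-added, with hypothetical names and boundary dates, not the study's actual pipeline:

```python
# Illustrative sketch (not the study's code) of assembling train,
# calibration and evaluation sets from three consecutive time periods,
# mirroring the prospective design described above.
from datetime import date

def prospective_split(dated_records, train_end, calib_end):
    """Partition (date_added, record) pairs into train / calibration /
    evaluation sets covering consecutive, non-overlapping periods."""
    train = [r for d, r in dated_records if d <= train_end]
    calib = [r for d, r in dated_records if train_end < d <= calib_end]
    evaln = [r for d, r in dated_records if d > calib_end]
    return train, calib, evaln
```

Splitting by time rather than at random is the design choice that makes the evaluation match deployment: the classifier is always assessed on records added after those it was trained and calibrated on.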
In late January 2021, the classifier developed in this study was deployed in the Cochrane COVID-19 register workflow, with records retrieved from PubMed and Embase.com being run through it. In practice, screening workload has been reduced by approximately 20–25%, in line with the reduction expected from this study. The classifier is also being used to help prioritise screening by ordering the records that score above the cut-point from highest to lowest score. Feedback from the screening team has indicated that records receiving high scores are almost always eligible studies, although they are often not the higher-priority interventional studies. This is very likely due to the high prevalence of observational studies in the data sets used.
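The prioritisation step described above amounts to filtering by the cut-point and sorting descending by score. A minimal hedged sketch, with illustrative names:

```python
# Hedged sketch of the screening-priority ordering described above:
# records at or above the cut-point are queued highest score first.
# Names are illustrative, not the deployed system's code.

def prioritise(scored_records, cut):
    """Return records scoring >= cut, ordered from highest to lowest
    score, so screeners see the most likely eligible records first."""
    kept = [(s, r) for s, r in scored_records if s >= cut]
    return [r for s, r in sorted(kept, key=lambda pair: -pair[0])]
```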
The Cochrane COVID-19 Study Classifier reduces screening burden by cutting the number of excludes to assess by approximately half. This is a helpful start, but with the proportion of eligible records around 50% (as it has been for the last 6 months for the CCSR), an ‘exclusion’ classifier can only do so much. In addition, the rate of publication on COVID-19 shows no sign of slowing, with the number of new studies identified for the CCSR averaging 4600 per month over the last 6 months. We are therefore developing additional automated approaches to maintain the CCSR. With over 60,000 COVID-19-related studies identified and tagged in the register, we are developing additional ML classifiers that will assign or suggest both study design characteristics and study aims for potentially eligible studies. We are also developing automated approaches to assigning PICO characteristics to interventional studies. Here, we will use crowd and ML capabilities in a hybrid approach to keep up with the deluge of publications on COVID-19.