Machine learning reduced workload for the Cochrane COVID-19 Study Register: development and evaluation of the Cochrane COVID-19 Study Classifier

Table 3 Key characteristics of development, calibration and evaluation data sets

Data set (classifier development stage)	Size	Number of eligible records (% of data set)	Number of title-only records (% of data set)	Number of eligible title-only records (% of eligible records)	Provenance of records, n (% of data set)
Data set 1 (Training)	59,513	20,878 (35.1%)	18,669 (31.4%)	4495 (21.5%)	Embase: 3229 (5.4%); preprint: 2083 (3.5%); PubMed: 54,201 (91.1%)
Data set 2 (Calibration)	16,123	6005 (37.2%)	3626 (22.5%)	821 (13.7%)	Embase: 1994 (12.4%); preprint: 287 (1.8%); PubMed: 13,842 (85.8%)
Data set 3 (Evaluation)	4722	2310 (48.9%)	896 (19.0%)	285 (12.3%)	Embase: 89 (1.9%); preprint: 202 (4.3%); PubMed: 4431 (93.8%)
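
The percentages in Table 3 use two different denominators: eligible and title-only records are expressed as a share of the whole data set, whereas the eligible title-only figures reproduce only when divided by the number of eligible records. The following is a minimal sketch that recomputes these percentages from the counts in the table. The names `DATASETS`, `pct` and `summarize` are illustrative assumptions and are not part of the published work.

```python
# Counts transcribed from Table 3. All names here are illustrative assumptions.
DATASETS = {
    "Data set 1 (Training)": {
        "size": 59513, "eligible": 20878,
        "title_only": 18669, "title_only_eligible": 4495,
    },
    "Data set 2 (Calibration)": {
        "size": 16123, "eligible": 6005,
        "title_only": 3626, "title_only_eligible": 821,
    },
    "Data set 3 (Evaluation)": {
        "size": 4722, "eligible": 2310,
        "title_only": 896, "title_only_eligible": 285,
    },
}


def pct(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


def summarize(counts):
    """Recompute the three percentage columns for one data set."""
    return {
        # Share of all records in the data set that were eligible.
        "eligible_pct": pct(counts["eligible"], counts["size"]),
        # Share of all records in the data set that had a title only.
        "title_only_pct": pct(counts["title_only"], counts["size"]),
        # Denominator assumed to be the eligible records, not the data set size.
        "title_only_eligible_pct": pct(
            counts["title_only_eligible"], counts["eligible"]
        ),
    }


if __name__ == "__main__":
    for name, counts in DATASETS.items():
        print(name, summarize(counts))
```

Dividing the eligible title-only counts by the data set size, or by the number of title-only records, gives values that do not match the table, which is why the sketch assumes the eligible records as the denominator for that column.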

ISSN: 2046-4053