Table 3 Key characteristics of development, calibration and evaluation data sets

From: Machine learning reduced workload for the Cochrane COVID-19 Study Register: development and evaluation of the Cochrane COVID-19 Study Classifier

| Data set (classifier development stage) | Size | Number of eligible records (%) | Number of title-only records (%) | Number of title-only records that were eligible (%) | Provenance of records |
| --- | --- | --- | --- | --- | --- |
| Data set 1 (Training) | 59,513 | 20,878 (35.1%) | 18,669 (31.4%) | 4495 (21.5%) | Embase: 3229 (5.4%); preprint: 2083 (3.5%); PubMed: 54,201 (91.1%) |
| Data set 2 (Calibration) | 16,123 | 6005 (37.2%) | 3626 (22.5%) | 821 (13.7%) | Embase: 1994 (12.4%); preprint: 287 (1.8%); PubMed: 13,842 (85.8%) |
| Data set 3 (Evaluation) | 4722 | 2310 (48.9%) | 896 (19.0%) | 285 (12.3%) | Embase: 89 (1.9%); preprint: 202 (4.3%); PubMed: 4431 (93.8%) |
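
The percentages in Table 3 can be recomputed from the raw counts. The sketch below is illustrative only and is not taken from the article: the dictionary layout and the names `DATASETS`, `pct`, and `summarize` are assumptions made for this example. It also assumes, based on the arithmetic, that the share of title-only records that were eligible is expressed relative to the number of eligible records, while the other shares are relative to the data set size.

```python
# Counts transcribed from Table 3. The structure and names are hypothetical.
DATASETS = {
    "Data set 1 (Training)": {
        "size": 59513, "eligible": 20878, "title_only": 18669,
        "title_only_eligible": 4495,
        "provenance": {"Embase": 3229, "preprint": 2083, "PubMed": 54201},
    },
    "Data set 2 (Calibration)": {
        "size": 16123, "eligible": 6005, "title_only": 3626,
        "title_only_eligible": 821,
        "provenance": {"Embase": 1994, "preprint": 287, "PubMed": 13842},
    },
    "Data set 3 (Evaluation)": {
        "size": 4722, "eligible": 2310, "title_only": 896,
        "title_only_eligible": 285,
        "provenance": {"Embase": 89, "preprint": 202, "PubMed": 4431},
    },
}


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def summarize(row: dict) -> dict:
    """Recompute the percentage columns for one row of the table."""
    size = row["size"]
    return {
        "eligible_pct": pct(row["eligible"], size),
        "title_only_pct": pct(row["title_only"], size),
        # Assumption: the denominator here is the eligible count, not the size.
        "title_only_eligible_pct": pct(row["title_only_eligible"], row["eligible"]),
        "provenance_pct": {src: pct(n, size) for src, n in row["provenance"].items()},
        # The provenance counts should add up to the data set size.
        "provenance_total": sum(row["provenance"].values()),
    }


if __name__ == "__main__":
    for name, row in DATASETS.items():
        print(name, summarize(row))
```

Recomputed values can differ from the published ones in the last digit because of rounding. For example, the PubMed share of Data set 2 is 13,842 / 16,123, or about 85.85%, which sits right at the boundary between 85.8% and 85.9%.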