Table 3 F1-score performance for both the models and ensemble across all the classes

From: Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

F1-score (%)

| Label | RoBERTa base | RoBERTa large | BioBERT | PubMedBERT | COVID-Twitter | Ensemble |
|---|---|---|---|---|---|---|
| ORIGINAL | 91.06 | 91.33 | 91.44 | 91.94 | 90.61 | 92.35 |
| NON-ORIGINAL | 78.46 | 79.19 | 79.64 | 80.52 | 76.72 | 81.66^a |
| micro avg | 87.30 | 87.70 | 87.92 | 88.53 | 86.46 | 89.16^a |
| macro avg | 84.76 | 85.26 | 85.54 | 86.23 | 83.66 | 87.00^a |

^a Statistically significant improvement
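
The macro avg row appears to be the unweighted mean of the ORIGINAL and NON-ORIGINAL F1-scores. The table does not state this definition, so the following is a minimal sketch that checks it as an assumption against the reported values. The names `f1_by_model` and `macro_f1` are illustrative and do not come from the source. The micro avg row cannot be re-derived the same way, because it depends on per-class counts that the table does not report.

```python
# Per-class and macro-averaged F1-scores (%) copied from Table 3.
f1_by_model = {
    "RoBERTa base":  {"ORIGINAL": 91.06, "NON-ORIGINAL": 78.46, "macro avg": 84.76},
    "RoBERTa large": {"ORIGINAL": 91.33, "NON-ORIGINAL": 79.19, "macro avg": 85.26},
    "BioBERT":       {"ORIGINAL": 91.44, "NON-ORIGINAL": 79.64, "macro avg": 85.54},
    "PubMedBERT":    {"ORIGINAL": 91.94, "NON-ORIGINAL": 80.52, "macro avg": 86.23},
    "COVID-Twitter": {"ORIGINAL": 90.61, "NON-ORIGINAL": 76.72, "macro avg": 83.66},
    "Ensemble":      {"ORIGINAL": 92.35, "NON-ORIGINAL": 81.66, "macro avg": 87.00},
}


def macro_f1(per_class_f1):
    """Assumed definition: unweighted mean of the per-class F1-scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Compare the recomputed macro average with the value reported in the table.
for model, row in f1_by_model.items():
    computed = macro_f1([row["ORIGINAL"], row["NON-ORIGINAL"]])
    gap = abs(computed - row["macro avg"])
    print(f"{model}: computed {computed:.3f}, reported {row['macro avg']:.2f}, gap {gap:.3f}")
```

Under this assumption, every recomputed value should fall within rounding distance (0.01 percentage points) of the reported macro avg.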