Table 4 F1-score performance of the individual models and the ensemble across all subclasses

From: Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

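F1-score (%)

| Label     | RoBERTa base | RoBERTa large | BioBERT | PubMedBERT | COVID-Twitter | Ensemble |
|-----------|--------------|---------------|---------|------------|---------------|----------|
| EPI       | 88.17        | 88.05         | 88.38   | 88.70      | 87.26         | 89.47    |
| BASIC     | 78.15        | 78.85         | 79.20   | 78.13      | 78.36         | 80.47    |
| OTHER     | 78.44        | 79.22         | 79.86   | 80.72      | 76.71         | 81.97^a  |
| micro avg | 84.01        | 84.26         | 84.68   | 84.99      | 82.99         | 86.10^a  |
| macro avg | 81.59        | 82.04         | 82.48   | 82.51      | 80.77         | 83.97^a  |

^a Statistically significant improvement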
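
As a reading aid, the macro avg row can be checked against the per-class rows: a macro-averaged F1 is the unweighted mean of the per-class F1-scores. The sketch below is not from the paper. It recomputes that mean from the rounded values in Table 4, and the names `per_class_f1`, `reported_macro`, and `macro_f1` are illustrative assumptions. Because the inputs are already rounded to two decimals, the recomputed values may differ from the reported ones in the last digit. The micro avg row cannot be rebuilt the same way, since it depends on per-class counts that the table does not give.

```python
# Per-class F1-scores (%) copied from Table 4, in the order EPI, BASIC, OTHER.
per_class_f1 = {
    "RoBERTa base": [88.17, 78.15, 78.44],
    "RoBERTa large": [88.05, 78.85, 79.22],
    "BioBERT": [88.38, 79.20, 79.86],
    "PubMedBERT": [88.70, 78.13, 80.72],
    "COVID-Twitter": [87.26, 78.36, 76.71],
    "Ensemble": [89.47, 80.47, 81.97],
}

# Macro-averaged F1-scores (%) as reported in the "macro avg" row.
reported_macro = {
    "RoBERTa base": 81.59,
    "RoBERTa large": 82.04,
    "BioBERT": 82.48,
    "PubMedBERT": 82.51,
    "COVID-Twitter": 80.77,
    "Ensemble": 83.97,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1-scores (hypothetical helper)."""
    return sum(scores) / len(scores)


for model, scores in per_class_f1.items():
    recomputed = macro_f1(scores)
    # Inputs are rounded, so small last-digit differences are expected.
    print(f"{model}: recomputed {recomputed:.2f}, reported {reported_macro[model]:.2f}")
```

The comparison only confirms that the table is internally consistent under the usual definition of macro averaging; it says nothing about how the authors computed their scores.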