Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

Table 5 F1-score performance of the individual models and the ensemble across all the sub-subclasses

All values are F1-scores (%).
Label	RoBERTa base	RoBERTa large	BioBERT	PubMedBERT	COVID-Twitter	Ensemble
EPI: Case report	83.91	84.70	86.55	84.65	81.97	86.85
EPI: Case series	62.76	62.30	65.12	63.42	58.60	65.37
EPI: Case–control study	31.79	40.98	35.51	36.80	32.65	39.02
EPI: Cohort study	51.26	53.18	52.85	56.33	48.68	54.10
EPI: Cross-sectional study	59.89	65.46	66.19	64.10	62.01	65.46
EPI: Diagnostic study	67.01	66.32	65.81	63.83	64.77	69.61
EPI: Ecological study	41.27	41.51	46.53	46.81	42.33	46.46
EPI: Guidelines	57.28	60.32	59.01	60.65	56.26	62.52
EPI: Modelling study	87.61	86.51	87.78	87.05	88.15	88.43^a
EPI: Other	21.34	19.33	17.82	17.54	17.61	21.33
EPI: Outbreak or surveillance report	32.81	30.71	30.30	32.28	33.99	38.30
EPI: Qualitative study	20.41	31.75	35.29	40.00	33.33	36.73
EPI: Review	66.44	65.94	67.59	66.22	63.77	70.78^a
EPI: Trial	56.76	60.76	73.68	68.35	55.70	71.60
BASIC: Animal experiment	65.12	71.91	57.53	57.89	57.78	72.29
BASIC: Basic research review	19.92	24.60	16.67	13.10	18.64	23.15
BASIC: Biochemical/protein structure studies	60.72	63.48	62.39	64.03	58.13	65.67
BASIC: In vitro experiment	36.36	48.75	41.61	44.05	42.77	46.36
BASIC: Sequencing and phylogenetics	68.68	66.94	72.06	69.64	67.33	70.08
BASIC: Within-host modelling	0.00	11.76	0.00	10.53	13.64	11.11
OTHER: Other	17.39	16.95	20.56	20.11	15.25	19.32
OTHER: Comment, editorial, …, non-original	78.28	79.22	79.54	80.79	76.83	82.03^a
micro avg	65.85	66.89	67.38	67.40	64.69	69.50^a
macro avg	49.41	52.43	51.84	52.19	49.55	54.84^a
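
The macro avg row appears to be the unweighted mean of the 22 per-label F1-scores in each column. The table does not state this, so it is treated here as an assumption (it is the usual definition of macro averaging). The sketch below recomputes the row for two columns from the values in the table; the names `macro_f1`, `ensemble`, and `roberta_base` are illustrative and do not come from the paper.

```python
from statistics import mean

# Per-label F1-scores (%) copied from the table, in row order
# (EPI: Case report ... OTHER: Comment, editorial, ..., non-original).
ensemble = [
    86.85, 65.37, 39.02, 54.10, 65.46, 69.61, 46.46, 62.52, 88.43, 21.33,
    38.30, 36.73, 70.78, 71.60, 72.29, 23.15, 65.67, 46.36, 70.08, 11.11,
    19.32, 82.03,
]
roberta_base = [
    83.91, 62.76, 31.79, 51.26, 59.89, 67.01, 41.27, 57.28, 87.61, 21.34,
    32.81, 20.41, 66.44, 56.76, 65.12, 19.92, 60.72, 36.36, 68.68, 0.00,
    17.39, 78.28,
]


def macro_f1(per_label_f1):
    """Assumed definition: unweighted mean of per-label F1, rounded to 2 places."""
    return round(mean(per_label_f1), 2)


print(macro_f1(ensemble))      # 54.84, matches the Ensemble macro avg
print(macro_f1(roberta_base))  # 49.41, matches the RoBERTa base macro avg
```

The micro avg row cannot be reproduced this way: micro averaging pools true positives, false positives, and false negatives across labels, and those counts are not reported in the table.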

ISSN: 2046-4053