Ensemble of deep learning language models to support the creation of living systematic reviews for the COVID-19 literature

Table 7 Metrics per label using the top-k retrieved categories

Model	{P,R,MAP}@1 (%)	P@3 (%)	R@3 (%)	MAP@3 (%)
RoBERTa_base	65.99	27.10	81.29	72.69
RoBERTa_large	67.29	28.12	84.37	74.86
BioBERT	68.55	28.63	85.89	76.16
PubMedBERT	68.33	28.47	85.42	75.92
COVID-Twitter-BERT	64.98	27.88	83.64	73.14
Ensemble	70.57	29.69	89.07	78.92

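P: precision; R: recall; MAP: mean average precision. Because this is a single-label task, each document has exactly one correct category, so precision, recall, and MAP coincide at k = 1, and P@3 can be at most 1/3 (about 33.3%).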
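The relationship between these metrics can be illustrated with a minimal sketch. It assumes the usual single-label definitions, which the table note does not spell out: a document counts as a hit if its true category appears among the top-k predictions, and its average precision is the reciprocal of the rank at which the true category appears (0 if it is missing). The function name `metrics_at_k` and the toy labels are hypothetical and are not taken from the paper.

```python
def metrics_at_k(ranked_labels, true_labels, k):
    """Return (P@k, R@k, MAP@k) in percent for a single-label task.

    ranked_labels: one list of predicted categories per document, best first.
    true_labels:   the single correct category of each document.
    """
    n = len(true_labels)
    hits = 0
    ap_sum = 0.0
    for ranking, truth in zip(ranked_labels, true_labels):
        top_k = ranking[:k]
        if truth in top_k:
            hits += 1
            # Assumed definition: with one relevant label, average precision
            # is the reciprocal rank of that label within the top k.
            ap_sum += 1.0 / (top_k.index(truth) + 1)
    recall = hits / n            # one relevant label per document
    precision = hits / (k * n)   # k labels retrieved per document
    mean_ap = ap_sum / n
    return 100 * precision, 100 * recall, 100 * mean_ap


# Toy data (hypothetical): the true label is found at rank 1, 2, 3, and not at all.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["b", "c", "d"]]
truths = ["a", "a", "a", "a"]

p1, r1, map1 = metrics_at_k(rankings, truths, k=1)
p3, r3, map3 = metrics_at_k(rankings, truths, k=3)
```

Under these definitions P@k equals R@k divided by k, which matches the table: for every model the P@3 column is one third of the R@3 column (for example, 89.07 / 3 ≈ 29.69 for the ensemble).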

ISSN: 2046-4053