Table 2 A summary of included extraction methods and their evaluation

From: Automating data extraction in systematic reviews: a systematic review

| Study | Extracted elements | Dataset | Method | Sentence/Concept/Neither | Full text (F)/Abstract (A) | Results |
| --- | --- | --- | --- | --- | --- | --- |
| Dawes et al. (2007) [12] | PECODR | 20 evidence-based medicine journal synopses (759 extracts from the corresponding PubMed abstracts) | Proposed potential lexical patterns and assessed them using NVivo software | Neither | Abstract | Agreement among the annotators was 86.6 and 85 %, rising to 98.4 and 96.9 % after consensus; no automated system |
| Kim et al. (2011) [13] | PIBOSO | 1000 medical abstracts (PIBOSO corpus) | Conditional random fields (CRF) with various features based on lexical, semantic, structural and sequential information | Sentence | Abstract | Micro-averaged F-scores of 80.9 % on structured and 66.9 % on unstructured abstracts; 63.1 % on an external dataset |
| Boudin et al. (2010) [16] | PICO (I and C combined) | 26,000 abstracts from PubMed; first sentences from the structured abstracts | Combination of multiple supervised classification algorithms: random forests (RF), naive Bayes (NB), support vector machines (SVM) and multi-layer perceptron (MLP) | Sentence | Abstract | F-score of 86.3 % for P, 67 % for I (and C), and 56.3 % for O |
| Huang et al. (2011) [17] | PICO (except C) | 23,472 sentences from the structured abstracts | Naïve Bayes (NB) | Sentence | Abstract | F-measure of 0.91 for patient/problem, 0.75 for intervention, and 0.88 for outcome |
| Verbeke et al. (2012) [18] | PIBOSO | PIBOSO corpus | Statistical relational learning with kernels (kLog) | Sentence | Abstract | Micro-averaged F-score of 84.29 % on structured abstracts and 67.14 % on unstructured abstracts |
| Huang et al. (2013) [19] | PICO (except C) | 19,854 structured abstracts of randomized controlled trials | First sentence of the section or all sentences in the section fed to an NB classifier | Sentence | Abstract | First sentence of the section: F-scores of 0.74 for P, 0.66 for I and 0.73 for O; all sentences in the section: 0.73 for P, 0.73 for I and 0.74 for O |
| Hassanzadeh et al. (2014) [20] | PIBOSO (Population, Intervention, Background, Outcome, Study Design, Other) | PIBOSO corpus: 1000 structured and unstructured abstracts | CRF with a discriminative set of features | Sentence | Abstract | Micro-averaged F-score: 91 % |
| Robinson (2012) [21] | Patient-oriented evidence: morbidity, mortality, symptom severity, quality of life | 1356 PubMed abstracts | SVM, NB, multinomial NB, logistic regression | Sentence | Abstract | Best results achieved with SVM: F-measure of 0.86 |
| Chung (2009) [22] | Intervention, comparisons | 203 RCT abstracts for training and 124 for testing | Coordinating constructs identified with a full parser and then classified as positive or not using CRF | Sentence | Abstract | F-score: 0.76 |
| Hara and Matsumoto (2007) [23] | Patient population, comparison | 200 abstracts labeled ‘Neoplasms’ and ‘Clinical Trial, Phase III’ | Noun phrases (NPs) categorized into classes such as ‘Disease’ and ‘Treatment’ using CRF; regular expressions then applied to sentences with the classified NPs | Sentence | Abstract | F-measure of 0.91 for noun phrase classification; sentence classification: F-measure of 0.8 for patient population and 0.81 for comparisons |
| Davis-Desmond and Molla (2012) [42] | Detecting statistical evidence | 194 randomized controlled trial abstracts from PubMed | Rule-based classifier using negation expressions | Sentence | Abstract | Accuracy between 88 and 98 % at 95 % CI |
| Zhao et al. (2012) [24] | Patient, result, intervention, study design, research goal | 19,893 medical abstracts and full-text articles from 17 journal websites | Conditional random fields | Sentence | Full text | F-scores for sentence classification: patient 0.75, intervention 0.61, result 0.91, study design 0.79, research goal 0.76 |
| Hsu et al. (2012) [25] | Hypothesis, statistical method, outcomes and generalizability | 42 full-text papers | Regular expressions | Sentence | Full text | For the classification task, F-score of 0.86 for hypothesis, 0.84 for statistical method, 0.9 for outcomes, and 0.59 for generalizability |
| Song et al. (2013) [26] | Analysis (statistical facts), general (generally accepted facts), recommend (recommendations about interventions), rule (guidelines) | 346 sentences from three clinical guideline documents | Maximum entropy (MaxEnt), SVM, MLP, radial basis function network (RBFN) and NB as classifiers; information gain (IG) and genetic algorithm (GA) for feature selection | Sentence | Full text | F-score of 0.98 for classifying sentences |
| Demner-Fushman and Lin (2007) [28] | PICO (I and C combined) | 275 manually annotated abstracts | Rule-based approach to identify sentences containing PICO elements; supervised classifier for outcomes | Concept | Abstract | Precision of 0.80 for population, 0.86 for problem, 0.80 for intervention, 0.64–0.95 for outcome |
| Kelly and Yang (2013) [29] | Age of subjects, duration of study, ethnicity of subjects, gender of subjects, health status of subjects, number of subjects | 386 abstracts from PubMed obtained with the query ‘soy and cancer’ | Regular expressions, gazetteer | Concept | Abstract | F-scores: age of subjects 1.0, duration of study 0.911, ethnicity of subjects 0.949, gender of subjects 1.0, health status of subjects 0.874, number of subjects 0.963 |
| Hansen et al. (2008) [30] | Number of trial participants | 233 abstracts from PubMed | Support vector machines | Concept | Abstract | F-measure: 0.86 |
| Xu et al. (2007) [32] | Subject demographics such as subject descriptors, number of participants, and diseases/symptoms and their descriptors | 250 randomized controlled trial abstracts | Text classification augmented with hidden Markov models to identify sentences; rules over the parse tree to extract the relevant information | Sentence, concept | Abstract | Precision: subject descriptors 83 %, number of trial participants 92.3 %, diseases/symptoms 51.0 %, descriptors of diseases/symptoms 92.0 % |
| Summerscales et al. (2009) [34] | Treatments, groups and outcomes | 100 abstracts from BMJ | Conditional random fields | Concept | Abstract | F-scores: treatments 0.49, groups 0.82, outcomes 0.54 |
| Summerscales et al. (2011) [35] | Groups, outcomes, group sizes, outcome numbers | 263 abstracts from BMJ between 2005 and 2009 | CRF, MaxEnt, template filling | Concept | Abstract | F-scores: groups 0.76, outcomes 0.42, group sizes 0.80, outcome numbers 0.71 |
| Kiritchenko et al. (2010) [36] | Eligibility criteria, sample size, drug dosage, primary outcomes | 50 full-text journal articles with 1050 test instances | SVM classifier to recover relevant sentences; extraction rules for the correct solutions | Concept | Full text | P5 precision of the classifier: 0.88; precision and recall of the extraction rules: 93 and 91 %, respectively |
| Lin et al. (2010) [39] | Intervention, age group of the patients, geographical area, number of patients, time duration of the study | 93 open-access full-text articles documenting oncological and cardiovascular studies from 2005 to 2008 | Linear-chain conditional random fields | Concept | Full text | Precision of 0.4 for intervention, 0.63 for age group, 0.44 for geographical area, 0.43 for number of patients and 0.83 for time period |
| Restificar et al. (2012) [37] | Eligibility criteria | 44,203 full-text clinical trial articles | Latent Dirichlet allocation with logistic regression | Concept | Full text | Accuracy of 75 and 70 % based on similarity for inclusion and exclusion criteria, respectively |
| De Bruijn et al. (2008) [40] | Eligibility criteria, sample size, treatment duration, intervention, primary and secondary outcomes | 88 randomized controlled trial full-text articles from five medical journals | SVM classifier to identify the most promising sentences; manually crafted weak extraction rules for the information elements | Sentence, concept | Full text | Precision: eligibility criteria 0.69, sample size 0.62, treatment duration 0.94, intervention 0.67, primary outcome 1.00, secondary outcome 0.67 |
| Zhu et al. (2012) [41] | Subject demographics: patient age, gender, disease and ethnicity | 50 randomized controlled trial full-text articles | Manually crafted rules for extraction from the parse tree | Concept | Full text | Disease extraction: F-score of 0.64 for exact matching and 0.85 for partial matching |
| Marshall et al. (2014) [27] | Risk of bias concerning sequence generation, allocation concealment and blinding | 2200 clinical trial reports | Soft-margin SVM for a joint model of risk-of-bias prediction and supporting-sentence extraction | Sentence | Full text | Sentence identification: F-scores of 0.56, 0.48, 0.35 and 0.38 for random sequence generation, allocation concealment, blinding of participants and personnel, and blinding of outcome assessment, respectively |
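Several of the sentence-level systems in Table 2 (e.g. Boudin et al. [16], Huang et al. [17, 19] and Robinson [21]) rely on standard supervised text classifiers trained over bag-of-words style features. The following minimal Python sketch illustrates that general approach with scikit-learn; the toy sentences, labels and feature settings are hypothetical and are not drawn from any of the original datasets or implementations.

```python
# Minimal sketch of supervised PICO sentence classification (not the
# authors' code): TF-IDF n-gram features plus a simple linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; real systems used thousands of
# labeled abstract sentences.
train_sentences = [
    "Patients aged 40-65 with type 2 diabetes were enrolled.",          # P
    "Participants received 20 mg atorvastatin daily for 12 weeks.",     # I
    "The primary outcome was change in HbA1c at 12 weeks.",             # O
    "Adults with stage II hypertension were recruited from two sites.", # P
]
train_labels = ["P", "I", "O", "P"]

# TF-IDF unigrams/bigrams feed a multinomial naive Bayes classifier; an SVM
# (sklearn.svm.LinearSVC) could be swapped into the same pipeline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_sentences, train_labels)

print(model.predict(["The intervention group was given metformin daily."]))
```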
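Rule-based concept extraction, as used by Kelly and Yang [29] and within several other systems in the table, typically amounts to matching hand-written patterns against abstract text. The sketch below is an illustrative assumption, not the published rules; the pattern, cue words and example abstract are invented.

```python
# Minimal sketch of regular-expression extraction of the number of trial
# participants from an abstract (illustrative pattern only).
import re

PARTICIPANTS = re.compile(
    r"\b(\d{1,6})\s+(?:patients|participants|subjects|women|men|adults)\b",
    re.IGNORECASE,
)

abstract = ("A total of 120 patients were randomized; 60 participants "
            "received the intervention and 60 participants received placebo.")

# Collect every "<number> <cue word>" mention in the text.
counts = [int(m.group(1)) for m in PARTICIPANTS.finditer(abstract)]
print(counts)       # [120, 60, 60]
print(max(counts))  # naive guess at the total sample size: 120
```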
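Conditional random fields recur throughout the table (Kim et al. [13], Summerscales et al. [34, 35], Lin et al. [39], among others) as sequence labellers over per-token features. A minimal sketch using the third-party sklearn-crfsuite package follows; the feature template, BIO tags and single training sentence are hypothetical toy inputs and do not reproduce the original systems.

```python
# Minimal sketch of CRF-based concept extraction with sklearn-crfsuite
# (pip install sklearn-crfsuite); toy features and labels only.
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple lexical and contextual features for token i."""
    return {
        "lower": tokens[i].lower(),
        "is_digit": tokens[i].isdigit(),
        "suffix3": tokens[i][-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# One hypothetical tokenized sentence with BIO-style concept tags.
sentence = ["120", "patients", "received", "atorvastatin", "20", "mg", "daily"]
labels   = ["B-GROUPSIZE", "I-GROUPSIZE", "O", "B-TREATMENT", "O", "O", "O"]

X_train = [[token_features(sentence, i) for i in range(len(sentence))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```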
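Finally, several concept-level systems (Hara and Matsumoto [23], Xu et al. [32], Zhu et al. [41]) apply hand-crafted rules to parsed sentences or classified noun phrases. The sketch below only approximates that style using spaCy noun chunks and an invented cue-word rule; it assumes the en_core_web_sm model is installed and is not the authors' method.

```python
# Minimal sketch of rule-based extraction of candidate population noun
# phrases from a parsed sentence (illustrative cue words and rule only).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We enrolled 120 adult patients with type 2 diabetes in this trial.")

POPULATION_CUES = {"patient", "participant", "subject", "adult", "woman", "man"}

# Rule: keep noun phrases whose head noun lemma is a population cue word.
candidates = [chunk.text for chunk in doc.noun_chunks
              if chunk.root.lemma_.lower() in POPULATION_CUES]
print(candidates)  # e.g. ['120 adult patients']
```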