From: Automating data extraction in systematic reviews: a systematic review
Study | Extracted elements | Dataset | Method | Sentence/Concept/Neither | Full text/Abstract | Results |
---|---|---|---|---|---|---|
Dawes et al. (2007) [12] | PECODR | 20 evidence-based medicine journal synopses (759 extracts from the corresponding PubMed abstracts) | Proposed potential lexical patterns and assessed them using NVivo software | Neither | Abstract | Agreement among the annotators was 86.6 and 85 %, rising to 98.4 and 96.9 % after consensus. No automated system. |
Kim et al. (2011) [13] | PIBOSO | 1000 medical abstracts (PIBOSO corpus) | Conditional random fields with various features based on lexical, semantic, structural and sequential information | Sentence | Abstract | Micro-averaged F-scores of 80.9 % on structured and 66.9 % on unstructured abstracts; 63.1 % on an external dataset |
Boudin et al. (2010) [16] | PICO (I and C combined) | 26,000 abstracts from PubMed, first sentences from the structured abstracts | Combination of multiple supervised classification algorithms: random forests (RF), naive Bayes (NB), support vector machines (SVM), and multi-layer perceptron (MLP) | Sentence | Abstract | F-score of 86.3 % for P, 67 % for I (and C), and 56.3 % for O |
Huang et al. (2011) [17] | PICO (except C) | 23,472 sentences from structured abstracts | Naïve Bayes (NB) | Sentence | Abstract | F-measure of 0.91 for patient/problem, 0.75 for intervention, and 0.88 for outcome |
Verbeke et al. (2012) [18] | PIBOSO | PIBOSO corpus | Statistical relational learning with kernels, kLog | Sentence | Abstract | Micro-averaged F of 84.29 % on structured abstracts and 67.14 % on unstructured abstracts |
Huang et al. (2013) [19] | PICO (except C) | 19,854 structured abstracts of randomized controlled trials | NB classifier trained on either the first sentence of each section or all sentences in the section | Sentence | Abstract | First sentence of the section: F-scores for P: 0.74, I: 0.66, O: 0.73; all sentences in the section: F-scores for P: 0.73, I: 0.73, O: 0.74 |
Hassanzadeh et al. (2014) [20] | PIBOSO (Population-Intervention-Background-Outcome-Study Design-Other) | PIBOSO corpus, 1000 structured and unstructured abstracts | CRF with a discriminative set of features | Sentence | Abstract | Micro-averaged F-score: 91 % |
Robinson (2012) [21] | Patient-oriented evidence: morbidity, mortality, symptom severity, quality of life | 1356 PubMed abstracts | SVM, NB, multinomial NB, logistic regression | Sentence | Abstract | Best results achieved via SVM: F-measure of 0.86 |
Chung (2009) [22] | Intervention, comparisons | 203 RCT abstracts for training and 124 for testing | Coordinating constructs are identified using a full parser and then classified as positive or not using CRF | Sentence | Abstract | F-score: 0.76 |
Hara and Matsumoto (2007) [23] | Patient population, comparison | 200 abstracts labeled as ‘Neoplasms’ and ‘Clinical Trial, Phase III’ | Categorizing noun phrases (NPs) into classes such as ‘Disease’ and ‘Treatment’ using CRF, then applying regular expressions to sentences with the classified NPs | Sentence | Abstract | F-measure of 0.91 for the noun phrase classification task. Sentence classification: F-measure of 0.8 for patient population and 0.81 for comparisons |
Davis-Desmond and Molla (2012) [42] | Detecting statistical evidence | 194 randomized controlled trial abstracts from PubMed | Rule-based classifier using negation expressions | Sentence | Abstract | Accuracy between 88 and 98 % (95 % CI) |
Zhao et al. (2012) [24] | Patient, result, intervention, study design, research goal | 19,893 medical abstracts and full-text articles from 17 journal websites | Conditional random fields | Sentence | Full text | F-scores for sentence classification: patient: 0.75, intervention: 0.61, result: 0.91, study design: 0.79, research goal: 0.76 |
Hsu et al. (2012) [25] | Hypothesis, statistical method, outcomes and generalizability | 42 full-text papers | Regular expressions | Sentence | Full text | For the classification task, F-score of 0.86 for hypothesis, 0.84 for statistical method, 0.9 for outcomes, and 0.59 for generalizability |
Song et al. (2013) [26] | Analysis (statistical facts), general (generally accepted facts), recommend (recommendations about interventions), rule (guidelines) | 346 sentences from three clinical guideline documents | Maximum entropy (MaxEnt), SVM, MLP, radial basis function network (RBFN), NB as classifiers and information gain (IG), genetic algorithm (GA) for feature selection | Sentence | Full text | F-score of 0.98 for classifying sentences |
Demner-Fushman and Lin (2007) [28] | PICO (I and C combined) | 275 manually annotated abstracts | Rule-based approach to identify sentences containing PICO and supervised classifier for outcomes | Concept | Abstract | Precision of 0.80 for population, 0.86 for problem, 0.80 for intervention, 0.64–0.95 for outcome |
Kelly and Yang (2013) [29] | Age of subjects, duration of study, ethnicity of subjects, gender of subjects, health status of subjects, number of subjects | 386 abstracts from PubMed obtained with the query ‘soy and cancer’ | Regular expressions, gazetteer | Concept | Abstract | F-scores for age of subjects: 1.0, duration of study: 0.911, ethnicity of subjects: 0.949, gender of subjects: 1.0, health status of subjects: 0.874, number of subjects: 0.963 |
Hansen et al. (2008) [30] | Number of trial participants | 233 abstracts from PubMed | Support vector machines | Concept | Abstract | F-measure: 0.86 |
Xu et al. (2007) [32] | Subject demographics such as subject descriptors, number of participants and diseases/symptoms and their descriptors | 250 randomized controlled trial abstracts | Text classification augmented with hidden Markov models was used to identify sentences; rules over the parse tree to extract relevant information | Sentence, concept | Abstract | Precision for subject descriptors: 83 %, number of trial participants: 92.3 %, diseases/symptoms: 51.0 %, descriptors of diseases/symptoms: 92.0 % |
Summerscales et al. (2009) [34] | Treatments, groups and outcomes | 100 abstracts from BMJ | Conditional random fields | Concept | Abstract | F-scores for treatments: 0.49, groups: 0.82, outcomes: 0.54 |
Summerscales et al. (2011) [35] | Groups, outcomes, group sizes, outcome numbers | 263 abstracts from BMJ between 2005 and 2009 | CRF, MaxEnt, template filling | Concept | Abstract | F-scores for groups: 0.76, outcomes: 0.42, group sizes: 0.80, outcome numbers: 0.71 |
Kiritchenko et al. (2010) [36] | Eligibility criteria, sample size, drug dosage, primary outcomes | 50 full-text journal articles with 1050 test instances | SVM classifier to recover relevant sentences, extraction rules for correct solutions | Concept | Full text | Top-5 precision (P5) for the classifier: 0.88; precision and recall of the extraction rules: 93 and 91 %, respectively |
Lin et al. (2010) [39] | Intervention, age group of the patients, geographical area, number of patients, time duration of the study | 93 open-access full-text articles documenting oncological and cardiovascular studies from 2005 to 2008 | Linear-chain conditional random fields | Concept | Full text | Precision of 0.4 for intervention, 0.63 for age group, 0.44 for geographical area, 0.43 for number of patients and 0.83 for time period |
Restificar et al. (2012) [37] | Eligibility criteria | 44,203 full-text articles with clinical trials | Latent Dirichlet allocation along with logistic regression | Concept | Full text | Similarity-based accuracy of 75 and 70 % for inclusion and exclusion criteria, respectively |
De Bruijn et al. (2008) [40] | Eligibility criteria, sample size, treatment duration, intervention, primary and secondary outcomes | 88 full-text randomized controlled trial articles from five medical journals | SVM classifier to identify the most promising sentences; manually crafted weak extraction rules for the information elements | Sentence, concept | Full text | Precision for eligibility criteria: 0.69, sample size: 0.62, treatment duration: 0.94, intervention: 0.67, primary outcome: 1.00, secondary outcome: 0.67 |
Zhu et al. (2012) [41] | Subject demographics: patient age, gender, disease and ethnicity | 50 full-text randomized controlled trial articles | Manually crafted rules for extraction from the parse tree | Concept | Full text | Disease extraction: F-score of 0.64 for exact matching and 0.85 for partial matching |
Marshall et al. (2014) [27] | Risk of bias concerning sequence generation, allocation concealment and blinding | 2200 clinical trial reports | Soft-margin SVM for a joint model of risk-of-bias prediction and supporting-sentence extraction | Sentence | Full text | For sentence identification: F-scores of 0.56, 0.48, 0.35 and 0.38 for random sequence generation, allocation concealment, blinding of participants and personnel, and blinding of outcome assessment |
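Several of the studies above (e.g. Boudin et al. [16], Huang et al. [17, 19]) treat data extraction as sentence classification with a naive Bayes model over bag-of-words features. The sketch below is a minimal, self-contained illustration of that approach only; the class name, toy labels, and training sentences are invented for the example, and real systems add the lexical, structural, and sequential features (or CRFs) described in the table.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentenceClassifier:
    """Multinomial naive Bayes over bag-of-words features, a toy
    version of the PICO sentence classifiers surveyed above."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # Laplace smoothing
        self.word_counts = defaultdict(Counter)    # label -> token counts
        self.label_counts = Counter()              # label -> sentence count
        self.vocab = set()

    def fit(self, sentences, labels):
        for sent, label in zip(sentences, labels):
            tokens = sent.lower().split()
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, sentence):
        tokens = sentence.lower().split()
        total = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.label_counts:
            # log prior + sum of smoothed log likelihoods
            lp = math.log(self.label_counts[label] / total)
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for tok in tokens:
                lp += math.log((self.word_counts[label][tok] + self.alpha)
                               / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best


# Hypothetical toy training data labeled P/I/O, for illustration only.
clf = NaiveBayesSentenceClassifier()
clf.fit(
    ["patients aged 65 years with diabetes were recruited",
     "participants received daily aspirin or placebo",
     "the primary outcome was all-cause mortality"],
    ["P", "I", "O"],
)
print(clf.predict("elderly patients with diabetes"))  # → P
```

Sentence-level F-scores such as those reported by Huang et al. [17] come from running exactly this kind of classifier over every abstract sentence and comparing the predicted element against human annotations.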