Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
Systematic Reviews volume 8, Article number: 317 (2019)
The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for class imbalance to identify the best-performing strategy for screening articles in PubMed for inclusion in systematic reviews.
We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance: random undersampling and oversampling, each with 50:50 and 35:65 positive-to-negative class ratios, with no balancing as a benchmark. We used textual data of 14 systematic reviews as case studies. The difference between the cross-validated area under the receiver operating characteristic curve (AUC-ROC) for machine learning techniques with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy.
Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was predominantly above 90%. Except for k-nearest neighbor, machine learning techniques achieved the largest improvement in conjunction with random oversampling 50:50 and random undersampling 35:65.
Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews (SRs). The use of text mining (TM) tools and machine learning techniques (MLT) to aid citation screening is becoming an increasingly popular approach to reduce the human burden and increase the efficiency of completing SRs [1,2,3,4,5,6].
With its 28 million citations, PubMed is the most prominent free online source for biomedical literature; it is continuously updated and organized in a hierarchical structure that facilitates article identification. When searching PubMed with keyword queries, researchers usually retrieve a small number of papers relevant to the review question and a much larger number of irrelevant papers. In such a situation of imbalance, the most common machine learning classifiers, used to differentiate relevant and irrelevant texts without human assistance, are biased towards the majority class and perform poorly on the minority one [8, 9]. Three main sets of approaches can be applied to deal with imbalance. The first is the data pre-processing approach: either majority class samples are removed (i.e., undersampling techniques) or minority class samples are added (i.e., oversampling techniques) to make the data more balanced before an MLT is applied [8, 10]. The second set comprises algorithmic approaches, which rely on cost-sensitive classification, i.e., they penalize the misclassification of minority class cases, with the aim of balancing the weight of false positive and false negative errors on the overall accuracy. The third set comprises ensemble methods, which apply both resampling techniques and penalties for misclassification of minority class cases to boosting and bagging classifiers [12, 13].
This study examines to what extent class imbalance challenges the performance of four traditional MLTs for automatic binary text classification (i.e., relevant vs irrelevant to a review question) of PubMed abstracts. Moreover, the study investigates whether the considered balancing techniques can be recommended to increase MLT accuracy in the presence of class imbalance.
We considered the 14 SRs used and described in . The training datasets contain the positive and negative citations retrieved from the PubMed database, where positives were the relevant papers finally included in each SR. To retrieve positive citations, for each SR, we ran the original search strings using identical keywords and filters. We selected negative citations from the set of articles of the Clinical Trial type (according to the PubMed filter) by adding the Boolean operator NOT to the original search string (see Fig. 1). The whole set of these negative citations was then sampled to retain a minimum ratio of 1:20 (positives to negatives).
Further details on search strings and records retrieved in PubMed can be found in the supplementary material in . The search date was 18 July 2017. For each document (n = 7,494), information about the first author, year, title, and abstract was collected and included in the final dataset.
We applied the following text pre-processing procedures to the title and abstract of each retrieved citation: each word was converted to lowercase, non-words were removed, stemming was applied, whitespace was stripped, and bi-grams were built and treated as single tokens, like single words. The whole collection of tokens was finally used to build 14 document-term matrices (DTMs), one for each SR. The DTMs were initially filled with term frequency (TF) weights, i.e., the raw count of each token in each document. The sparsity (i.e., the proportion of zero entries in the matrix) of the DTMs was always about 99% (see Table 1). Term frequency-inverse document frequency (TF-IDF) weights were used both to reduce the dimensionality of the DTMs, by retaining the tokens ranked in the top 4%, and as features for the classifiers. The TF-IDF weights were applied to the DTMs during each cross-validation (CV) step, according to the same process described in .
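As an illustration of the TF and TF-IDF weighting described above, the following Python sketch builds the weights for a toy corpus. It is not the code used in the study: stemming is omitted, the idf variant log(N/df) and the ranking of tokens by their maximum TF-IDF weight are assumptions (the text does not specify them), and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, keep alphabetic words only, and append bi-grams as single tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def term_frequencies(docs):
    """One Counter of raw token counts (TF weights) per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def sparsity(tf_rows):
    """Proportion of zero entries in the document-term matrix."""
    vocab = set().union(*tf_rows)
    nonzero = sum(len(row) for row in tf_rows)
    return 1 - nonzero / (len(tf_rows) * len(vocab))


def tf_idf(tf_rows):
    """TF-IDF weights with idf = log(N / df); assumed variant, others exist."""
    n_docs = len(tf_rows)
    df = Counter(token for row in tf_rows for token in row)
    return [
        {tok: count * math.log(n_docs / df[tok]) for tok, count in row.items()}
        for row in tf_rows
    ]


def top_tokens(tfidf_rows, fraction=0.04):
    """Keep the top `fraction` of tokens, ranked by maximum TF-IDF weight (assumed rule)."""
    best = Counter()
    for row in tfidf_rows:
        for tok, weight in row.items():
            best[tok] = max(best[tok], weight)
    keep = max(1, math.ceil(fraction * len(best)))
    return [tok for tok, _ in best.most_common(keep)]


# Toy corpus, invented for illustration only.
docs = [
    "Statin therapy reduces cardiac events.",
    "Statin therapy versus placebo.",
    "Placebo response in asthma trials.",
]
tf_rows = term_frequencies(docs)
weights = tf_idf(tf_rows)
selected = top_tokens(weights)
```

In a cross-validated setting, the document frequencies would be computed on the training folds only, in line with the per-fold application of the weights described above.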
We selected four classifiers commonly used in TM: support vector machines (SVMs), k-nearest neighbor (k-NN), random forests (RFs), and elastic-net regularized generalized linear models (GLMNet). SVM and k-NN are among the most widely used MLTs in text classification, with low computational complexity. Although computationally slower, RFs have also proved effective in textual data classification. We selected GLMNets as benchmark linear model classifiers.
Dealing with class imbalance
Random oversampling (ROS) and random undersampling (RUS) techniques were implemented to tackle the issue of class imbalance. RUS randomly removes majority samples from the training dataset until the desired minority-to-majority class ratio is reached. Since it reduces the size of the training dataset, it also reduces the overall computational time, but there is no control over the information being removed from the dataset. ROS randomly adds positive samples, i.e., the ones in the minority class, to the dataset with replacement until the desired minority-to-majority class ratio is reached.
We included two different ratios for the balancing techniques: 50:50 and 35:65 (the minority to the majority). The standard ratio considered is the 50:50. On the other hand, we also examined the 35:65 ratio as suggested in .
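A minimal Python sketch of the two resampling techniques and the two ratios may help fix ideas. The function names and the `minority_share` parameter are hypothetical, and rounding the target class size to the nearest integer is an assumption; this is not the implementation used in the study.

```python
import random


def random_undersample(pos, neg, minority_share=0.5, seed=0):
    """RUS: keep a random subset of the majority (negative) class so that
    positives make up roughly `minority_share` of the result."""
    rng = random.Random(seed)
    n_neg = round(len(pos) * (1 - minority_share) / minority_share)
    return pos, rng.sample(neg, min(n_neg, len(neg)))


def random_oversample(pos, neg, minority_share=0.5, seed=0):
    """ROS: add random copies of minority (positive) samples, drawn with
    replacement, until positives make up roughly `minority_share`."""
    rng = random.Random(seed)
    n_pos = round(len(neg) * minority_share / (1 - minority_share))
    extra = [rng.choice(pos) for _ in range(max(0, n_pos - len(pos)))]
    return pos + extra, neg


# Toy identifiers: 10 positives and 200 negatives, i.e., a 1:20 ratio.
pos = list(range(10))
neg = list(range(10, 210))

rus_pos, rus_neg = random_undersample(pos, neg, minority_share=0.35)
ros_pos, ros_neg = random_oversample(pos, neg, minority_share=0.35)
```

With 10 positives and 200 negatives, RUS-35:65 keeps 19 negatives, while ROS-35:65 grows the positive class to 108 samples; the difference in resulting sample size is what drives the computational argument in favour of RUS.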
The 20 modeling strategies resulting from any combination of MLTs (SVM, k-NN, RF, GLMNet), balancing techniques (RUS, ROS), and balancing ratios (50:50, 35:65) plus the ones resulting from the application of MLTs without any balancing technique were applied to the SRs reported in .
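The count of 20 strategies follows from 4 × 2 × 2 = 16 balanced combinations plus 4 unbalanced ones, which can be checked with a short Python enumeration (the label format is only illustrative):

```python
from itertools import product

classifiers = ["SVM", "k-NN", "RF", "GLMNet"]
techniques = ["RUS", "ROS"]
ratios = ["50:50", "35:65"]

# Every classifier paired with every balancing technique and ratio.
balanced = [
    (clf, f"{tech}-{ratio}")
    for clf, tech, ratio in product(classifiers, techniques, ratios)
]
# Every classifier applied without any balancing technique.
unbalanced = [(clf, "none") for clf in classifiers]

strategies = balanced + unbalanced
print(len(balanced), len(unbalanced), len(strategies))  # prints "16 4 20"
```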
Fivefold CV was performed to train the classifier. The area under the receiver operating characteristic curve (AUC-ROC) was calculated for each of ten random combinations of the tunable parameters of the MLTs. The parameters considered were the number of variables randomly sampled as candidates at each split for RF, the cost (C) of constraints violation for SVM, the regularization parameter (lambda) and the mixing parameter (alpha) for GLMNet, and the neighborhood size (k) for k-NN. The parameters with the best cross-validated AUC-ROC were finally selected.
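The selection step can be sketched in Python as follows. The AUC-ROC is computed here through its rank interpretation (the probability that a random positive is scored above a random negative, with ties counted as one half); the candidate values of k and the per-fold AUCs are invented for illustration, and `select_best` is a hypothetical helper, not part of the study's code.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best(cv_aucs):
    """Return the candidate setting with the highest mean AUC across folds."""
    means = {params: sum(aucs) / len(aucs) for params, aucs in cv_aucs.items()}
    best = max(means, key=means.get)
    return best, means[best]


# One validation fold: two relevant and three irrelevant abstracts.
fold_auc = auc_roc([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.3, 0.1])

# Invented per-fold AUCs for three candidate neighborhood sizes of k-NN
# (the study drew ten random candidates per classifier).
cv_aucs = {
    3: [0.90, 0.92, 0.91, 0.89, 0.93],
    7: [0.94, 0.93, 0.95, 0.92, 0.96],
    15: [0.88, 0.90, 0.87, 0.91, 0.89],
}
best_k, best_mean = select_best(cv_aucs)
```

In the toy fold, five of the six positive-negative pairs are ordered correctly, so the AUC-ROC is 5/6.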
RUS and ROS techniques were applied to the training dataset only. The validation dataset was held out before the text preprocessing and balancing techniques were applied, to avoid possible bias in the validation. The whole process is represented in Fig. 2.
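The order of operations matters here: the validation fold is set aside first, and only the remaining training folds are balanced. A Python sketch of this idea, using RUS as the balancing step, is given below. Stratified fold assignment and all function names are assumptions made for illustration; in the study, the TF-IDF weighting was likewise applied within each CV step.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign document indices to k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for rank, i in enumerate(idx):
            folds[rank % k].append(i)
    return folds


def undersample(train_idx, labels, minority_share, rng):
    """RUS restricted to the training indices."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    n_neg = round(len(pos) * (1 - minority_share) / minority_share)
    return pos + rng.sample(neg, min(n_neg, len(neg)))


def cv_splits(labels, k=5, minority_share=0.35, seed=0):
    """Yield (balanced training indices, untouched validation indices)."""
    rng = random.Random(seed)
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        # The validation fold is removed BEFORE any balancing takes place.
        train = [i for i in range(len(labels)) if i not in held]
        yield undersample(train, labels, minority_share, rng), held_out


# Toy labels: 10 relevant and 200 irrelevant citations.
labels = [1] * 10 + [0] * 200
splits = list(cv_splits(labels))
```

Each validation fold keeps the original class imbalance (2 positives and 40 negatives in this toy case), so the AUC-ROC is estimated on data resembling what the classifier would meet in practice.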
To compare the results, separately for each MLT, we computed the within-SR difference between the cross-validated AUC-ROC values resulting from the application of the four balancing techniques (i.e., RUS and ROS, each with the 50:50 and 35:65 balancing ratios) and the AUC-ROC resulting from the crude application of the MLT (i.e., the “none” strategy for managing the unbalanced data). For all those delta AUCs, we computed 95% confidence intervals, estimated from the observed CV standard deviations and sample sizes. Next, we pooled the results by MLT using meta-analytic fixed-effect models. To evaluate the results, 16 forest plots were arranged in a grid, with MLTs by rows and balancing techniques by columns, in Fig. 3.
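For reference, the standard inverse-variance fixed-effect estimator can be sketched in Python as below. How the standard errors were derived from the CV standard deviations and sample sizes is not detailed in the text, so the normal-approximation interval (SE = SD / sqrt(n)) is an assumption, and the numbers in the example are invented.

```python
import math


def delta_auc_ci(delta, sd, n, z=1.96):
    """Normal-approximation 95% CI for one delta AUC (assumed SE = SD / sqrt(n))."""
    se = sd / math.sqrt(n)
    return delta - z * se, delta + z * se, se


def fixed_effect_pool(deltas, ses, z=1.96):
    """Inverse-variance fixed-effect pooled estimate with its 95% CI."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * d for w, d in zip(weights, deltas)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se


# Invented delta AUCs and standard errors for three hypothetical SRs.
deltas = [0.01, 0.03, 0.02]
ses = [0.01, 0.01, 0.02]
pooled, low, high = fixed_effect_pool(deltas, ses)

# A pooled improvement is called significant when the CI excludes zero.
significant = low > 0 or high < 0
```

Under the fixed-effect model, more precise within-SR estimates (smaller standard errors) receive proportionally larger weights in the pooled delta AUC.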
Table 2 reports cross-validated AUC-ROC values for each strategy, stratified by SR. In general, all the strategies achieved a very high cross-validated performance. Regarding the methods to handle class imbalance, ROS-50:50 and RUS-35:65 reported the best results. The application of no balancing technique resulted in a high performance only for the k-NN classifiers. Notably, for k-NN, the application of any method for class imbalance dramatically hampers its performance. A gain is observed for GLMNet and RF when coupled with a balancing technique. Conversely, no gain is observed for SVM.
Meta-analytic analyses (see Fig. 3) show a significant improvement of the GLMNet classifier when using any strategy to manage the imbalance (minimum delta AUC of +0.4 with [+0.2, +0.6] 95% CI, reached using ROS-35:65). Regarding the application of strategies in combination with k-NN, all of them drastically and significantly hamper the performance of the classifier in comparison with the use of k-NN alone (maximum delta AUC of −0.38 with [−0.39, −0.36] 95% CI, reached using RUS-50:50). For the RF classifier, the worst performance was reached using ROS-50:50, which is the only case in which RF did not show a significant improvement (delta AUC +0.01 with [−0.01, +0.03] 95% CI); in all the other cases, the improvements were significant. Last, the use of an SVM in combination with strategies to manage the imbalance shows no clear pattern in performance: using RUS-50:50, the performance decreases significantly (delta AUC −0.13 with [−0.15, −0.11] 95% CI); ROS-35:65 does not seem to have any effect (delta AUC 0.00 with [−0.02, +0.02] 95% CI); for both ROS-50:50 and RUS-35:65, the performance improves in the same way (delta AUC +0.01 with [−0.01, +0.03] 95% CI), though not significantly.
The application of MLTs in TM has proven to be a promising way to automate literature searches in online databases [1,2,3,4,5]. Although it is difficult to draw any overall conclusions about the best approaches, it is clear that efficiencies and reductions in workload are potentially achievable.
This study compares different combinations of MLTs and pre-processing approaches to deal with imbalance in text classification as part of the screening stage of an SR. The aim of the proposed approach is to allow researchers to produce comprehensive SRs by extending existing literature searches from PubMed to other repositories such as ClinicalTrials.gov, where documents with a comparable word characterization could be accurately identified by the classifier trained on PubMed, as illustrated in . Thus, for real-world applications, researchers must run the search string on citation databases, select the studies to include in the SR, and add the negation operator to the same search string to retrieve the negative citations. Next, they can use the information retrieved from the selected studies to train an ML classifier to apply to the corpus of trials retrieved from ClinicalTrials.gov.
Regardless of the balancing technique applied, all the MLTs considered in the present work have shown potential for use in literature searches of online databases, with AUC-ROCs across the MLTs (excluding k-NN) predominantly above 90%.
Among the study findings, the resampling pre-processing approach showed a slight improvement in the performance of the MLTs. ROS-50:50 and RUS-35:65 techniques showed the best results in general. Consistent with the literature, the use of k-NN does not seem to require any approach for imbalance. On the other hand, for straightforward computational reasons directly related to the decrease in the sample size of the original dataset, the use of RUS-35:65 may be preferred. Moreover, k-NN showed unstable results when the data had been balanced using any technique. It is also worth noting that k-NN-based algorithms returned an error, with no results, three times out of the 70 applications, while no other combination of MLT and pre-processing method encountered any errors. The problem occurred only in the SR of Kourbeti, which is the one with the highest number of records (75 positives and 1600 negatives), and only in combination with one of the two ROS techniques or when no technique was applied to handle unbalanced data, i.e., when the sample size does not decrease. The issue is known (see, for instance, the discussion in https://github.com/topepo/caret/issues/582) when using the caret R interface to MLT algorithms, and manual tuning of the neighborhood size could be a remedy.
According to the literature, the performance of various MLTs has been found to be sensitive to the application of approaches for imbalanced data [11, 26]. For example, a study that analysed SVMs with different kernels (linear, radial, polynomial, and sigmoid) on a genomics biomedical text corpus using resampling techniques reported that normalized linear and sigmoid kernels and the RUS technique outperformed the other approaches tested. SVM and k-NN were also found to be sensitive to class imbalance in supervised sentiment classification. The addition of cost-sensitive learning and threshold control has been reported to intensify the training process for models such as SVM and artificial neural networks, and it might provide some gains in validation performance that are not confirmed in the test results.
However, the high performance of the MLTs in general, and when no balancing techniques were applied, is not in contrast with the literature. The main reason could be that each classifier already shows good performance without the application of methods to handle unbalanced data, so there is not much scope left for improvement. A possible explanation for such good performance lies in the type of training set and features, where positives and negatives are well separated by design, being based on search strings that perform word comparison on the metadata of the documents. Nevertheless, the observed small relative gain in performance (around 1%) may translate into a significant absolute improvement depending on the intended use of the classifier (e.g., an application on textual repositories with millions of entries).
The study findings suggest that no single strategy outperforms the others clearly enough to be recommended as a convenient standard. However, the combination of SVM and RUS-35:65 may be suggested when the preference is for a fast algorithm with stable results and low computational complexity related to the sample size reduction.
Other approaches to handle unbalanced data could also be investigated, such as the algorithmic or the ensemble ones. Also, we decided to embrace the data-driven philosophy of ML and compare the different methods without any a priori choice or manual tuning of the specific hyper-parameters of each technique, with the final aim of obtaining reliable results that do not depend on the analyst.
Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
Availability of data and materials
Original data are publicly available, and the manuscript contains the description on how to retrieve them. Visit https://github.com/UBESP-DCTV/costumer for further information.
AUC-ROC: Area under receiver operating characteristic curve
GLMNet: Generalized linear model net
IDF: Inverse document frequency
MLT: Machine learning technique
SVM: Support vector machine
Thomas J, Noel-Storr A, Marshall I, et al. Living systematic reviews: 2. Combining human and machine effort. J Clin Epidemiol. 2017;91:31–7.
Khabsa M, Elmagarmid A, Ilyas I, et al. Learning to identify relevant studies for systematic reviews using random forest and external information. Mach Learn. 2016;102:465–82.
Marshall IJ, Noel-Storr A, Kuiper J, et al. Machine learning for identifying randomized controlled trials: an evaluation and practitioner’s guide. Res Synth Methods:0. Epub ahead of print January 2018. https://doi.org/10.1002/jrsm.1287.
Wallace BC, Noel-Storr A, Marshall IJ, et al. Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach. J Am Med Inform Assoc. 2017;24:1165–8.
Miwa M, Thomas J, O’Mara-Eves A, et al. Reducing systematic review workload through certainty-based screening. J Biomed Inform. 2014;51:242–53.
O’Mara-Eves A, Thomas J, McNaught J, et al. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst Rev. 2015;4:5.
Kritz M, Gschwandtner M, Stefanov V, et al. Utilization and perceived problems of online medical resources and search tools among different groups of European physicians. J Med Internet Res; 15, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3713956/ (2013, Accessed 22 Sept 2017).
Wallace BC, Trikalinos TA, Lau J, et al. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics. 2010;11:55.
Longadge R, Dongre S. Class imbalance problem in data mining review. ArXiv Prepr ArXiv13051707, https://arxiv.org/abs/1305.1707 (2013).
Liu AY. The effect of oversampling and undersampling on classifying imbalanced text datasets. Univ Tex Austin, https://pdfs.semanticscholar.org/cade/435c88610820f073a0fb61b73dff8f006760.pdf (2004).
Laza R, Pavón R, Reboiro-Jato M, et al. Evaluating the effect of unbalanced data in biomedical document classification. J Integr Bioinforma. 2011;8:105–17.
Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
Wang S, Yao X. Diversity analysis on imbalanced data sets by using ensemble models. IEEE:324–31.
Lanera C, Minto C, Sharma A, et al. Extending PubMed searches to ClinicalTrials.gov through a machine learning approach for systematic reviews. J Clin Epidemiol. 2018;103:22–30.
Naderalvojoud B, Bozkir AS, Sezer EA. Investigation of term weighting schemes in classification of imbalanced texts. Lisbon: Proceedings of European Conference on Data Mining (ECDM). p. 15–7.
Lessmann S. Solving imbalanced classification problems with support vector machines: IC-AI. p. 214–20.
Tan S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst Appl. 2005;28:667–71.
Jindal R, Malhotra R, Jain A. Techniques for text classification: literature review and current trends. Webology. 2015;12:1.
Shardlow M, Batista-Navarro R, Thompson P, et al. Identification of research hypotheses and new knowledge from scientific literature. BMC Med Inform Decis Mak. 2018;18:46.
Zheng T, Xie W, Xu L, et al. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inf. 2017;97:120–7.
Khoshgoftaar TM, Seiffert C, Van Hulse J, et al. Learning with limited minority class data. In: Machine Learning and Applications, 2007. ICMLA 2007. Sixth International Conference on. IEEE, pp. 348–353.
Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. Springer series in statistics New York, http://statweb.stanford.edu/~tibs/book/preface.ps (2001, accessed 30 Aug 2017).
KNN approach to unbalanced data distributions: a case study involving information extraction | BibSonomy, https://www.bibsonomy.org/bibtex/2cf4d2ac8bdac874b3d4841b4645a5a90/diana (accessed 4 Sept 2018).
Kourbeti IS, Ziakas PD, Mylonakis E. Biologic therapies in rheumatoid arthritis and the risk of opportunistic infections: a meta-analysis. Clin Infect Dis Off Publ Infect Dis Soc Am. 2014;58:1649–57.
Kuhn M, Wing J, Weston S, Williams A, et al. caret: Classification and Regression Training, https://CRAN.R-project.org/package=caret (2017).
Mountassir A, Benbrahim H, Berrada I. An empirical study to address the problem of unbalanced data sets in sentiment classification. In: Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on: IEEE. p. 3298–303.
González RR, Iglesias EL, Diz LB. Applying balancing techniques to classify biomedical documents: an empirical study. Int J Artif Intell. 2012;8:186–201.
Liu S, Forss T. Text classification models for web content filtering and online safety. In: Data Mining Workshop (ICDMW), 2015 IEEE International Conference on: IEEE. p. 961–8.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
The work was performed during an internship of Abhinav Sharma at the Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova, Via Loredan 18, 35131 Padova, Italy.
Cite this article
Lanera, C., Berchialla, P., Sharma, A. et al. Screening PubMed abstracts: is class imbalance always a challenge to machine learning?. Syst Rev 8, 317 (2019). https://doi.org/10.1186/s13643-019-1245-8
- Indexed search engine
- Machine learning
- Text mining
- Unbalanced data, systematic review