

A Novel Imbalanced Classification Method based on Decision Tree and Bagging

Volume 14, Number 6, June 2018, pp. 1140-1148
DOI: 10.23940/ijpe.18.06.p5.11401148

Hongjiao Guan (a), Yingtao Zhang (a), Hengda Cheng (b), and Xianglong Tang (a)

(a) School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
(b) School of Computer Science and Technology, Utah State University, Logan, 84322, USA

(Submitted on March 6, 2018; Revised on April 16, 2018; Accepted on May 21, 2018)


Imbalanced classification is a challenging problem in big data research and applications. Complex data distributions, such as small disjuncts and overlapping classes, make it hard for traditional methods to recognize the minority class and thus lead to low sensitivity. The misclassification costs of the minority class are usually higher than those of the majority class. To deal with imbalanced datasets, typical algorithmic-level methods either introduce cost information or simply rebalance the class distribution without considering the distribution of the minority class. In this paper, we propose an optimization embedded bagging (OEBag) approach that increases sensitivity by learning the complex distributions in the minority class more precisely. While training its base classifiers, OEBag selectively learns the minority examples that are easily misclassified, identifying them by referring to the out-of-bag examples. OEBag is implemented using two specialized under-sampling bagging methods. Nineteen real datasets with diverse levels of classification difficulty are used in the experiments. Experimental results demonstrate that OEBag performs significantly better in sensitivity and achieves strong overall performance in terms of AUC (area under the ROC curve) and G-mean when compared with several state-of-the-art methods.
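To make the terms in the abstract concrete, the following is a minimal sketch of a generic under-sampling bag with its out-of-bag set, together with sensitivity and G-mean. It is not the OEBag procedure, which this abstract does not specify. The function names (`undersampled_bag`, `sensitivity_specificity`, `g_mean`), the choice to draw as many majority examples as there are minority examples, and the convention that label `1` marks the minority class are all assumptions made for illustration.

```python
import math
import random


def undersampled_bag(labels, minority=1, rng=None):
    """Draw one balanced bag (illustrative, not the authors' method).

    Bootstraps the minority class, then samples the same number of
    majority examples with replacement. Returns (in_bag, out_of_bag)
    as lists of indices into `labels`.
    """
    rng = rng or random.Random()
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n = len(minority_idx)
    in_bag = [rng.choice(minority_idx) for _ in range(n)]
    in_bag += [rng.choice(majority_idx) for _ in range(n)]
    chosen = set(in_bag)
    # Out-of-bag examples are those never drawn into this bag.
    out_of_bag = [i for i in range(len(labels)) if i not in chosen]
    return in_bag, out_of_bag


def sensitivity_specificity(y_true, y_pred, minority=1):
    """Return (sensitivity, specificity), treating `minority` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == minority and p == minority)
    fn = sum(1 for t, p in pairs if t == minority and p != minority)
    tn = sum(1 for t, p in pairs if t != minority and p != minority)
    fp = sum(1 for t, p in pairs if t != minority and p == minority)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def g_mean(y_true, y_pred, minority=1):
    """Geometric mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred, minority)
    return math.sqrt(sens * spec)


if __name__ == "__main__":
    labels = [1] * 3 + [0] * 9  # toy data: 3 minority, 9 majority examples
    in_bag, out_of_bag = undersampled_bag(labels, rng=random.Random(0))
    print(len(in_bag), len(out_of_bag))
    print(g_mean([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a bagging ensemble, one base classifier (for example a decision tree) would be trained on each such bag, and the out-of-bag examples give held-out data for checking which minority examples that classifier gets wrong. G-mean rewards a classifier only when it does well on both classes, which is why it is reported alongside sensitivity and AUC.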



