
A Novel Information Theory-Based Ensemble Feature Selection Framework for High-Dimensional Microarray Data

Volume 13, Number 5, September 2017 - Paper 17 - pp. 742-753
DOI: 10.23940/ijpe.17.05.p17.742753

Jie Cai a, Jiawei Luo a,*, Cheng Liang b, Sheng Yang a

a College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, Hunan, China
b School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, Shandong, China

(Submitted on March 8, 2017; Revised on July 1, 2017; Accepted on August 27, 2017)


Ensemble feature selection is an ensemble learning method in which each base classifier is trained on the result of a feature selection process. It is an effective way to deal with high-dimensional, small-sample data such as microarray data, but existing approaches still leave room for more accurate and stable classification performance. In this paper, we present a novel information theory-based diversity measure called the Sum of Minimal Information Distance (SMID), which maximizes both the relevance between feature subsets and the class label and the diversity between feature subsets. Moreover, we propose a novel ensemble feature selection framework that satisfies this criterion. In this framework, features that share more mutual information with the class label and are more diverse from each other are retained. Different feature subsets, obtained by an incremental search method, are used to train base classifiers, and these classifiers are then aggregated into a consensus classifier by majority voting. Compared with three representative feature selection methods and five ensemble learning methods on ten microarray datasets, the experimental results show that the proposed method achieves better classification accuracy than the other methods.
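The pipeline described in the abstract - score features by mutual information with the class label, penalize candidates that are close (in an information-distance sense) to features already used, build several diverse subsets by incremental search, and combine the per-subset classifiers by majority voting - can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact method: the concrete SMID score used here (relevance plus the minimal variation-of-information distance to already-selected features), the discrete-sample entropy estimators, and the lookup-table base classifier are all assumptions for demonstration, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of a discrete sample."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from two discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def information_distance(xs, ys):
    """Variation of information, H(X,Y) - I(X;Y): zero when either feature
    determines the other; larger values mean the two are more diverse."""
    return entropy(list(zip(xs, ys))) - mutual_information(xs, ys)

def select_subsets(columns, labels, n_subsets, k):
    """Greedy incremental search: each subset keeps features that combine
    high relevance I(f; label) with a large *minimal* information distance
    to features already placed in earlier subsets (an SMID-style term)."""
    subsets, used = [], []
    for _ in range(n_subsets):
        chosen = []
        for _ in range(k):
            best, best_score = None, float("-inf")
            for j, col in enumerate(columns):
                if j in chosen:
                    continue
                score = mutual_information(col, labels)
                if used:  # diversity: distance to the nearest used feature
                    score += min(information_distance(col, columns[u])
                                 for u in used)
                if score > best_score:
                    best, best_score = j, score
            chosen.append(best)
            used.append(best)
        subsets.append(chosen)
    return subsets

def train_base(columns, labels, subset):
    """A trivial lookup-table base classifier over the selected features."""
    table, default = {}, Counter(labels).most_common(1)[0][0]
    for i, y in enumerate(labels):
        key = tuple(columns[j][i] for j in subset)
        table.setdefault(key, Counter())[y] += 1
    return lambda row: table.get(tuple(row[j] for j in subset),
                                 Counter({default: 1})).most_common(1)[0][0]

def ensemble_predict(classifiers, row):
    """Aggregate the base classifiers by majority voting."""
    return Counter(clf(row) for clf in classifiers).most_common(1)[0][0]
```

With, say, three subsets of one feature each, the first subset takes the most relevant feature, and later subsets trade relevance against distance to already-used features, so redundant copies of the same signal are avoided. A real implementation would discretize continuous expression values first (cf. reference 12) and use stronger base classifiers than a lookup table.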


References: 30

    1. T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys, "Robust Biomarker Identification for Cancer Diagnosis with Ensemble Feature Selection Methods," Bioinformatics, vol. 26, no. 3, pp. 392-398, 2010.
    2. D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based Learning Algorithms," Machine Learning, vol. 6, no. 1, pp. 37-66, 1991.
    3. H. Ahn, H. Moon, M. J. Fazzari, N. Lim, J. J. Chen, and R. L. Kodell, "Classification by Ensembles from Random Partitions of High-Dimensional Data," Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 6166-6179, 2007.
    4. V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, "An Ensemble of Filters and Classifiers for Microarray Data Classification," Pattern Recognition, vol. 45, no. 1, pp. 531-539, 2012.
    5. V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, "Data Classification Using an Ensemble of Filters," Neurocomputing, vol. 135, no. 135, pp. 13–20, 2014.
    6. L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
    7. N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods," Cambridge University Press, 2000.
    8. K. W. De Bock, K. Coussement, and D. Van den Poel, "Ensemble Classification Based on Generalized Additive Models," Computational Statistics & Data Analysis, vol. 54, no. 6, pp. 1535-1546, 2010.
    9. T. G. Dietterich, "Ensemble Methods in Machine Learning," in International workshop on multiple classifier systems, pp. 1-15, 2000.
    10. F. Fleuret, "Fast Binary Feature Selection with Conditional Mutual Information," Journal of Machine Learning Research, vol. 5, pp. 1531-1555, 2004.
    11. T. K. Ho, "The Random Subspace Method for Constructing Decision Forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.
    12. K. B. Irani, "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning," in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1027, 1993.
    13. A. Krogh and J. Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning," Advances in Neural Information Processing Systems, vol. 7, pp. 231-238, 1995.
    14. L. I. Kuncheva and C. J. Whitaker, "Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy," Machine Learning, vol. 51, no. 2, pp. 181-207, 2003.
    15. C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. de Schaetzen, R. Duque, H. Bersini, and A. Nowe, "A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 9, no. 4, pp. 1106-1119, 2012.
    16. H. Liu, L. Liu, and H. Zhang, "Ensemble Gene Selection for Cancer Classification," Pattern Recognition, vol. 43, no. 8, pp. 2763-2772, 2010.
    17. D. W. Opitz, "Feature Selection for Ensembles," AAAI/IAAI, pp. 379-384, 1999.
    18. H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
    19. Y. Piao, M. Piao, K. Park, and K. H. Ryu, "An Ensemble Correlation-Based Gene Selection Algorithm for Cancer Classification with Gene Expression Data," Bioinformatics, vol. 28, no. 24, pp. 3306-3315, 2012.
    20. J. R. Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers Inc., 1993.
    21. M. Reboiro-Jato, F. Díaz, D. Glez-Peña, and F. Fdez-Riverola, "A Novel Ensemble of Classifiers that Use Biological Relevant Gene Sets for Microarray Classification," Applied Soft Computing, vol. 17, pp. 117-126, 2014.
    22. J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso, "Rotation Forest: A New Classifier Ensemble Method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619-1630, 2006.
    23. Y. Saeys, T. Abeel, and Y. Van de Peer, "Robust Feature Selection Using Ensemble Feature Selection Techniques," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 313-325, September 2008.
    24. N. X. Vinh, J. Epps, and J. Bailey, "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance," Journal of Machine Learning Research, vol. 11, pp. 2837-2854, 2010.
    25. I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," Morgan Kaufmann Publishers, 2000.
    26. H. H. Yang and J. E. Moody, "Data Visualization and Feature Selection: New Algorithms for Nongaussian Data," in NIPS, pp. 687-693, 1999.
    27. L. Yijing, G. Haixiang, L. Xiao, L. Yanan, and L. Jinling, "Adapted Ensemble Classification Algorithm Based on Multiple Classifier System and Feature Selection for Classifying Multi-Class Imbalanced Data," Knowledge-Based Systems, vol. 94, pp. 88-104, 2016.
    28. L. Zhang and P. N. Suganthan, "Random Forests with Ensemble of Feature Spaces," Pattern Recognition, vol. 47, no. 10, pp. 3429-3437, 2014.
    29. Z.H. Zhou, "Ensemble methods: Foundations and Algorithms," CRC press, 2012.
    30. Q. Zou, J. Zeng, L. Cao, and R. Ji, "A Novel Features Ranking Metric with Application to Scalable Visual and Bioinformatics Data Classification," Neurocomputing, vol. 173, pp. 346-354, 2016.


