Similarity based on the Importance of Common Features in Random Forest

Volume 15, Number 4, April 2019, pp. 1171-1180
DOI: 10.23940/ijpe.19.04.p12.11711180

Xiao Chen (a, b), Li Han (a), Meng Leng (a), and Xiao Pan (c)

(a) Network Technology Center, Hebei Normal University of Science and Technology, Qinhuangdao, 066004, China
(b) Qianan College, North China University of Science and Technology, Qianan, 064400, China
(c) College of Economic and Management, Shijiazhuang Tiedao University, Shijiazhuang, 050043, China


(Submitted on November 16, 2018; Revised on December 12, 2018; Accepted on January 6, 2019)


Existing methods for calculating the similarity between samples in a random forest consider only the case where different samples fall on the same leaf node of a decision tree. They neglect both the fact that leaf nodes sit at different positions in the tree and the case where samples fall on different leaf nodes, which reduces the accuracy of the similarity. This paper addresses both gaps. First, to account for the differing positions of leaf nodes, the importance of the sample features associated with each leaf node is used as one attribute describing the similarity. Second, for samples that fall on different leaf nodes, the features the samples have in common are used as another attribute. Together these yield the proposed measure, SICF (similarity between samples based on the importance of common features). Finally, SICF is applied to the K-nearest neighbor classification algorithm, and its validity and correctness are verified with the OOB (out-of-bag) index. Experimental results on UCI data show that SICF achieves better classification results than two classical methods.
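For orientation, the classical same-leaf similarity that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of that baseline, not of SICF, whose formula the abstract does not give. The function names `leaf_proximity` and `knn_predict` and the leaf-index data are hypothetical; in practice the leaf indices might come from a call such as scikit-learn's `RandomForestClassifier.apply`, which returns one leaf index per sample and tree.

```python
from collections import Counter


def leaf_proximity(leaves_a, leaves_b):
    """Fraction of trees in which two samples land on the same leaf.

    Each argument is a sequence with one leaf index per tree.
    """
    if len(leaves_a) != len(leaves_b) or len(leaves_a) == 0:
        raise ValueError("samples need the same, non-zero number of trees")
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def knn_predict(query_leaves, train_leaves, train_labels, k=3):
    """Majority vote among the k training samples most similar to the query."""
    ranked = sorted(
        range(len(train_leaves)),
        key=lambda i: leaf_proximity(query_leaves, train_leaves[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical leaf indices: one row per sample, one column per tree.
train_leaves = [
    [3, 7, 2, 5],
    [3, 7, 2, 6],
    [4, 1, 9, 8],
    [4, 1, 9, 5],
]
train_labels = ["A", "A", "B", "B"]
query = [3, 7, 9, 5]

print(leaf_proximity(query, train_leaves[0]))
print(knn_predict(query, train_leaves, train_labels, k=3))
```

Note that this baseline scores a pair of samples as 0 for any tree in which they reach different leaves, and treats every shared leaf equally regardless of its position in the tree. These are the two limitations the abstract says SICF is designed to address.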


