Similarity Entropy-Based Self-Adaptive String Outlier Detection Method

Volume 13, Number 4, July 2017 - Paper 10 - pp. 427-436
DOI: 10.23940/ijpe.17.04.p10.427436

Ou Ye and Zhanli Li

School of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an, 710054, Shaanxi, China

(Submitted on February 27, 2017; Revised on April 2, 2017; Accepted on May 17, 2017)


Although many outlier detection techniques have been developed, existing algorithms pay little attention to how structure factors affect the semantics of string data, and a threshold is difficult to set automatically when the distribution of the string data is unknown. As a result, the accuracy of string outlier detection is hard to guarantee. This paper presents a similarity entropy-based self-adaptive string outlier detection method to address these issues. First, semantic similarity is calculated by matrix computation based on word matching, and structure similarity is calculated by taking the structure factors into account. On this basis, the string data are mapped into similarity cells, and outliers are identified among these cells using a similarity distance. To reduce sensitivity to the threshold, a similarity entropy histogram is constructed to determine a dynamic threshold. Simulation experiments are conducted to demonstrate the feasibility and rationality of the method, and the results show that it reduces threshold sensitivity while maintaining detection accuracy.
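The general idea of a data-driven threshold over string similarities can be illustrated with a minimal sketch. It is not the method of this paper: the abstract does not give the formulas for semantic similarity, structure similarity, similarity cells, or the similarity entropy histogram. The sketch assumes a plain word-overlap (Jaccard) similarity as a stand-in for the combined similarity, and it substitutes a Kapur-style maximum-entropy histogram split for the paper's similarity entropy histogram. All function names and the `bins` parameter are hypothetical.

```python
import math


def word_jaccard(a, b):
    """Word-overlap similarity in [0, 1] (assumed stand-in, not the paper's measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def mean_similarity(strings):
    """Average similarity of each string to all the other strings."""
    n = len(strings)
    scores = []
    for i in range(n):
        total = sum(word_jaccard(strings[i], strings[j])
                    for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores


def entropy(counts):
    """Shannon entropy (natural log) of a list of histogram counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def entropy_threshold(scores, bins=10):
    """Kapur-style split: pick the bin edge that maximizes the summed
    entropy of the two histogram parts (a substituted technique)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return lo  # all scores equal: nothing to separate
    width = (hi - lo) / bins
    hist = [0] * bins
    for s in scores:
        hist[min(int((s - lo) / width), bins - 1)] += 1
    best_k, best_h = 1, -1.0
    for k in range(1, bins):
        h = entropy(hist[:k]) + entropy(hist[k:])
        if h > best_h:
            best_k, best_h = k, h
    return lo + best_k * width


def detect_outliers(strings, bins=10):
    """Flag strings whose mean similarity falls below the dynamic threshold."""
    if len(strings) < 2:
        return []
    scores = mean_similarity(strings)
    threshold = entropy_threshold(scores, bins)
    return [s for s, sc in zip(strings, scores) if sc < threshold]


if __name__ == "__main__":
    data = ["red apple pie", "red apple tart", "red apple cake",
            "red apple jam", "quantum flux capacitor"]
    print(detect_outliers(data))
```

In this sketch the threshold is derived from the histogram of the scores rather than fixed in advance, which is the sense in which it is self-adaptive. The structure factors described in the abstract are not modeled here.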


References: 13

