# Clustering Algorithm of Ethnic Cultural Resources based on Spark
##### Volume 15, Number 3, March 2019, pp. 756-762 DOI: 10.23940/ijpe.19.03.p4.756762
## Ming Lei^{a,b}, Bin Wen^{a}, Jianhou Gan^{b}, and Jun Wang^{b}
^{a}School of Information Science and Technology, Yunnan Normal University, Kunming, 650500, China
^{b}Key Laboratory of Educational Informatization for Nationalities of Ministry of Education, Yunnan Normal University, Kunming, 650500, China

(Submitted on October 19, 2018; Revised on November 21, 2018; Accepted on December 23, 2018)
## Abstract:
Extracting valuable information from ethnic cultural resources is the key to current data mining research on these resources. The K-means algorithm can process large-scale data sets effectively because its iterative calculations are simple and efficient. However, the uncertainty of the k-value affects the efficiency and accuracy of the algorithm. The particle swarm optimization (PSO) algorithm with global coarse-grained search can quickly determine the k-value of the cluster centers, but its efficiency is low. To address both the initial clustering center problem of the K-means algorithm and the low efficiency of the PSO algorithm, this paper proposes a Spark-based PSO-k-means algorithm. The algorithm first imports ethnic cultural text resources into the Hadoop Distributed File System (HDFS) and segments them into words with Han Language Processing (HanLP). The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm then generates the word frequency vectors. Finally, the PSO algorithm performs initial pre-clustering on the data set to obtain the cluster centers k for the K-means algorithm, and K-means cluster analysis produces the final classification result. The experimental results show that the clustering accuracy and stability of the PSO-k-means algorithm are better than those of the existing serial, stand-alone K-means algorithm.
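The pipeline described above (segmented text, TF-IDF vectors, PSO pre-clustering, K-means refinement) can be illustrated with a minimal single-machine sketch. It is not the paper's Spark implementation: it uses only the Python standard library, assumes word segmentation has already been done, and fixes the number of clusters in advance so that PSO only chooses the initial centers. The function names, the toy documents, the smoothed IDF formula, the sum-of-squared-distances fitness, and the PSO parameters (`w`, `c1`, `c2`, particle count) are all illustrative assumptions.

```python
import math
import random
from collections import Counter

def tf_idf(docs):
    """Turn pre-segmented documents (lists of tokens) into dense TF-IDF vectors."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    # Smoothed IDF (an assumption; the paper may use another variant).
    return [[Counter(d)[t] / len(d) * math.log((1 + n) / (1 + df[t])) for t in vocab]
            for d in docs]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def pso_seed(points, k, particles=10, iters=30, seed=0):
    """PSO pre-clustering: each particle is a candidate set of k centers."""
    rng = random.Random(seed)
    dim = len(points[0])
    pos = [[list(rng.choice(points)) for _ in range(k)] for _ in range(particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    best = [[c[:] for c in p] for p in pos]          # personal best positions
    best_fit = [sse(points, p) for p in pos]
    g = min(range(particles), key=best_fit.__getitem__)
    gbest, gfit = [c[:] for c in best[g]], best_fit[g]  # global best
    w, c1, c2 = 0.7, 1.5, 1.5                        # assumed PSO coefficients
    for _ in range(iters):
        for i in range(particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (best[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = sse(points, pos[i])
            if fit < best_fit[i]:
                best[i], best_fit[i] = [c[:] for c in pos[i]], fit
                if fit < gfit:
                    gbest, gfit = [c[:] for c in pos[i]], fit
    return gbest

def assign(points, centers):
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def k_means(points, centers, iters=20):
    """Lloyd's K-means starting from the given (PSO-selected) centers."""
    centers = [c[:] for c in centers]
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers), centers

# Hypothetical, already-segmented toy documents (not data from the paper).
docs = [
    ["dai", "dance", "festival", "water"],
    ["dai", "festival", "water", "music"],
    ["yi", "torch", "festival", "fire"],
    ["yi", "torch", "fire", "song"],
]
vectors = tf_idf(docs)
seed_centers = pso_seed(vectors, k=2)
labels, centers = k_means(vectors, seed_centers)
print(labels)
```

Because each Lloyd iteration cannot increase the sum of squared distances, the K-means stage can only keep or improve the fitness of the centers PSO hands over. In a Spark setting, the per-document TF-IDF and distance computations are the parts that would be distributed across the cluster.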
