# Clustering Algorithm of Ethnic Cultural Resources based on Spark
##### Volume 15, Number 3, March 2019, pp. 756-762 DOI: 10.23940/ijpe.19.03.p4.756762
## Ming Lei^{a,b}, Bin Wen^{a}, Jianhou Gan^{b}, and Jun Wang^{b}
^{a}School of Information Science and Technology, Yunnan Normal University, Kunming, 650500, China
^{b}Key Laboratory of Educational Informatization for Nationalities of Ministry of Education, Yunnan Normal University, Kunming, 650500, China

(Submitted on October 19, 2018; Revised on November 21, 2018; Accepted on December 23, 2018)
## Abstract:
Extracting valuable information from ethnic cultural resources is central to current data mining research on these resources. The K-means algorithm can process large-scale data sets effectively because its iterative calculations are simple and efficient, but uncertainty in the k-value affects its efficiency and accuracy. The particle swarm optimization (PSO) algorithm, with its global coarse-grained search, can quickly determine the k cluster centers, but its search efficiency is low. To address both the initial-cluster-center problem of K-means and the low efficiency of PSO, this paper proposes a Spark-based PSO-k-means algorithm. Ethnic cultural text resources are first imported into the Hadoop Distributed File System (HDFS) and segmented into words with Han Language Processing (HanLP); the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm then generates the word frequency vectors. Finally, PSO performs an initial pre-clustering of the data set to obtain the cluster centers for K-means, and K-means cluster analysis produces the final classification result. The experimental results show that the clustering accuracy and stability of the PSO-k-means algorithm are better than those of the existing serial, stand-alone K-means algorithm.
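The pipeline described in the abstract (TF-IDF vectors, PSO pre-clustering to obtain cluster centers, then K-means refinement) can be illustrated with a small sketch. This is not the authors' implementation: it runs in a single Python process rather than on Spark and HDFS, it uses a hypothetical pre-tokenized toy corpus in place of HanLP segmentation, and the smoothed IDF formula, the PSO parameters (`n_particles`, `iters`, `w`, `c1`, `c2`), and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Turn token lists into dense TF-IDF vectors (smoothed IDF is an assumption)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] / len(d) * idf[t] for t in vocab])
    return vocab, vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(centers, points):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def pso_centers(points, k, n_particles=10, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Coarse global search: each particle encodes k candidate centers."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialise every particle from k randomly chosen data points.
    pos = [[list(p) for p in rng.sample(points, k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[list(c) for c in part] for part in pos]
    pbest_fit = [sse(part, points) for part in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = [list(c) for c in pbest[g]], pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Standard velocity update: inertia + personal pull + global pull.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = sse(pos[i], points)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = [list(c) for c in pos[i]], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = [list(c) for c in pos[i]], fit
    return gbest

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, centers, iters=20):
    """Lloyd's iterations starting from the given (PSO-derived) centers."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers), centers

# Hypothetical toy corpus, already tokenized (stands in for HanLP output).
docs = [
    "torch festival yi dance".split(),
    "yi torch festival song".split(),
    "festival torch dance song".split(),
    "batik cloth bai dye".split(),
    "bai batik dye pattern".split(),
    "cloth dye pattern batik".split(),
]
vocab, vectors = tfidf(docs)
init = pso_centers(vectors, k=2)        # PSO pre-clustering
labels, centers = kmeans(vectors, init)  # K-means refinement
print(labels)
```

In this sketch the PSO fitness is the within-cluster sum of squared errors, so the K-means stage starts from the best center set the swarm found and can only lower that error further. In a Spark setting the per-point distance computations would be the natural part to distribute, but how the paper partitions the work is not stated in the abstract.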