Performance Improvements by Deploying L2 Prefetchers with Helper Thread for Pointer-Chasing Applications

doi:10.23940/ijpe.18.10.p7.23122320

Abstract

Modern processor microarchitectures offer advanced prefetch mechanisms designed to hide memory latency and improve application performance. However, pointer-chasing applications that use linked data structures expose memory latency that hardware prefetchers handle poorly. Helper-threaded prefetching on a chip multiprocessor (CMP) is a promising method for reducing the latency of accesses to linked data structures. In this paper, we first illustrate two L2 prefetchers on a CMP and two helper-threaded prefetching techniques for pointer-chasing applications. We then reveal the limitations of the L2 prefetchers for pointer-intensive applications once each of the two threaded prefetching techniques is applied. Finally, we optimize the deployment of the L2 prefetchers with the two threaded prefetching techniques for pointer-chasing applications. The experimental results indicate that the effectiveness of the L2 prefetchers on helper threads depends on the memory access pattern of the target application, and that the optimized deployment of the L2 prefetchers further improves the performance of pointer-intensive applications.
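
To make the idea of helper-threaded prefetching concrete, the following is a minimal sketch in C and not the implementation evaluated in the paper. It assumes a GCC- or Clang-compatible compiler (for `__builtin_prefetch`) and POSIX threads. The names `node_t`, `build_list`, `helper_prefetch`, `sum_with_helper`, and `free_list` are hypothetical and chosen only for illustration. The helper thread walks the same linked list as the main thread and touches each node, so that the corresponding cache lines may already be in a shared cache when the main thread reaches them.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical node type for illustration; not taken from the paper. */
typedef struct node {
    struct node *next;
    long value;
} node_t;

/* Build a singly linked list holding the values 1..n in order. */
static node_t *build_list(long n) {
    node_t *head = NULL;
    for (long i = n; i >= 1; i--) {
        node_t *p = malloc(sizeof *p);
        assert(p != NULL);
        p->value = i;
        p->next = head;
        head = p;
    }
    return head;
}

/* Helper thread body: traverse the list without doing any real work.
 * Loading p->next brings the current node into the cache, and the
 * prefetch hint requests the following node early. */
static void *helper_prefetch(void *arg) {
    for (const node_t *p = arg; p != NULL; p = p->next) {
        __builtin_prefetch(p->next, 0, 1); /* read access, low temporal locality */
    }
    return NULL;
}

/* Main computation: sum all values while a helper thread runs over the
 * same list. The list is only read by both threads, so no locking is needed. */
static long sum_with_helper(node_t *head) {
    pthread_t helper;
    int started = (pthread_create(&helper, NULL, helper_prefetch, head) == 0);

    long sum = 0;
    for (const node_t *p = head; p != NULL; p = p->next) {
        sum += p->value;
    }

    if (started) {
        pthread_join(helper, NULL);
    }
    return sum;
}

/* Release every node of the list. */
static void free_list(node_t *head) {
    while (head != NULL) {
        node_t *next = head->next;
        free(head);
        head = next;
    }
}
```

This sketch omits any synchronization between the two threads, so nothing keeps the helper a useful distance ahead of the main thread. Whether such a helper actually reduces latency depends on the cache hierarchy shared by the cores and on the application's memory access pattern.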

Submitted on July 10, 2018; Revised on August 12, 2018; Accepted on September 11, 2018
References: 17

Yan Huang, Huidong Zhu, and Yuhua Li. Performance Improvements by Deploying L2 Prefetchers with Helper Thread for Pointer-Chasing Applications [J]. Int J Performability Eng, 2018, 14(10): 2312-2320.
