The Herald of the Siberian State University of Telecommunications and Information Science

Shared memory based MPI broadcast algorithm for NUMA systems

Abstract

The broadcast collective communication operation is used by many scientific applications and tends to limit overall parallel application scalability. As the number of cores per compute node keeps increasing, it becomes important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of the Bcast operation for SMP/NUMA nodes. We describe an algorithm for Bcast that takes advantage of NUMA-specific placement of queues in memory for message transfer. On Xeon Nehalem and Xeon Broadwell NUMA nodes, our implementation achieves on average a 20-60% speedup over Open MPI and MVAPICH.

About the Authors

M. Kurnosov
Siberian State University of Telecommunications and Information Science
Russian Federation


E. Tokmasheva
Siberian State University of Telecommunications and Information Science
Russian Federation




For citations:


Kurnosov M., Tokmasheva E. Shared memory based MPI broadcast algorithm for NUMA systems. The Herald of the Siberian State University of Telecommunications and Information Science. 2020;(1):42-59. (In Russ.)


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6920 (Print)