The Herald of the Siberian State University of Telecommunications and Information Science

Shared memory based MPI broadcast algorithm for NUMA systems

Abstract

The broadcast collective communication operation is used by many scientific applications and tends to limit overall parallel application scalability. As the number of cores per compute node keeps increasing, it becomes important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of the Bcast operation for SMP/NUMA nodes. We describe an algorithm for Bcast that takes advantage of NUMA-specific placement of queues in memory for message transfer. On Xeon Nehalem and Xeon Broadwell NUMA nodes, our implementation achieves on average a 20-60% speedup over Open MPI and MVAPICH.

About the Authors

M. Kurnosov
Siberian State University of Telecommunications and Information Science
Russian Federation


E. Tokmasheva
Siberian State University of Telecommunications and Information Science
Russian Federation




For citations:


Kurnosov M., Tokmasheva E. Shared memory based MPI broadcast algorithm for NUMA systems. The Herald of the Siberian State University of Telecommunications and Information Science. 2020;(1):42-59. (In Russ.)


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6920 (Print)