
Journal issue: 2020.1

Article title: Shared memory based MPI broadcast algorithm for NUMA systems


The broadcast collective communication operation is used by many scientific applications and tends to limit overall parallel application scalability. As the number of cores per compute node keeps increasing, it becomes important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimization of the Bcast operation for SMP/NUMA nodes. We describe a Bcast algorithm that takes advantage of NUMA-specific placement in memory of the queues used for message transfer. On Xeon Nehalem and Xeon Broadwell NUMA nodes, our implementation achieves on average a 20–60% speedup over Open MPI and MVAPICH.


M. Kurnosov, E. Tokmasheva

Keywords

Bcast, broadcast, MPI, NUMA, collective communication, parallel programming, high-performance computing
