Journal issue: 2018.4

Article title: Analysis of resource management system PBS/TORQUE robustness tools


The paper addresses the robustness of distributed computing systems managed by PBS/TORQUE when running parallel MPI programs. The impact of fault and failure events on the ability to terminate a parallel program is considered. The first part of the paper presents a theoretical analysis of the capabilities of the fault-tolerance tools of the PBS/TORQUE resource management system. The second part presents the results of the analysis based on an experimental study that used the resources of the existing distributed computing system of the SibSUTIS Center of Parallel Computing Technologies and the ISP SB RAS Laboratory of Computing Systems.


A. Efimov, K. Pavsky

Keywords

distributed computing systems, resource management system, PBS/TORQUE, robustness, analysis, fault and failure

