The Herald of the Siberian State University of Telecommunications and Information Science

Analysis of resource management system PBS/TORQUE robustness tools

Abstract

This paper addresses the robustness of distributed computing systems managed by PBS/TORQUE when running parallel MPI programs. It considers how faults and failures affect the ability of a parallel program to complete. The first part presents a theoretical analysis of the capabilities of the fault tolerance tools available in the PBS/TORQUE resource management system. The second part presents the results of an experimental study that tests this analysis on the existing distributed computing system of the SibSUTIS Center of Parallel Computing Technologies and the ISP SB RAS Laboratory of Computing Systems.

About the Authors

A. .. Efimov
SibSUTIS; ISP SB RAS
Russian Federation


K. .. Pavsky
SibSUTIS; ISP SB RAS
Russian Federation


For citations:


Efimov A..., Pavsky K... Analysis of resource management system PBS/TORQUE robustness tools. The Herald of the Siberian State University of Telecommunications and Information Science. 2018;(4):98-106. (In Russ.)

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6920 (Print)