Algorithms of fault-tolerant resources management of geographically distributed computer systems

A. Y. Polyakov; O. V. Moldovanova; A. A. Paznikov; M. G. Kernosov; S. N. Mamoilenko; A. V. Efimov

Abstract

This paper addresses the fault-tolerant operation of geographically distributed computer systems (GDCS) in multitask modes. A model of operation is proposed for a GDCS consisting of many subsystems. Algorithms are developed for fault-tolerant operation of a distributed job queue and for resource management within the GDCS subsystems. Software tools for fault-tolerant execution of parallel programs on GDCS subsystems are designed and implemented. Simulation results for the algorithms on a multicluster GDCS (a GRID model) are presented.

Keywords

distributed computer systems, parallel multiprogramming, MPI, GRID, multicluster systems, dispatching, fault tolerance, self-diagnosis

About the Authors

A. Y. Polyakov

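Siberian State University of Telecommunications and Information Science (SibSUTIS); Institute of Semiconductor Physics, Siberian Branch of the Russian Academy of Sciences (ISP SB RAS)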
Russian Federation

O. V. Moldovanova

Siberian State University of Telecommunications and Information Science (SibSUTIS)
Russian Federation

A. A. Paznikov

Siberian State University of Telecommunications and Information Science (SibSUTIS)
Russian Federation

M. G. Kernosov

Siberian State University of Telecommunications and Information Science (SibSUTIS)
Russian Federation

S. N. Mamoilenko

Siberian State University of Telecommunications and Information Science (SibSUTIS)
Russian Federation

A. V. Efimov

Siberian State University of Telecommunications and Information Science (SibSUTIS)
Russian Federation

For citations:

Polyakov A.Y., Moldovanova O.V., Paznikov A.A., Kernosov M.G., Mamoilenko S.N., Efimov A.V. Algorithms of fault-tolerant resources management of geographically distributed computer systems. The Herald of the Siberian State University of Telecommunications and Information Science. 2014;(4):11-29. (In Russ.)

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 1998-6920 (Print)
