International Journal of High Performance Computing Applications

Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing

Graham E. Fagg

INNOVATIVE COMPUTING LABORATORY, COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF TENNESSEE, KNOXVILLE, TN 37996-3450, USA (FAGG{at}CS.UTK.EDU)

Edgar Gabriel

HIGH PERFORMANCE COMPUTING CENTER STUTTGART, UNIVERSITY OF STUTTGART, D-70550 STUTTGART, GERMANY, AND INNOVATIVE COMPUTING LABORATORY, COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF TENNESSEE, KNOXVILLE, TN 37996-3450, USA

Zizhong Chen

Thara Angskun

George Bosilca

Jelena Pjesivac-Grbovic

Jack J. Dongarra

INNOVATIVE COMPUTING LABORATORY, COMPUTER SCIENCE DEPARTMENT, UNIVERSITY OF TENNESSEE, KNOXVILLE, TN 37996-3450, USA (FAGG{at}CS.UTK.EDU)

As the number of processors in current machines grows, so does the probability of node or link failures. Application-level fault tolerance is therefore becoming an increasingly important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which allows applications to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI based on these semantics, together with benchmark results for various applications. We also detail an example of a fault-tolerant parallel equation solver, along with performance results and the time required to recover from a process failure.

Key Words: Parallel computing • fault tolerance • MPI and message passing

International Journal of High Performance Computing Applications, Vol. 19, No. 4, 465-477 (2005)
DOI: 10.1177/1094342005056137

