Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing
INNOVATIVE COMPUTING LABORATORY, COMPUTER SCIENCE DEPARTMENT UNIVERSITY OF TENNESSEE, KNOXVILLE, TN 37996-3450, USA, (FAGG{at}CS.UTK.EDU)
HIGH PERFORMANCE COMPUTING CENTER STUTTGART, UNIVERSITY OF STUTTGART, D-70550 STUTTGART, GERMANY, AND INNOVATIVE COMPUTING LABORATORY, COMPUTER SCIENCE DEPARTMENT UNIVERSITY OF TENNESSEE, KNOXVILLE, TN 37996-3450, USA
With increasing numbers of processors on current machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an increasingly important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications. These semantics give applications the possibility to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI that uses these semantics, together with benchmark results for various applications. We also detail an example of a fault-tolerant parallel equation solver, performance results, and the time required to recover from a process failure.
Key Words: parallel computing, fault tolerance, MPI, message passing
International Journal of High Performance Computing Applications, Vol. 19, No. 4, 465-477 (2005)