Title: The Performance of Consistent Checkpointing in Distributed Shared Memory Systems
Authors: Gilbert Cabillic, Gilles Muller, Isabelle Puaut
Authors' address: IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France

Abstract: This paper presents the design and implementation of a consistent checkpointing scheme for Distributed Shared Memory (DSM) systems. Our approach integrates checkpoints into the synchronization barriers that already exist in applications, which avoids the need for an additional synchronization mechanism. The main advantage of our checkpointing mechanism is that performance degrades only while a checkpoint is being taken; hence, the programmer can tune the trade-off between the cost of checkpointing and the cost of longer rollbacks by adjusting the interval between two successive checkpoints. The paper compares several implementations of the proposed consistent checkpointing mechanism (incremental, non-blocking, and pre-flushing) on the Intel Paragon multicomputer for several parallel scientific applications. Performance measurements show that careful optimization of the checkpointing protocol can reduce the time overhead of checkpointing from 8% to 0.04% of the application duration for a 6-minute checkpointing interval.

Keywords: Distributed shared memory, Backward error recovery, Consistent checkpointing.

Paper available in PostScript form (57K).