Title: A survey of recoverable distributed shared memory systems
Authors: Christine Morin, Isabelle Puaut
Authors' address: IRISA, Campus de Beaulieu,
35042 Rennes Cedex,
FRANCE

Abstract: Distributed Shared Memory (DSM) systems provide a shared memory abstraction on distributed memory architectures (distributed memory multicomputers, networks of workstations). Such systems ease parallel application programming, since the shared memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSM system increases with the number of sites. Fault tolerance mechanisms must therefore be implemented to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSM (RDSM) systems that provide a checkpointing mechanism to restart parallel computations after a site failure.

Keywords: Distributed systems, Distributed shared memory, Availability, Backward error recovery, Consistent global states.

Paper available in PostScript form (86K).