Highly available DSM architectures
Scalable shared memory architectures have a high probability of failure because of their large number of components. Fault tolerance mechanisms are therefore needed to allow long-running parallel applications to execute on such architectures. As part of the Aleth research activity, I have worked on the design of highly available scalable shared memory multiprocessors and on the design and implementation of efficient fault-tolerant software DSM systems.
We have shown that the class of Cache Only Memory Architectures (COMA)
is well suited to implementing a backward error recovery strategy
that tolerates any single node failure (joint work with Alain
Gefflaut). The proposed solution relies on two COMA features: (1) the data
replication mechanisms provided by COMAs are used to create recovery data
in the node memories, and (2) the absence of a fixed physical location
for data in a COMA simplifies reconfiguring the architecture after
a permanent node failure.
High availability is provided by an extension of the coherence
protocol that manages the multiple copies of data held in different nodes.
Implementing the proposed protocol requires few hardware modifications
to the COMA architectures described in the literature. The extended
coherence protocol has been evaluated by simulation. Performance results
show that our approach is scalable. The cost of the fault tolerance
mechanisms has been evaluated for failure-free executions.
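The backward error recovery idea described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the extended coherence protocol itself: the names Node, Cluster, establish_recovery_point and fail_and_recover are hypothetical, each item is assumed to have a single working copy, and each recovery point is assumed to keep two recovery copies on distinct nodes (the owner and its successor). A real protocol would also have to keep the old recovery data until the new recovery point is complete, which this sketch does not model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node memory: working copies plus recovery copies (hypothetical model)."""
    name: str
    current: dict = field(default_factory=dict)   # item -> working value
    recovery: dict = field(default_factory=dict)  # item -> value at last recovery point


class Cluster:
    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}

    def write(self, node, item, value):
        # Simplification: one working copy per item, held by the last writer.
        for n in self.nodes.values():
            n.current.pop(item, None)
        self.nodes[node].current[item] = value

    def read(self, item):
        for n in self.nodes.values():
            if item in n.current:
                return n.current[item]
        raise KeyError(item)

    def establish_recovery_point(self):
        # Record every working value and which node currently holds it.
        values, owners = {}, {}
        for n in self.nodes.values():
            for item, value in n.current.items():
                values[item] = value
                owners[item] = n.name
        names = sorted(self.nodes)
        for n in self.nodes.values():
            n.recovery.clear()
        # Assumption: two recovery copies, on the owner and on its successor,
        # so that any single node failure leaves one copy intact.
        for item, value in values.items():
            i = names.index(owners[item])
            for target in (names[i], names[(i + 1) % len(names)]):
                self.nodes[target].recovery[item] = value

    def fail_and_recover(self, failed):
        # Permanent failure: the node and all of its memory are lost.
        del self.nodes[failed]
        # Roll back: discard working copies, rebuild them from recovery data.
        for n in self.nodes.values():
            n.current.clear()
        for n in self.nodes.values():
            for item, value in n.recovery.items():
                if not any(item in m.current for m in self.nodes.values()):
                    n.current[item] = value
        # Re-create two recovery copies on the surviving nodes.
        self.establish_recovery_point()


cluster = Cluster(["n0", "n1", "n2", "n3"])
cluster.write("n0", "x", 1)
cluster.write("n1", "y", 2)
cluster.establish_recovery_point()
cluster.write("n0", "x", 99)    # modification made after the recovery point
cluster.fail_and_recover("n0")
print(cluster.read("x"))        # 1: the value saved at the recovery point
```

Because no item is tied to a particular node, recovery only has to re-create working and recovery copies on surviving nodes; nothing needs to be remapped to a fixed home location.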
The solution that we have proposed for COMAs is applicable to other architectures. In fact, a shared virtual memory system offers mechanisms similar to those exploited by the extended coherence protocol in a COMA. Workstation memories are used as caches and data have no fixed physical location. The data replication mechanisms that COMAs implement in hardware for cache lines are implemented in software, at page granularity, in a shared virtual memory system.
The objective in the design of the Icare software DSM was to show that high availability and efficiency are not mutually exclusive. The main costs of a software DSM relate to the resolution of page faults. In a backward error recovery mechanism, the most expensive operation is the periodic establishment of recovery points. Moreover, the permanent failure of a node may lead to an unbalanced load in the system. Icare integrates algorithms that reduce all of these costs. They rely on the following principles.
Data replication, which is inherent to the functioning of a software DSM, is exploited in Icare to decrease the cost of the backward error recovery protocol. Conversely, data replication is necessary to implement fault tolerance. Icare takes advantage of this necessary replication, on the one hand, to increase system efficiency in normal operation and, on the other hand, to limit the disturbance observed when the system is restarted after a permanent failure.
When a recovery point is established, recovery data is preferably saved on a node that has a high probability of referencing the corresponding data in the near future. This mechanism makes it possible to anticipate page faults. Finally, so that system efficiency does not degrade after the successive loss of one or several nodes, Icare integrates mechanisms that, on the one hand, distribute the load of the faulty node over the set of remaining safe nodes and, on the other hand, uniformly distribute the shared memory management load that was previously handled by the faulty node.
A first version of Icare has been implemented on an Intel Paragon machine in order to evaluate the scalability of the extended coherence protocol in the context of a software DSM. A complete prototype of Icare has then been implemented on Astrolab, a platform for experimenting with distributed systems, made up of several workstations (PCs) interconnected by a high-speed ATM network.
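As an illustration of the uniform redistribution of shared memory management after a failure, here is a minimal Python sketch. It is not Icare's algorithm: the function redistribute_managers, the page-to-manager map and the least-loaded-first rule are all assumptions chosen to show one simple way of spreading the faulty node's pages evenly over the surviving nodes.

```python
from collections import Counter


def redistribute_managers(manager_of, nodes, failed):
    """Reassign the pages managed by `failed` to the least-loaded survivors.

    manager_of: dict mapping page id -> managing node (hypothetical structure)
    nodes:      all node names, including the failed one
    Returns a new map; the input map is left unchanged.
    """
    survivors = sorted(n for n in nodes if n != failed)
    if not survivors:
        raise ValueError("no surviving node can take over page management")
    # Current management load of each surviving node.
    load = Counter({n: 0 for n in survivors})
    load.update(n for n in manager_of.values() if n != failed)
    new_map = dict(manager_of)
    for page in sorted(p for p, n in manager_of.items() if n == failed):
        # Pick the survivor managing the fewest pages (ties broken by name).
        target = min(survivors, key=lambda n: (load[n], n))
        new_map[page] = target
        load[target] += 1
    return new_map


managers = {0: "a", 1: "b", 2: "c", 3: "a", 4: "b", 5: "c"}
after = redistribute_managers(managers, ["a", "b", "c"], failed="c")
print(Counter(after.values()))
```

Assigning each orphaned page to the currently least-loaded survivor keeps the management load balanced even after several successive failures, since the function can simply be applied again to its own result.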
Results show that judicious placement of recovery data on different system nodes improves the performance of some applications, to the point that they achieve better execution times with Icare than with the same software DSM without fault tolerance mechanisms.
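The placement of recovery data on a node likely to reference it soon can be sketched as follows. This is a hypothetical heuristic, not the policy actually used in Icare: the function choose_recovery_node and the use of past access counts as a predictor of future references are assumptions made for illustration.

```python
def choose_recovery_node(owner, nodes, access_counts):
    """Pick where to save the recovery copy of a page.

    owner:         node holding the working copy (excluded, so that a single
                   failure cannot destroy both copies)
    nodes:         all node names
    access_counts: dict node -> recent accesses to this page (assumed predictor
                   of which node will reference the page next)
    """
    candidates = sorted(n for n in nodes if n != owner)
    if not candidates:
        raise ValueError("need at least two nodes to place a recovery copy")
    # max() returns the first maximal element, so ties go to the lowest name.
    return max(candidates, key=lambda n: access_counts.get(n, 0))


nodes = ["n0", "n1", "n2", "n3"]
print(choose_recovery_node("n0", nodes, {"n0": 10, "n2": 4, "n3": 1}))  # n2
```

If the chosen node later references the page, the data is already in its memory, so the recovery copy doubles as a prefetched replica and a page fault is avoided.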
Last modified 12.11.2009 01:51 PM