Fault Tolerant Shared Memory Multiprocessor Architectures
Within the European Esprit project FASST, I worked on the design of an architecture based on stable memory technology in which fault tolerance is transparent to the system and to applications. The proposal is a shared memory multiprocessor architecture that provides a hardware backward error recovery mechanism relying on the properties of its shared memory, which is a recoverable shared memory (RSM).
The RSM is used to store processor recovery points. It stores two values for each memory block: the current value accessed by processors and the recovery value updated when a recovery point is established.
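The two-value organization can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design: the names `Block`, `RecoverableSharedMemory` and the `dirty` set are hypothetical, and the model establishes one global recovery point rather than one per group of processors.

```python
from dataclasses import dataclass


@dataclass
class Block:
    current: int = 0    # value read and written by processors
    recovery: int = 0   # value saved at the last recovery point


class RecoverableSharedMemory:
    """Toy model: every block keeps a current and a recovery value."""

    def __init__(self, n_blocks: int) -> None:
        self.blocks = [Block() for _ in range(n_blocks)]
        # Assumption: track modified blocks so that only they are copied.
        self.dirty = set()

    def read(self, addr: int) -> int:
        return self.blocks[addr].current

    def write(self, addr: int, value: int) -> None:
        self.blocks[addr].current = value
        self.dirty.add(addr)

    def establish_recovery_point(self) -> None:
        # Copy current -> recovery for the blocks modified since the last point.
        for addr in self.dirty:
            self.blocks[addr].recovery = self.blocks[addr].current
        self.dirty.clear()

    def rollback(self) -> None:
        # Restore current <- recovery, discarding uncommitted modifications.
        for addr in self.dirty:
            self.blocks[addr].current = self.blocks[addr].recovery
        self.dirty.clear()
```

In this model, a write that follows a recovery point is undone by `rollback()`, while values saved by `establish_recovery_point()` survive it.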
The problem raised by a backward error recovery protocol is the coherence of the saved state in the presence of communicating processors. We have developed a new mechanism in which processors coordinate the establishment of their recovery points in order to ensure the coherence of the saved system state. To do so, the RSM records inter-processor dependencies when communications occur, i.e. when processors access shared data. When a processor establishes a recovery point, the RSM computes from the dependency information the group of processors that must also establish a recovery point. All these processors coordinate the establishment of their recovery points to atomically save a coherent global system state. This technique works with standard caches and cache coherence protocols and does not suffer from the domino effect in the event of a rollback.
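The group computation can be sketched in software as a reachability search over recorded dependencies. The code below is a simplified illustration, not the RSM hardware protocol: the class `DependencyTracker` and its methods are hypothetical names, dependencies are assumed to be symmetric and created only when a processor reads a block last written by another processor, and the group is assumed to be the transitive closure of those dependencies.

```python
from collections import defaultdict


class DependencyTracker:
    """Toy model of dependency recording and recovery group computation."""

    def __init__(self) -> None:
        self.deps = defaultdict(set)  # processor -> directly dependent processors
        self.last_writer = {}         # block address -> processor that last wrote it

    def on_write(self, proc: int, addr: int) -> None:
        self.last_writer[addr] = proc

    def on_read(self, proc: int, addr: int) -> None:
        # Reading data written by another processor is a communication.
        writer = self.last_writer.get(addr)
        if writer is not None and writer != proc:
            self.deps[proc].add(writer)
            self.deps[writer].add(proc)

    def recovery_group(self, initiator: int) -> set:
        # Depth-first search: every processor reachable through dependencies
        # must establish its recovery point together with the initiator.
        group = {initiator}
        stack = [initiator]
        while stack:
            p = stack.pop()
            for q in self.deps[p]:
                if q not in group:
                    group.add(q)
                    stack.append(q)
        return group

    def commit(self, group: set) -> None:
        # Once the group has saved its state, its dependencies are cleared
        # (assumption: data written before the point no longer creates any).
        for p in group:
            self.deps.pop(p, None)
        self.last_writer = {
            a: w for a, w in self.last_writer.items() if w not in group
        }
```

For example, if processor 1 reads a block written by processor 0 and processor 2 reads a block written by processor 1, then `recovery_group(0)` contains processors 0, 1 and 2, while a processor that has not communicated forms a group on its own. Because the whole group saves its state together, a later rollback cannot propagate beyond the group, which is how the domino effect is avoided in this model.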
A performance evaluation has been carried out by simulation with parallel application traces. A tool for simulating shared memory multiprocessors (SPAM) has been developed. This tool comprises an execution trace generator based on a code instrumentation method and a simulator based on an execution-driven technique. The study shows only a small performance degradation compared with an identical architecture without fault tolerance mechanisms. It also shows that our architecture performs better than fault tolerant shared memory multiprocessors such as Sequoia and Carer, in which recovery points are established much more frequently. The dependency management mechanism makes the recovery point establishment frequency independent of the machine workload. These studies also show that, in order to obtain good performance, it is essential to minimize the cost of memory block copies when a recovery point is established.
The recovery protocol implemented in hardware must be taken into account by the operating system. Only a few modifications have to be made to the microkernel. They deal with the integration of the protocol, which requires that a processor save its registers in a predefined memory area and flush the modified data of its cache when a recovery point is established. When a rollback takes place after a processor failure, the system has to guarantee the coherence of its internal data, such as the lists used by the scheduler. Finally, the most difficult problem is the management of I/O operations, which are not recoverable. The system has to take action to avoid the loss or duplication of I/O requests in the presence of processor failures.
Last modified 12.11.2009 01:51 PM