IRISA

Seminar

Friday, 4 April 1997 - 14:00
Michel Métivier Conference Room

João Gabriel Silva
Dept. of Informatics Engineering
Univ. of Coimbra - Portugal

Do the nodes of distributed systems fail silently?
An investigation by fault-injection.

One potential advantage of distributed systems is that the failure of one node does not prevent the other nodes from continuing their work. Unfortunately, when several nodes cooperate on a computation, this holds only if the system is explicitly prepared to handle the loss of a node. Such graceful degradation can be quite tricky to achieve unless the defective node fails cleanly, for instance by simply ceasing all interaction with the outside world, that is, unless it fails silently. This assumption is so convenient that most distributed algorithms developed to date assume fail-silent nodes. But how realistic is that assumption? Many experiments have been conducted in which faults were injected into various systems, showing that standard machines are quite far from failing silently. Some of those experiments will be described, along with their underlying methods and assumptions.

Finally, the talk will address whether those experiments mean that all distributed algorithms that assume fail-silence should be discarded.