*Title: *Distributed System Monitoring and Failure Diagnosis Using Cooperative Virtual Backdoors

 

*Keywords: *Fault tolerance, self-healing, high availability, operating system, distributed system, cluster

 

*Description: * As computing systems become increasingly present in our lives, more human activities depend on their correct functioning, their resistance to failures and attacks, and their quick repair. Human intervention is not a solution when system monitoring and repair must be done quickly and reliably regardless of scale, network availability, or system impairment. The first two steps towards automated system recovery are monitoring and failure diagnosis. The challenge is to provide these two functions even in the presence of failures or attacks, when the operating system of the machine to be monitored or diagnosed might not be available. The project consists of developing a monitoring and diagnosis framework for computer failures and attacks within the Backdoor architecture.

 

The Backdoor provides access to the system memory and I/O devices from a remote system without involving the target operating system. The Backdoor concept has been prototyped using an intelligent NIC, such as Myrinet, but it can also be implemented as a virtual machine running over a virtual machine monitor, such as VMware, UML, or Xen. Previous work has shown that backdoors can be programmed to execute various defensive activities such as monitoring, logging, repairing kernel data structures, and even recovering the application state from a failed system. However, there are situations in which a standalone backdoor cannot decisively detect a failure of its system because it lacks sufficient information. Cooperative detection, in which the states of multiple systems are available for comparison and repair, can help in such situations.

 

The objective of this project is to build a cooperative backdoor protocol over a cluster. The backdoor on each node will be virtual, i.e., a software component executed on a virtual machine. The guest operating system, which is the target of monitoring and failure detection, is executed on a different virtual machine on the same node. The backdoor of each node can communicate with the backdoors of the other nodes in the cluster to perform cooperative tasks. For instance, a backdoor can verify the integrity of the local OS by periodically comparing it with images of the OS running on other nodes. Similarly, a backdoor can use another backdoor for remote logging, or even for fast state refreshing or rebooting.
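The cooperative integrity check mentioned above can be illustrated with a minimal Python sketch. It is not the project's protocol, which remains to be designed. It assumes that each backdoor can compute a digest over a region of its guest's memory that should be identical on every node (for example, read-only kernel code of the same OS image), and that digests are exchanged by some transport not shown here. The function names `digest` and `check_integrity`, the node names, and the majority-vote rule are all hypothetical choices made for this illustration.

```python
import hashlib
from collections import Counter


def digest(region: bytes) -> str:
    """Return the SHA-256 hex digest of a memory region read by a backdoor."""
    return hashlib.sha256(region).hexdigest()


def check_integrity(local_digest: str, peer_digests: dict) -> str:
    """Compare the local digest with digests reported by peer backdoors.

    peer_digests maps a (hypothetical) node name to the digest it reported.
    Returns:
      "ok"        if the local digest matches a strict majority of all nodes,
      "suspect"   if a strict majority agrees on a different digest,
      "undecided" if no digest is held by a strict majority.
    """
    votes = Counter(peer_digests.values())
    votes[local_digest] += 1          # the local node votes too
    total = sum(votes.values())
    winner, count = votes.most_common(1)[0]
    if count * 2 <= total:            # no strict majority
        return "undecided"
    return "ok" if winner == local_digest else "suspect"


if __name__ == "__main__":
    clean = digest(b"kernel text, reference image")
    tampered = digest(b"kernel text, modified image")
    peers = {"node2": clean, "node3": clean, "node4": clean}
    print(check_integrity(clean, peers))     # local image agrees with peers
    print(check_integrity(tampered, peers))  # local image differs from majority
```

A standalone backdoor has only its own digest and no reference to compare it with. The vote is what lets a node conclude that its own image, rather than a peer's, is the one that differs. A real implementation would also have to handle regions that legitimately differ between nodes, as well as peers that are themselves faulty or unreachable.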

 

The internship will be split between Rutgers University, USA, and INRIA/IRISA, Rennes. During the time spent at Rutgers, the student will implement a cooperative backdoor over one of the existing virtual machine monitors (VMMs). As part of this stage, the student will explore the backdoor architecture and the existing VMM platforms to select the most appropriate one for the cooperative backdoor implementation. The student will collaborate with graduate students from the Disco Lab at Rutgers who developed the autonomous backdoor. During the time spent at INRIA/IRISA, the student will implement cooperative defensive activities like the ones described above, using either Kerrighed (if available on a VMM at the time of the internship) or Linux as the target. The student will learn about the Kerrighed architecture and will collaborate with graduate students and the Kerrighed team, including those currently involved in porting Kerrighed to a VMM.

 

 

 

*Bibliography: *

F. Sultan, A. Bohra, P. Gallard, I. Neamtiu, S. Smaldone, Y. Pan, and L. Iftode. Recovering Internet Services from Operating System Failures. In IEEE Internet Computing, March/April 2005.

 

Christine Morin, Renaud Lottiaux, Geoffroy Vallée, Pascal Gallard, David Margery, Jean-Yves Berthou, and Isaac Scherson. Kerrighed and Data Parallelism: Cluster Computing on Single System Image Operating Systems. In Proc. of Cluster 2004, IEEE, September 2004.