*Title:
*Distributed System Monitoring and Failure Diagnosis using Cooperative Virtual
Backdoors
*Keywords:
*Fault tolerance, self-healing, high availability, operating system,
distributed system, cluster
*Description:
*As computing systems become increasingly present in our lives, more human
activities depend on their correct functioning, their resistance to failures
and attacks, and their quick repair. Human intervention is not a viable
solution when monitoring and repair must be done quickly and reliably,
regardless of scale, network availability, or system impairment. The first
two steps towards automated system recovery are monitoring and failure
diagnosis. The challenge is to provide these two functionalities even in the
presence of failures or attacks, when the operating system of the system to be
monitored or diagnosed might not be available. The project consists of
developing a monitoring and diagnosis framework for computer failures and
attacks within the Backdoor architecture.
The Backdoor provides access to the system memory and I/O devices from a remote
system without involving the target operating system. The Backdoor concept has
been prototyped using an intelligent NIC, such as Myrinet, but it can also be
implemented as a virtual machine over a virtual machine monitor, such as
VMware, UML or Xen. Previous work has shown that backdoors can be programmed to
execute various defensive activities such as monitoring, logging, repair of
kernel data structures, and even recovery of the application state from a
failed system. However, there are situations in which a standalone backdoor
cannot decisively detect a failure of its system because it lacks sufficient
information. Cooperative detection can help in these situations: the states of
multiple systems are available for comparison and repair.
The objective of this project is to build a cooperative backdoor protocol over
a cluster. The backdoor on each node will be virtual, i.e., a software
component executed on a virtual machine. The guest operating system, which is
the target of monitoring and failure detection, is executed on a different
virtual machine. The backdoor of each node in the cluster can communicate with
the backdoors of other nodes to perform cooperative tasks. For instance, a
backdoor can verify the integrity of the local OS by periodically comparing it
with images of the OS running on other nodes. Similarly, a backdoor can use
another backdoor for remote logging or even for fast state refreshing or
rebooting.
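As an illustration of the integrity comparison mentioned above, here is a
minimal sketch in Python. It assumes that each backdoor can compute a digest of
a read-only region of its guest OS (for example, the kernel text), that all
nodes run identical images, and that digests are exchanged between peers. The
function names, the node names, and the strict-majority rule are hypothetical
choices made for illustration; they are not part of the Backdoor architecture.

```python
import hashlib
from collections import Counter


def digest(region: bytes) -> str:
    """Return a SHA-256 digest of a memory region read by a backdoor."""
    return hashlib.sha256(region).hexdigest()


def find_suspect_nodes(digests: dict[str, str]) -> list[str]:
    """Return the nodes whose digest differs from the strict-majority digest.

    `digests` maps a (hypothetical) node name to the digest that node's
    backdoor reported. A ValueError is raised when no digest is held by more
    than half of the nodes, since no decision can be made in that case.
    """
    if not digests:
        return []
    counts = Counter(digests.values())
    majority, votes = counts.most_common(1)[0]
    if votes * 2 <= len(digests):
        raise ValueError("no strict majority; cannot decide")
    return sorted(node for node, d in digests.items() if d != majority)


# Usage: three nodes report digests; node-c has one modified byte.
kernel_text = b"\x90" * 64
reports = {
    "node-a": digest(kernel_text),
    "node-b": digest(kernel_text),
    "node-c": digest(kernel_text[:-1] + b"\xcc"),
}
print(find_suspect_nodes(reports))  # ['node-c']
```

Requiring a strict majority is one possible design choice: with only two nodes
that disagree, the sketch refuses to name a suspect instead of guessing.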
The internship will be split between Rutgers University, USA, and INRIA/IRISA,
Rennes. During the time spent at Rutgers, the student will implement a
cooperative backdoor over one of the existing virtual machine monitors. As part
of this stage, the student will explore the backdoor architecture and the
existing VMM platforms to select the most appropriate one for the cooperative
backdoor implementation. The student will collaborate with graduate students
from the Disco Lab at Rutgers who developed the autonomous backdoor. During the
time at INRIA/IRISA, the student will implement cooperative defensive
activities like the ones described above, using either Kerrighed (if available
on a VMM at the time of the internship) or Linux as the target. The student
will learn about the Kerrighed architecture and will collaborate with graduate
students and the Kerrighed team, including those currently involved in porting
Kerrighed to a VMM.
*Bibliography:
*F. Sultan, A. Bohra, P. Gallard, I. Neamtiu, S. Smaldone, Y. Pan, and L.
Iftode. Recovering Internet Services from Operating System Failures. In IEEE
Internet Computing, March/April 2005.
Christine Morin, Renaud Lottiaux, Geoffroy Vallée, Pascal Gallard, David
Margery, Jean-Yves Berthou, and Isaac Scherson. Kerrighed and Data Parallelism:
Cluster Computing on Single System Image Operating Systems. In Proc. of Cluster
2004, September 2004. IEEE.