This page describes the Phenix "associate team", a joint research effort between the PARIS team of INRIA Rennes (IRISA), France, and the DiscoLab laboratory at Rutgers University, USA. The effort has been funded by the INRIA office of international relations since January 2005.
As computer systems become more complex, they also become more vulnerable to failures and attacks. Survivability and recoverability without compromising performance have emerged as guiding principles for system design[4]. The need for such features is heightened by critical applications that do not tolerate downtime and by an increasing demand for high availability at large scale (e.g., in the Internet). Computer vendors have already launched initiatives, and even market systems, with built-in support for hardware and firmware fault detection and containment[1,2]. However, despite previous research, existing commodity operating systems still lack support for accurate detection of anomalies (failures, attacks, etc.) and for fast and effective response in thwarting attacks and healing the effects of failures.

Our approach in the proposed research is to perform monitoring and healing actions on the operating system state remotely. To achieve this, we plan to use the Backdoor architecture originally proposed by the DiscoLab team[5]. Using Backdoors, remote monitoring and recovery/repair actions can be performed even when the processors of the remote system are unavailable due to failure or attack. Each computer in such a system will be equipped with a Backdoor, an intelligent network interface that provides the low-level functions required for remote access to the computer's resources (memory and I/O devices). Backdoors will be connected through a secure private network, over which any computer in the system can perform monitoring, tuning and healing actions on a remote system.

Clusters are now used to support the execution of data services in the Internet. The ultimate goal of the work we propose is the design of a self-healing operating system for clusters.
Keywords:
Cluster, operating system, system area network, RDMA, I/O, high availability, self-healing systems