Phenix Workshop

Self-healing and Fault Tolerant Systems

December 1-2, 2005
IRISA Laboratory, Rennes, room Valorisation

Abstracts


Building Reliable Systems Using Backdoors

Stephen Smaldone, Ph.D. candidate, Rutgers University

Abstract:

Despite decades of work on verification techniques, fault tolerance, and security, systems remain vulnerable to failures and attacks. As system complexity increases, human-assisted monitoring, maintenance, and intervention become prohibitively costly, unacceptably slow, and sometimes ineffective. New system designs must now consider fault tolerance, recoverability, self-healing, and monitoring without compromising performance. This talk explores our approach to building reliable systems. Our work in this area is centered around the Backdoor (BD), a novel architecture which uses commodity programmable network interface cards (NICs) with specialized firmware and OS extensions to provide non-intrusive Remote Healing of systems. I will discuss the application of the BD for: (i) remote monitoring of healthy systems to detect system faults, (ii) remote repair of system state in-place, which allows an otherwise damaged system to continue service, and (iii) remote recovery of critical state that might still be intact in a failed system’s memory. In addition, this talk will explore multi-layer Defensive Architectures, in which BDs automate the execution of defensive activities on a single system, or between collaborative systems across the local and wide areas. Finally, I will present Orion, which extends our previous work in remote monitoring to analyze system behavior in order to understand, and possibly predict, conditions leading to system faults.


Towards Automated Defense from Rootkit Attacks

Arati Baliga, Ph.D. candidate, Rutgers University

Abstract:

The spread of malware is a growing trend in today's increasingly networked world. Worm and virus writing is no longer done only for fun; it is increasingly geared towards profit. A deadly kind of malware, the rootkit, evades detection and tries to hide its presence from the administrator. Rootkits often consist of sniffers, log erasers, and backdoors that allow the attacker to cover their tracks and retain control of your system remotely. When these are bundled with worms and viruses, they can escape detection by anti-virus software as well. We explore existing solutions for dealing with rootkit attacks. We then describe a model to detect and contain the effects of a rootkit attack automatically in a virtual machine environment.


Distributed Applications Recovery using Speculations

Cristian Tapus, Ph.D. candidate, California Institute of Technology (Caltech)

Abstract:

This talk examines the use of speculations, a form of distributed transactions, to improve reliability of distributed applications. A speculation is defined as a computation that is based on an assumption that is not validated before the computation is started. If the assumption is later found to be false, the computation is aborted and the state of the program is rolled back; if the assumption is found to be true, the results of the computation are committed. The primary difference between a speculation and a transaction is that a speculation is not isolated---for example, a speculative computation may send and receive messages, and it may modify shared objects. As a result, processes that share those objects may be absorbed into a speculation. Speculations define safe recovery lines that can be used to roll back distributed applications. First, I will discuss the syntax of speculative constructs and the operational semantics for speculative execution. Further, I will present two approaches to implementing speculations: first, as a set of programming language features inside a compiler, and second, as a kernel level module.
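As a rough illustration of the commit and rollback behaviour described above (not the speaker's implementation), here is a minimal single-process sketch in Python. The names `Speculation` and `speculate` are hypothetical, the program state is assumed to be a plain dictionary, and the sketch does not model messages or other processes being absorbed into the speculation.

```python
import copy


class Speculation:
    """Hypothetical helper: snapshot state so it can be rolled back."""

    def __init__(self, state):
        self.state = state      # shared, mutable program state (a dict here)
        self._snapshot = None

    def begin(self):
        # Save a deep copy so later in-place changes can be undone.
        self._snapshot = copy.deepcopy(self.state)

    def commit(self):
        # Assumption held: keep the changes and drop the snapshot.
        self._snapshot = None

    def abort(self):
        # Assumption failed: restore the state saved at begin().
        self.state.clear()
        self.state.update(self._snapshot)
        self._snapshot = None


def speculate(state, assumption_holds, computation):
    """Run `computation` before the assumption is validated.

    Returns True if the results were committed, False if rolled back.
    """
    spec = Speculation(state)
    spec.begin()
    computation(state)          # runs on the unvalidated assumption
    if assumption_holds():      # validation happens only afterwards
        spec.commit()
        return True
    spec.abort()
    return False


if __name__ == "__main__":
    account = {"balance": 100}

    def debit(s):
        s["balance"] -= 30

    print(speculate(account, lambda: False, debit), account)
    print(speculate(account, lambda: True, debit), account)
```

In the distributed setting the abstract describes, the snapshot would have to cover every process drawn into the speculation, which is what makes the recovery line a distributed one.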


Information Flow Control in Mobile Systems

Nishkam Ravi, Ph.D. candidate, Rutgers University

Abstract:

Sharing private information while preserving privacy is a challenging task. Existing information-flow control models preserve privacy by isolating public data from private data. Data isolation, however, is not applicable to many real applications. In this talk I will present a new model for information-flow control called Non-Inference. Non-Inference allows public data to be derived from private data, but requires that an adversary not be able to infer the value of private data from public data. I will discuss the theoretical implications of Non-Inference, and show how it can be enforced using static program analysis in the context of location privacy. Finally, I will discuss a class of applications where Non-Inference can be applied.


Designing an Inter-vehicular network stack for Car-to-Car communication

Pravin Shankar, Ph.D. candidate, Rutgers University

Abstract:

Recent advances in wireless vehicle-to-vehicle (V2V) communication systems enable the development of Vehicular Ad Hoc Networks (VANETs) and create significant opportunities for the deployment of a wide variety of vehicular applications and services. However, vehicular networks have specific mobility conditions and application requirements that differentiate them from other networks. We propose an inter-vehicular network stack architecture which supports the unique demands of car-to-car communication. The data-link layer makes use of short-range wireless network interfaces (IEEE 802.11) as well as the cellular network (GPRS/3G), and switches between the two based on network conditions and application demands. The data aggregation and validation layer aggregates traffic information based on distance, and validates the information for correctness and timeliness. The information query layer enables vehicles to query for information about specific objects or places, such as the current road conditions at some driver-specified location. Finally, we describe TrafficView, a prototype system which uses V2V communication to disseminate real-time vehicle position and traffic density information between cars.


Transparent Parallel Applications Checkpointing in Kerrighed

Matthieu Fertré, Ph.D. candidate, IRISA/Université de Rennes 1

Abstract:

Nowadays, clusters are widely used to execute scientific applications. These applications are often message-passing parallel applications with long execution times. Since the number of nodes in clusters is growing, the probability of a node failure during the execution of an application increases, and the application execution time may be greater than the cluster mean time between failures (MTBF). To avoid restarting an application from the beginning, fault tolerance mechanisms such as checkpoint/restart are needed. Currently, checkpoint/restart mechanisms are either implemented directly in the application source code by application programmers or integrated in communication environments such as MPI or PVM. We propose in this paper a new approach in which checkpoint/restart mechanisms for parallel applications are implemented in a cluster single system image operating system. While this kernel-level approach is more complex to implement than other approaches, it is more general because it does not require any modification, recompilation, or relinking of the applications, whatever communication environment they rely on. Our approach has been implemented in Kerrighed, a single system image operating system based on Linux.
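To make the MTBF argument above concrete, the following Python sketch estimates the cluster MTBF and the probability of at least one node failure during a run. It assumes independent nodes with exponentially distributed failure times, which the abstract does not state; the function names and the figures in the example are illustrative only, not measurements from Kerrighed.

```python
import math


def cluster_mtbf(node_mtbf_hours, num_nodes):
    """Mean time to the first failure among independent, identical nodes.

    Under the exponential assumption, failure rates add, so the
    cluster MTBF is the per-node MTBF divided by the node count.
    """
    return node_mtbf_hours / num_nodes


def prob_failure_during_run(node_mtbf_hours, num_nodes, run_hours):
    """Probability that at least one node fails within `run_hours`."""
    rate = num_nodes / node_mtbf_hours   # cluster-wide failures per hour
    return 1.0 - math.exp(-rate * run_hours)


if __name__ == "__main__":
    # Illustrative numbers: 100 nodes, each with a 10,000-hour MTBF,
    # running a 100-hour job.
    print(cluster_mtbf(10_000, 100))
    print(prob_failure_during_run(10_000, 100, 100))
```

Under these assumptions, a job whose run time equals the cluster MTBF has a failure probability of 1 - 1/e (about 63%), which is the situation that motivates checkpoint/restart.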


Managing fault-tolerance and data consistency in the JuxMem grid data-sharing service

Sébastien Monnet, Ph.D. candidate, IRISA

Abstract:

Grid computing has recently emerged as a response to the growing demand for resources (processing power, storage, etc.) exhibited by scientific applications. Whereas complex computing infrastructures are available and are able to transparently schedule computations on distributed architectures, data storage and data transfer must still be managed explicitly by the programmer. We claim that storing, accessing, updating and sharing such data should be considered by applications as an external service. We propose such a service: JuxMem, whose goal is to provide transparent access to mutable data, while enhancing data persistence and consistency despite node disconnections or failures. This presentation focuses on the problem of handling the consistency of replicated data in the presence of failures. We propose a software architecture which decouples consistency management from fault tolerance management. We illustrate this architecture with a case study showing how to design a consistency protocol using fault-tolerant building blocks.