Single System Image OS for Clusters: Kerrighed Approach

Christine Morin, INRIA

In my talk, I will present Kerrighed, a single system image operating system for high performance computing on clusters. Kerrighed targets ease of programming, high performance and high availability. Ease of programming is achieved because Kerrighed supports both the message passing and the shared memory programming models. Kerrighed exploits the performance of the underlying hardware by providing global management of all cluster resources (processor, memory and disk).

Kerrighed also provides dynamic resource management to make cluster configuration changes transparent to applications and to guarantee system availability in the presence of node failures. Kerrighed offers sequential and parallel applications a checkpointing facility. Several kinds of applications can take advantage of Kerrighed. We currently target scientific applications such as numerical simulations (including OpenMP, MPI and POSIX multithreaded applications).

Kerrighed is implemented as an extension to the Linux operating system (a set of Linux modules and a small patch to the kernel). Kerrighed is independent of the cluster interconnection network.


Christine Morin

Christine Morin holds a research director position at INRIA (http://www.inria.fr). She carries out her research activities in
the PARIS project-team (http://www.irisa.fr/paris) at the IRISA research center (http://www.irisa.fr), the INRIA research unit in Rennes.
She currently leads a research activity aiming at designing and building a single system image operating system, called Kerrighed (formerly
Gobelins), for high performance computing on clusters (http://www.kerrighed.org). She has made contributions to the design of
fault-tolerant shared memory multiprocessor architectures (SMP, COMA, clusters) and to the design of distributed systems.
Christine Morin received an engineering degree from the Institut National des Sciences Appliquées (INSA) of Rennes (France) in 1987, and
master's and PhD degrees in computer science from the University of Rennes 1 in 1987 and 1990, respectively. She received the "Habilitation à diriger des recherches" in computer science from the University of Rennes 1 in 1998.

Open Source Cluster Application Resources (OSCAR)

Dr. Stephen L. Scott, ORNL

The Open Source Cluster Application Resources (OSCAR) project provides a cluster software stack offering a complete infrastructure for cluster
computing. The OSCAR project started in April 2000, with its first public release a year later as a self-installing compilation of "best
practices" for high-performance classic Beowulf cluster computing. Since its inception approximately three years ago, OSCAR has matured to
include cluster installation, maintenance, and operation capabilities, and as a result has become one of the most popular cluster computing
packages worldwide. In the past year, OSCAR has begun to expand into other cluster paradigms, including a diskless cluster solution (Thin
OSCAR) and a high-availability version providing fault-tolerance capabilities (HA-OSCAR). In this talk, I will discuss the current status of the OSCAR project, including some preliminary information on its two latest incarnations - Thin OSCAR and HA-OSCAR.

Stephen Scott (scottsl@ornl.gov)


Dr. Stephen L. Scott is a senior research scientist in the Network and Cluster Computing Group of the Computer Science and Mathematics Division of Oak Ridge National Laboratory, USA. Stephen's responsibilities include research and development efforts in high performance scalable cluster computing. His primary research interest is in experimental systems, with a focus on high performance, scalable, distributed, heterogeneous, and parallel computing. Stephen is a founding member of, and sits on the steering committee of, The Open Cluster Group (OCG), a consortium of research and industry organizations dedicated to making cluster computing practical for high performance computing. He is also a founding member, version 2 release manager, and past working group chair of the OCG's primary working group, Open Source Cluster Application Resources (OSCAR). This working group is dedicated to bringing current "best practices" in cluster computing to all users via a self-installing software suite. He is also a contributor to the Parallel Virtual Machine (PVM) and Heterogeneous Adaptable Reconfigurable NEtworked SystemS (HARNESS) research efforts at ORNL. Stephen has a Ph.D. and an M.S. in computer science and is a member of the ACM, the IEEE Computer Society, and the IEEE Task Force on Cluster Computing.

Personal www.csm.ornl.gov/~sscott
Cluster tools www.csm.ornl.gov/ClusterPowerTools
TORC www.csm.ornl.gov/torc
PVM www.csm.ornl.gov/pvm
HARNESS www.csm.ornl.gov/harness
OCG www.OpenClusterGroup.org
OSCAR www.OpenClusterGroup.org/OSCAR

Evolution of Linux towards Clustering

Andrea Arcangeli, SuSE

The presentation will address some of the recent innovations in the Linux kernel, especially those related to the new 2.6 kernel, and we will analyze how
they can affect and improve clustering and distributed/parallel computing.

Andrea Arcangeli (andrea@suse.de)

Andrea Arcangeli works for SuSE as a kernel developer on many parts of the Linux kernel, including memory management, the scheduler, the
I/O subsystem, the x86-64 port, and networking. His primary objective is to make Linux ever more reliable, performant, responsive and scalable.

High-Availability Clusters using Linux-HA

Alan Robertson, IBM

High-Availability (HA) clustering is a clustering technique in which services are provided by the cluster as a whole, rather than by individual servers. Failures of individual nodes and services are recovered from using redundancy in the cluster. The Linux-HA project is the oldest and best known open source HA project on Linux.

This talk will discuss the Linux-HA project - its capabilities, limitations and future plans. In addition, we will also discuss the Open Cluster
Framework project, which is defining clustering APIs (HA and HPC) for Linux.
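As a concrete illustration of recovery through redundancy, below is a minimal sketch of a classic two-node heartbeat (v1-style) configuration; the node names, network interface, service IP address, and timing values are placeholder assumptions for the example, not details from the talk.

```
# /etc/ha.d/ha.cf -- cluster membership and failure detection
node    alpha
node    beta
bcast   eth0           # heartbeat messages broadcast on this interface
keepalive 2            # seconds between heartbeats
deadtime  10           # declare a peer dead after this many silent seconds
auto_failback on       # resources return to their preferred node on recovery

# /etc/ha.d/haresources -- resources normally owned by node "alpha"
alpha 192.168.1.100 httpd
```

If `alpha` stops sending heartbeats for `deadtime` seconds, `beta` takes over the floating address 192.168.1.100 and starts `httpd`, so clients see the service survive the node failure.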

Alan Robertson


Alan Robertson has been an active developer and project leader for High-Availability Linux for the last several years. He maintains the Linux-HA project web site at http://linux-ha.org, and has been a key developer for the open source heartbeat program. He worked for SuSE for a year, then joined IBM's Linux Technology Center in March 2001.

Alan also jointly leads the Open Cluster Framework effort (http://opencf.org/) to define standard APIs for clustering, and provide an open source reference implementation of these APIs.

Before joining SuSE, he was a Distinguished Member of Technical Staff at Bell Labs, where he worked for 21 years in a variety of roles, including developing products, designing communication controllers and providing leading-edge computing support.

He obtained an MS in Computer Science from Oklahoma State University in 1978 and a BS in Electrical Engineering from OSU in 1976.

Deploying Clusters at Electricité de France
Jean-Yves Berthou, EDF R&D

Abstract: For decades, through its missions for the EDF Group, the EDF R&D Division has been developing and maintaining scientific applications. At the end of the 1990s, EDF R&D undertook a deep and wide-ranging re-examination of the software architecture of its applications and the organization of its computing facilities. This reorganization of scientific computing has made clusters of PCs viable target machines for departmental or project use.

The CALIBRE project was launched in 2000 in order to spread PC cluster technology at EDF R&D. Its objectives were to study the technical feasibility of such platforms, evaluate their total cost of ownership (TCO), develop expertise and build a service offering. The deployment of clusters has now reached beyond the R&D division.

This talk will present the results of the CALIBRE project and the roadmap for the deployment of clusters at EDF.

Jean-Yves Berthou


Jean-Yves Berthou has been a researcher at EDF R&D since 1997. He regularly teaches computer science at various French universities and engineering schools. He received a Ph.D. in computer science from the "Pierre et Marie Curie" University (Paris VI) in 1993. He also worked for two years for the CEA, the French Atomic Energy Commission, as an expert in High Performance Computing. His current research and teaching deal mainly with Parallel Programming and Software Architecture for scientific computing.

Jean-Yves Berthou currently heads the Applied Scientific Computing Group at EDF R&D. The group's main areas of responsibility are Software Architecture, Code Optimization, High Performance Computing, and Cluster and Grid Computing.

CLIC project

Yves Denneulin, ENSIMAG

CLIC, which stands for "Cluster LInux pour le Calcul" (Linux cluster for computation), is a project sponsored by the French government. Its aim
is to popularize the use of clusters with an easy-to-install GPL clustering suite.
To meet this requirement, CLIC combines Mandrake Linux expertise, for easy installation and HPC-specific hardware
support, with a growing set of tools developed by researchers, in a 32- and 64-bit Linux distribution.
These tools include management and deployment tools, developed at the ID laboratory and already in use on its
clusters, and contributions that allow research teams from various fields (bioinformatics, astronomy, ...)
to easily distribute their software through RPMs.

Cluster Computing at Compagnie Générale de Géophysique

Jean-Yves Blanc, CGG

Compagnie Générale de Géophysique (CGG) is a leading supplier of geophysical products and services to the worldwide oil and gas industry. CGG's Sercel® subsidiary produces seismic sources and data acquisition equipment. Based in Paris, France, CGG works worldwide on land and offshore to gather seismic data. CGG's processing and reservoir services offer seismic data management and processing as well as reservoir geophysics activities.

Because seismic processing requires the capacity to handle terabytes of data and is also highly compute-intensive, CGG has always been a major European supercomputing player. CGG now operates clusters at very large scale to process our seismic data.
Our basic goal is to achieve a better price/performance ratio than the NUMA systems the clusters were meant to replace. Owing to the nature of clusters, many interesting challenges had to be overcome in our design and our operational mode in order to achieve the best TCO and performance:
· being based on commodity hardware, cluster nodes have limited reliability
· clusters are not well balanced for intensive calculations
· clusters are complex systems, and large-scale industrial operation is far from trivial
· the evolution of computer hardware makes them obsolete in a few months; short-term leases are expensive and obsolescence is disruptive for large processing centers

CGG today operates more than 15,000 CPUs in clusters worldwide. This huge computing power, combined with CGG's Geocluster, allows us to efficiently produce quality products for all our oil and gas clients, day after day.

Jean-Yves Blanc

Jean-Yves Blanc is the IT Architect for the Processing & Reservoir Business Unit. He defines the IT strategy and architects the processing centers' IT infrastructure in collaboration with the operational units. J-Y. Blanc joined CGG in 1992. Prior to his current position at CGG, he was head of the Parallel Computing Development Group. He holds a Ph.D. in Applied Mathematics from the Institut Polytechnique de Grenoble, France. He reports to Laurent Vercelli, manager of the IT Industrialisation Department. CGG is headquartered in Paris, France.