Solidor research

Stardust

A runtime system for parallel applications above heterogeneous workstations

Members

Gilbert Cabillic, Isabelle Puaut

Objectives

Inexpensive, powerful workstations have proliferated rapidly in the last few years, and machine performance is likely to keep increasing for several years, with faster processors and multiprocessor machines. Studies have shown that for a large percentage of their lifetime, these machines are used for small tasks (reading e-mail, editing files) and are idle on average at least 90% of the time, even during peak hours. One possible use of these idle cycles is to run parallel applications on networks of workstations. The network of workstations can then be considered a parallel computer, or hypercomputer, whose performance is similar to that of a parallel machine at a much lower cost. Numerous research efforts have tried to exploit the computing power of networks of workstations, such as PVM and MPI, based on the message-passing paradigm, and Linda and Mermaid, based on the shared-memory paradigm. To be fully usable, an environment for parallel computing on networks of workstations should include the following facilities:

Support for multiple programming paradigms: two classes of programming paradigms for parallel applications can be distinguished: the message-passing paradigm, in which processes communicate through message exchanges, and the shared-memory paradigm, in which processes communicate by reading and writing shared data items. Although any application can be written using either paradigm, some studies have shown that application performance can be improved by combining message passing and shared memory.
Support for heterogeneity: existing computing environments often include a number of computers with various architectures and performance (parallel machines, workstations) and different operating systems; more computing power is then available to environments supporting heterogeneous machines.
Support for load balancing and application reconfiguration: the concept of ownership is frequently present when using workstations for executing parallel applications. Workstation owners do not want their machines to be overloaded by the execution of parallel applications, or simply want exclusive access to their machines when they are working. Reconfiguration mechanisms are thus required to balance the load across machines and to allow parallel computations to coexist with other applications.
Support for fault tolerance: as the number of nodes performing a parallel computation increases, the probability of a node failure also increases. The failure may be a hardware failure or a reboot by the owner of the workstation. Without provision for tolerating such failures, the entire computation must be restarted.

Contributions

Stardust is an environment that provides all these facilities. Processes executing on Stardust can communicate through both message passing and (page-based) distributed shared memory (DSM). Library calls are provided for sending and receiving messages, as well as for creating a shared memory region and mapping it into a process's address space. Stardust executes on a heterogeneous computing environment composed of a parallel machine (the Intel Paragon) and PCs connected by an ATM network.
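
The two communication styles can be contrasted with a small sketch. It does not use Stardust's library calls, which are not listed here; it assumes Python threads as stand-ins for processes, a queue as the message channel, and an anonymous memory mapping as the shared region. The names sum_by_messages and sum_by_shared_memory are hypothetical.

```python
import mmap
import queue
import struct
import threading

# Assumption: each worker owns one 64-bit signed integer slot in the region.
SLOT = struct.Struct("<q")


def sum_by_messages(chunks):
    """Message passing: every worker sends its partial sum as a message."""
    inbox = queue.Queue()
    workers = [
        threading.Thread(target=lambda c=c: inbox.put(sum(c))) for c in chunks
    ]
    for w in workers:
        w.start()
    # The receiver blocks on each message; no data is shared directly.
    total = sum(inbox.get() for _ in chunks)
    for w in workers:
        w.join()
    return total


def sum_by_shared_memory(chunks):
    """Shared memory: every worker writes its partial sum into a mapped region."""
    region = mmap.mmap(-1, max(SLOT.size, SLOT.size * len(chunks)))

    def worker(index, chunk):
        # Each worker writes only to its own slot, so no lock is needed here.
        SLOT.pack_into(region, index * SLOT.size, sum(chunk))

    workers = [
        threading.Thread(target=worker, args=(i, c)) for i, c in enumerate(chunks)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # joining acts as the synchronization point before reading
    total = sum(
        SLOT.unpack_from(region, i * SLOT.size)[0] for i in range(len(chunks))
    )
    region.close()
    return total
```

In the first function the data travels inside messages; in the second, the workers and the reader agree only on the layout of a common region and synchronize separately.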

Stardust includes transparent mechanisms for converting data between different types of hosts. These mechanisms are used for converting both the contents of communication buffers and the contents of shared virtual memory regions. In addition, to cope with the processors' different instruction formats, each program developed with Stardust is compiled for every type of architecture.
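
As a rough illustration of what such a conversion involves, the sketch below re-encodes a buffer from one byte order to another. It is not Stardust's mechanism: the record layout (one 32-bit integer followed by one 64-bit float) and the function name convert are assumptions made for the example, and it ignores differences in type sizes and alignment that real hosts may also have.

```python
import struct

# Hypothetical record layout: one 32-bit int followed by one 64-bit float.
LAYOUT = "id"


def convert(buffer, layout, src_order, dst_order):
    """Re-encode `buffer` from the source byte order to the destination one.

    `src_order` and `dst_order` are struct prefixes: "<" for little-endian,
    ">" for big-endian. The layout must be known, because bytes can only be
    swapped correctly when the boundaries of each value are known.
    """
    values = struct.unpack(src_order + layout, buffer)
    return struct.pack(dst_order + layout, *values)


# A buffer as a little-endian host would hold it in memory.
little = struct.pack("<" + LAYOUT, 7, 1.5)
# The same record as a big-endian host expects it.
big = convert(little, LAYOUT, "<", ">")
```

The point of the sketch is that conversion needs type information about the buffer or memory region; a raw byte copy between hosts with different byte orders would silently corrupt the values.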

In addition to its support for heterogeneity, Stardust includes a mechanism used both for checkpointing and for load balancing. The basic principle of this mechanism is to consider an application state only when all the application processes have reached a synchronization point (synchronization barrier) in their main program (C function main). The application state can then be saved to disk to recover from node failures, or moved to other nodes to balance the system load. An architecture-independent standard representation of data is used for saving an application state. No architecture-dependent data (e.g., thread stacks) is saved to disk or transferred to other nodes. Hence, migration of applications between heterogeneous hosts is made possible.
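
The principle of capturing state only at a barrier can be sketched as follows. This is an illustration under stated assumptions, not Stardust's implementation: threads stand in for application processes, JSON text stands in for the architecture-independent representation, an in-memory list stands in for the disk, and the function run and its state dictionary are hypothetical.

```python
import json
import threading


def run(state, steps, checkpoints):
    """Advance every worker `steps` times, checkpointing at each barrier.

    `state` holds only application data ({"step": int, "values": list});
    no thread stacks or other host-specific data are captured.
    """
    values = list(state["values"])
    step = state["step"]

    def save():
        # Called by exactly one thread once all workers have reached the
        # barrier and before any of them is released, so the snapshot
        # is consistent.
        nonlocal step
        step += 1
        checkpoints.append(json.dumps({"step": step, "values": values}))

    barrier = threading.Barrier(len(values), action=save)

    def worker(rank):
        for _ in range(steps):
            values[rank] += rank + 1  # hypothetical unit of work
            barrier.wait()

    workers = [threading.Thread(target=worker, args=(r,)) for r in range(len(values))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return {"step": step, "values": values}
```

Because a checkpoint contains only data in a portable encoding, calling run again with a decoded checkpoint resumes the computation from that barrier, which is the same operation whether the goal is recovery after a failure or migration to another node.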

Publications

Gilbert Cabillic and Isabelle Puaut.
Stardust: an Environment for Parallel Programming on Networks of Heterogeneous Workstations.
Journal of Parallel and Distributed Computing, 40(1), January 1997.
You can download its postscript version (93K).
Gilbert Cabillic and Isabelle Puaut.
Stardust: an environment for parallel programming on networks of heterogeneous workstations.
IRISA Technical Report No.1006, April 1996.
You can read the abstract of the paper, or download its postscript version (128K).
Gilbert Cabillic and Isabelle Puaut.
Dealing with heterogeneity in Stardust: an environment for parallel programming on networks of heterogeneous workstations.
In Proceedings of the Euro-Par'96 conference, August 1996.
You can read the abstract of the paper, or download its postscript version (31K).
Gilbert Cabillic and Isabelle Puaut.
Répartition de charge dans Stardust : un environnement pour l'exécution d'applications parallèles en milieu hétérogène.
Actes de l'école d'été Placement dynamique et répartition de charge, Collection didactique Inria, Presqu'île de Giens, France, July 1996.
You can read the abstract of the paper, or download its postscript version (25K).
Christine Morin and Isabelle Puaut.
A survey of recoverable distributed shared memory systems.
IEEE Transactions on Parallel and Distributed Systems, 8(9), September 1997.
Also appears as IRISA Technical Report No.975 (gzipped postscript, 86K), December 1995.
You can read the abstract of the paper.
Gilbert Cabillic, Gilles Muller and Isabelle Puaut.
The performance of consistent checkpointing in distributed shared memory systems.
In Proceedings of the 14th Symposium on Reliable Distributed Systems, September 1995.
You can read the abstract of the paper, or download its postscript version (57K).

Last updated: 17 02 2000
puaut@irisa.fr