POM: a Parallel Observable Machine

POM is a virtual parallel machine featuring mechanisms for observing distributed applications. Its primary goal is to mask the specificities of the various communication kernels of today's machines without significant performance degradation. It comes in the form of a library built upon the communication kernels available on current parallel architectures, together with a loader that gives the user a homogeneous syntax for launching parallel applications on any parallel platform. The communication services offered by POM are basic, but their semantics are clearly defined and they can be implemented easily and efficiently. POM is thus especially well suited to the design of applications for which performance is the primary concern.

POM defines a model of virtual machine that consists of a set of application nodes that communicate via two distinct networks: the first one is dedicated to point-to-point communications while the second one is used for broadcasting messages.
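
As an illustration of this model only, the sketch below simulates a set of application nodes connected by two separate networks inside a single Python process. It is not POM's API: the class `VirtualMachine` and its methods are hypothetical names, and the choice not to deliver a broadcast back to its sender is an assumption made for the example.

```python
from collections import deque


class VirtualMachine:
    """Toy model (hypothetical, not POM's API): N application nodes,
    one point-to-point network and one broadcast network."""

    def __init__(self, node_count):
        self.node_count = node_count
        # One inbox per node and per network, so the two kinds of
        # traffic never mix.
        self.p2p_inbox = [deque() for _ in range(node_count)]
        self.bcast_inbox = [deque() for _ in range(node_count)]

    def send(self, src, dst, payload):
        """Point-to-point network: deliver to exactly one node."""
        self.p2p_inbox[dst].append((src, payload))

    def broadcast(self, src, payload):
        """Broadcast network: deliver to every node except the sender
        (an assumption of this sketch)."""
        for node in range(self.node_count):
            if node != src:
                self.bcast_inbox[node].append((src, payload))

    def receive(self, node):
        """Pop the oldest point-to-point message for `node`, or None."""
        inbox = self.p2p_inbox[node]
        return inbox.popleft() if inbox else None

    def receive_broadcast(self, node):
        """Pop the oldest broadcast message for `node`, or None."""
        inbox = self.bcast_inbox[node]
        return inbox.popleft() if inbox else None
```

Keeping the two inboxes separate mirrors the idea that point-to-point and broadcast traffic travel on distinct networks and are received independently.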

POM also provides sophisticated observation mechanisms. It can be seen as a convenient facility for interfacing a distributed application with trace analysers and graphical viewers. The observation mechanisms are incorporated at a low level in the library so as to be as unintrusive as possible.

The observation technique favoured is based on the analysis of execution traces rather than on direct observation of distributed applications. Besides the application nodes, the virtual machine can optionally include a complementary observation node whose role is to collect and handle trace information about the behaviour of the application. The observation node can analyse the information it receives on the fly, or it can store this information for post-mortem analysis.
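
The two modes of the observation node can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class `ObservationNode`, the event fields (`node`, `kind`, `detail`) and the JSON-lines dump format are hypothetical and do not describe POM's actual trace format.

```python
import json


class ObservationNode:
    """Hypothetical collector: analyses trace events on the fly when a
    callback is given, otherwise stores them for post-mortem analysis."""

    def __init__(self, on_the_fly=None):
        self.on_the_fly = on_the_fly  # optional callback, called per event
        self.stored = []

    def collect(self, node_id, kind, detail):
        """Receive one trace event from an application node."""
        event = {"node": node_id, "kind": kind, "detail": detail}
        if self.on_the_fly is not None:
            self.on_the_fly(event)      # on-the-fly analysis
        else:
            self.stored.append(event)   # kept for post-mortem analysis

    def dump(self):
        """Serialise stored events, one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.stored)
```

A collector created with a callback never accumulates events, which keeps memory use flat during long runs; one created without a callback trades memory for the ability to replay the whole trace afterwards.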

POM offers several dating mechanisms. Their management remains fully transparent to the application programmer, and each can be enabled or disabled separately when the application is loaded. Traced events can be stamped, dated, or both, and dating can use either a local or a global time reference. Stamping events (using vector or adaptive stamps) makes it possible to analyse the synchronisations that occur between the application nodes during a distributed execution. POM also incorporates a mechanism for dating events globally. It relies on a non-intrusive statistical method that estimates the drift of each application node's physical clock with respect to a reference clock.
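
To show why vector stamps reveal synchronisations, here is a minimal sketch of the classical vector clock technique. It is a textbook formulation, not POM's implementation; the names `VectorStamp`, `happened_before` and `concurrent` are hypothetical, and adaptive stamps are not covered.

```python
class VectorStamp:
    """Textbook vector clock for one node among `node_count` nodes."""

    def __init__(self, node_id, node_count):
        self.node_id = node_id
        self.clock = [0] * node_count

    def local_event(self):
        """Advance this node's own component and return the stamp."""
        self.clock[self.node_id] += 1
        return tuple(self.clock)

    def send(self):
        """Sending is an event; the returned stamp travels with the message."""
        return self.local_event()

    def receive(self, stamp):
        """Merge the sender's stamp (component-wise max), then count
        the reception itself as a local event."""
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        return self.local_event()


def happened_before(a, b):
    """True if the event stamped `a` causally precedes the one stamped `b`."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    """True if neither event causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)
```

For example, if node 0 sends a message that node 1 receives, the send stamp precedes the receive stamp, whereas an unrelated event on node 2 is concurrent with both. A trace analyser can use exactly this comparison to reconstruct the causal order of events without any global clock.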

To date, POM has been ported to several platforms, including the Intel iPSC/2 and Paragon XP/S machines and a network of Sun Sparc workstations. This allowed us to check its effective portability, and it is now part of the various parallel programming environments developed in our laboratory. In the future, we are considering porting POM to new platforms, such as the Cray T3D and the IBM SP1.

PostScript document available: F. Guidec, Y. Mahéo, "POM: a Virtual Parallel Machine Featuring Observation Mechanisms". Internal publication #902, IRISA, January 1995. (Also available as a research report, Inria #2473.)

($Date: 1997/01/03 13:13:29 $)