![]() |
| Pr. Michel Raynal | Membre de l'Institut Universitaire de France |
Professional |
Scientific Achievements |
Scientific Achievements6 Scientific achievements: past
I list and comment here only a subset of my previous works. A reference of the type [Rx] refers to a journal paper listed in Section 11, while a reference of the type [Cy] refers to a paper that appeared in a conference and is listed in Section 12.
Books I consider that the eight books and the many surveys I have written (e.g., in journals [R31,R43,R47, R65, R68,R72,R85,R87] or conferences [C123,C149,C150,C164,C201,C229,C234]) are a part of my research contribution. Establishing new results is fundamental, but is only the front of the coin. The obverse of the coin consists in disseminating and transmitting them to the students and colleagues.
Very early research (until 1984) My early work was on operating systems and abstract data types. Then, I started working of communication systems with Gregor von Bochman [R5]. It is during this “warm up” research period that I became interested in distributed computing.
Failure-free distributed synchronization My very first interest in distributed computing has been the mutual exclusion problem. I wrote a book on that topic and designed (with J.-M. Helary and N. Plouzeau) one of the very first algorithms for arbitrary networks [R11]. A few years later, I designed (with Helary and Mostefaoui) a very general token-based mutex algorithm in which the token moves on an abstract tree that can be dynamically modified by an adversary daemon [R22]. (Interestingly, each of the mutex algorithms that uses a token moving along a tree corresponds to a particular behavior of the underlying daemon.) In the same spirit, I designed an algorithm for the h-out-of-k resource allocation problem [C30]. While working on that topic, I entered the domain of quorums, and (with M. Mizuno) I introduced a general method to define and compose quorums [C33]. I have also investigated the notion of k-arbiter [R39] to address the h-out-of-k resource allocation problem.
Detection of stable/unstable properties My interest in the detection of stable properties started with my first PODC paper (1987) [C12] that presents a very general distributed detection algorithm for a large class of stable properties (properties on system global states). These properties are such that, once true, they remain true forever. Then, I addressed the case of algorithms targeted for specific stable properties, mainly distributed termination detection [C42] and distributed deadlock detection [R30]. Then, motivated by a project on distributed debugging, I started working on the detection of unstable properties. Here, the additional difficulty comes from the fact that such a property can be satisfied only intermittently. My main contributions to this topic are described in [R26,R28,R41,R48,C43,C46,C56,C66]. Among them, [R28] was one of the very first papers addressing properties defined on the many control flows present in a distributed execution, while [R41] introduced the notion of inevitable global state and defined an algorithm to detect them (such a state is a state that is seen by all the sequential observers of the corresponding distributed execution). Finally the work described in [R48] concerns the detection of conjunction of local predicates; it presents one of the most efficient algorithms proposed so far to detect on the fly such predicates.
Data consistency Distributed computing involves distributed data. This part of my work has many facets. In [R51], Vijay Garg and I introduced the normality consistency condition. Its definition is based only on the local order of operations as perceived by each process and by each object. If each operation is on exactly one object, normality and linearizability7F8 are the same. Differently, when operations span several objects, normality is weaker than linearizability. I have also proposed several protocols to implements sequentially consistent memories [R74,C47,C54] and shown that sequential consistency can be seen as a form of “lazy linerarizability” [C148]. I also investigated (mainly with M. Ahamad from Georgia Tech) the notion of timed consistency for shared distributed objects. This work was published in the top conferences PODC and DISC [C97,C100] and in journals (e.g., [R71]).
Causal order An important part of my past work was on causality in message-passing systems and its applications. One of my early work on causality is a protocol to deliver messages according to the so-called causal order. This simple and elegant protocol (designed with A. Schiper and S. Toueg), that appeared in IPL [R17], is widely referenced and appears in several textbooks devoted to distributed computing.
Checkpointing Then, my interest in causality moved to the checkpointing problem, and more specifically to communication-induced checkpointing (CIC). This checkpointing technique allows the application messages to piggyback control information, but does not allow the use of additional control messages. In this context, we (I with mainly J.-M. Helary, A. Mostefaoui) produced several results, among which the following ones. - An important theoretical question is “Given an arbitrary set of local checkpoints, do these local checkpoints belong to the same consistent global checkpoint?” While this question has been answered by Netzer and Xu in the particular context of message passing systems8F9, we answered it in a very general asynchronous computational model that encompasses shared memory systems and various message passing systems with reliable or unreliable and point-to-point or multicast or broadcast communication. This result has been published in Acta Informatica [R45]. - A very general definition of global checkpoint consistency that appeared in IEEE Transactions on Software Engineering [R49]. - A communication-induced snapshot algorithm that appeared in IEEE TPDS [R53]. This algorithm is very general and can be instantiated in many ways. It shows that consistent global states can be determined without the help of additional control messages (differently from the well-known Chandy and Lamport’s snapshot algorithm that uses additional control messages called markers). - A family of algorithms that allows the processes to define independent local checkpoints in such a way that any local checkpoint is part of a consistent global checkpoint. This work appeared in Distributed Computing [R57]. - The impossibility to design scalar-based communication-induced checkpointing protocols that satisfy the Rollback-Dependency Trackability property. This work appeared in Information processing Letters [R61]. - A Characterization of the Rollback-Dependency Trackability property. This work appeared in Information and Computation [R62]. All this work participated in providing checkpointing with solid theoretical foundations.
Virtual precedence In [R70,C87], Helary, Mostefaoui and I introduced and investigated the concept of virtual precedence. The problem is the following. An interval of a sequential process is a sequence of consecutive events of that process. The set of intervals defined on a distributed computation defines an abstraction of this distributed computation, and the traditional causality relation on events induces a relation on the set of intervals. The question is then: “Is the interval-based abstraction associated with a distributed computation consistent?” To answer this question, this paper introduces the Interval Consistency (IC) condition. Intuitively, this condition states that an interval-based abstraction of a distributed computation is consistent if its precedence relation does not contradict the sequentiality of each process. Interestingly, the IC condition can be operationally characterized in terms of timestamps (whose values belong to a lattice). The paper uses this characterization to design a versatile protocol that, given intervals defined by a daemon whose behavior is unpredictable, breaks them (in a non trivial manner) in order to produce an abstraction satisfying the IC condition. (Among other problems, communication-induced checkpointing can benefit from IC.)
Optimal implementation of vector clocks The major parts of the previous protocols are based on vector clocks, whose size is equal to the number of processes. So, an important question is the following “Given a message m, which is the minimal subset of the entries of a vector clock that m has to piggyback in order to fully capture causality?” We answered this question by stating a necessary and sufficient condition, and showed how it can implemented. This is an important result as, for each message m, it states the minimal quantity of information m has to carry in order the vector clock system allows the processes to fully capture the causality relation [R57]. Moreover, we also designed the first (to our knowledge) distributed algorithm that computes on the fly the transitive reduction associated with the partial order defined by a distributed execution [R73,C144] (the transitive reduction captures exactly the minimal partial order associated with a distributed execution).
7 Main research contributions: close past and part of present
Since 10 years my research is mainly focused on algorithms for distributed agreement, with a recent incursion in distributed computability. One of my first work in distributed agreement is a (co-authored) paper titled “From group communications to transactions in distributed systems” that appeared in a 1996 special issue of Communications of the ACM devoted to “group communication” [R36]. Then, I became interested in the consensus problem, its variants such as the k-set agreement, and the design of protocols implementing failure detectors.
I designed several consensus algorithms in asynchronous message-passing systems or shared memory systems for the crash failure model [R54,R67,R69,R82,R83,R95,C110,C134] or the Byzantine failure model [R79,R89,R90]. More specifically, [C110] (DISC 1999) presents a very simple consensus algorithm that is generic in the sense that it can be instantiated with any failure detector of any failure detector class as defined by Chandra and Toueg9F10. This algorithm was the first failure detector-based consensus algorithm that uses quorums in an explicit way to ensure the agreement property. Interestingly enough, the pattern on which this algorithm is based is very similar to the adopt/commit pattern proposed by Gafni11. Many consensus algorithms proposed after ours, follows our pattern. Similarly, the leader-based consensus protocol described in [R67], that was one of the very first Omega-based protocol, is widely referenced in the literature. I have been one of the very first researcher to investigate the situations where consensus can be solved in one communication step. The corresponding paper [C134] is also widely referenced. In the same spirit, a transformation from binary consensus to multi-valued consensus is presented in [R58]. Very recently, I have revisited (with Y. Moses) the simultaneous consensus problem [C228] and shown that it cannot benefit from the condition-based approach. “Simultaneous” means that the processes that decide have to decide during the very same round. I have investigated the use of consensus to solve agreement problems such as total order multicast in overlapping groups [R60], atomic broadcast in crash/recovery systems whose aim is to implement quorumbased replication [R77], and how to save consensus executions when one has to implement atomic broadcast [C124].
I have addressed several directions of research related to failure detectors.
Computing with failure detectors In [R59,R76] (with co-authors) I investigated the Global Data Computation problem (also known under the name Interactive Consistency)12. It consists in providing each processwith the same vector (with one entry per process) such that each entry is filled with the value supplied bythe corresponding process if it has not crashed, and its value or a default value otherwise. The paper [R76] presents analgorithm, based on a perfect failure detector, that requires the processes to execute at most min(f +2, t+1) rounds (where t is the model upper bound on the number of processes that can crash, and f the actual numberof crashes). This showed that solving that problem in an asynchronous system enriched with a perfectfailure detector is not more expensive than solving it in a synchronous system.
From one failure detector to another one With A. Mostefaoui, I introduced the notion of failure detectors with limited scope accuracy [C111,C121] (they are failure detectors whose scope is limited to a subset of x has of the n processes). [R81] presents a necessary and sufficient condition that allows transforming a failure detector with limited scope accuracy into its non-limited scope counterpart. This condition states that the scope x of the failure detector has to be greater than t (the upper bound on the number of processes that can crash). Other failure detector transformations are described in [R56] and [R94]. This last JPDC paper presents a transformation that is quiescent (after some finite time, the transformation does not require message to be exchanged).
On the weakest failure detector classes I have shown in [R97] that there is no one-shot agreement problem for which the failure detector class <>P is the weakest that would allow it to be solved (this is the class of eventually perfect failure detectors: after some finite time they do suspect only the processes that have crashed). The papers [R103,C198] study classes of failure detectors whose power can be added. The aim is here to define an “arithmetic” of failure detector classes.
Implementing failure detectors Traditional implementations of failure detectors consider that the system satisfy additional synchrony assumptions. In [C160,R93] I proposed (with Mostefaoui and Mourgaya) a totally new approach to implement failure detectors. That approach relies on the pattern of messages exchanged by the processes. Basically, this novel assumption requires that there is a process whose answers to queries are among the n-t first answers received by any querying process. I have then shown that this novel kind of assumption can be combined with timing assumptions associated with a subset of the channels [R92]. It is important to see that such an approach favors the assumption coverage. In [R113] is presented a weak timing assumption that allows the construction of an eventual leader service (Omega) in an asynchronous shared memory system, while [C221] considers the case of a message-passing system. Unifying these two approaches is still an open problem.
The consensus problem is one of the most important problem in asynchronous systems prone to failures: it captures the difficulty of coordinating independent entities when these entities can fail. Each process proposes a value, and all the non-faulty processes have to agree on the same value that has to be one of the proposed values. Unfortunately, despite its very simple formulation, there is no deterministic algorithm that can solve the consensus problem as soon as (even only) one process can crash. This is the famous FLP impossibility result11F13. Until 2000, mainly three approaches were proposed to circumvent this impossibility. One consists in looking for a non-deterministic solution: a process can draw random numbers12F14. Another consists in enriching the system with a failure detector of an appropriate class. The last one, that concerns shared-memory systems, consists in enriching the system with operations stronger than the base read and write operations (e.g. a Compare&Swap() operation). This approach has given rise to the notion of consensus number and to Herlihy’s hierarchy13F15. In 2000, with S. Rajsbaum and A. Mostefaoui (during a visit of Sergio R. at IRISA) we started a collaboration that proved to be very fruitful, namely we introduced a novel way of circumventing the FLP impossibility result. This new approach does not consist in enriching the underlying system, but in restricting the set of input vectors (such a restricted set defines a condition). The main challenge is then to characterize the largest conditions that allows solving consensus in presence of up to t process crashes. This is the problem solved in [R78]. Then, [R80] establishes a hierarchy of conditions and presents corresponding efficient algorithms. While the result in [R78] establishes the frontier from which a condition allows solving consensus despite asynchrony and up to t process crashes, we showed in [R91] that a condition can allow expediting consensus in a synchronous system. Said differently, [R91] extends to synchronous systems the hierarchy of conditions, namely, a condition that allows solving consensus despite t crashes in an asynchronous system, allows consensus to be solved optimally (i.e., in two rounds) in a synchronous system: decidability in an asynchronous system can be converted in efficiency in a synchronous system. In a very interesting way, we have shown that there is a very strong connection between error-correcting code and distributed agreement problems [R98] (intuitively, an input vector encodes a value that has to be decided by the processes). We have also investigated the combined power of conditions and failure detectors to solve agreement problems [R105]. Finally, with my PhD student F. Bonnet, I have recently extended the condition-based approach to address the k-set agreement problem [R110]. This provides us with a very general condition-based framework.
The k-set agreement problem is a weakening of the consensus in the sense that up to k different values can be decided. I worked on this problem in failure prone synchronous and asynchronous systems.
Set agreement in synchronous systems The main result in this type of systems is described in [R112], where the fault model is the general omission failure model. A new termination property is introduced. Let a good process be process that neither crashes nor commits receive omission failure. The new property, called strong termination, obliges the good processes to decide. A k-set algorithm is presented where every good process decides and halts by round min(f/k+2,t/k+1). This algorithm is clearly optimal. In addition to the strong termination property, (as far as I know) this algorithm is the only early-deciding and stopping k-set agreement algorithm proposed so far for general omission failures. A survey of synchronous set agreement is presented in [201].
Set agreement in asynchronous systems Differently from synchronous systems, k-set agreement cannot be solved in asynchronous systems when k<=t. As already indicated, [C121] investigates failure detectors with limited scope accuracy to solve k-set agreement, and presented several protocols. Differently, [C131] considers that the additional power is supplied by random numbers. [R105,C184] add the power of appropriate failure detector classes to conditions in order to solve k-set agreement. In the context of shared memory systems, the invited paper [C205] posed the problem of the weakest failure detector for solving the k-set agreement problem. At the rump session of PODC 2007, I posed the following conjecture at the rump session “Anti-Omega_k is the weakest failure detector for k-set agreement in shared memory systems”. This conjecture has been positively answered by three independent groups of researchers (PODC 2009). I am now working for the weakest failure detector for k-set agreement in message-passing systems.
8 Scientific achievements: current research
Whatever their granularity, today distributed applications are pervasive and benefit everyone (e.g., P2P, cloud computing, sensors networks, or social networks for “large grain” applications, and multicore for “small grain” applications). All these applications are becoming larger and larger and more and more distributed. The development of such platforms and their usage have somehow preceded their theoretical foundations. Up to now, their design principles look sometimes more like “tricks” than well-mastered basic principles. The explosion of the number of distributed applications and the number of “computing adversaries” such as scaling, misbehaviors (also characterized as malicious behaviour when referring to entities attempting to voluntarily or not hurt the system), dynamicity, etc., makes their basic principles more and more difficult to grasp. Traditional algorithms simply do not fit this challenging new setting and it is required to revisit the field. Hence, research addressing distributed computing theory that can benefit future applications is more needed than ever. This is a great challenge for the computer science community. My current research contributes to it.
In addition to asynchrony and failures, I started recently to consider two new “adversaries”, namely, the dynamicity of the system (processes can join and leave), and anonymity (the processes have no identity). I do think that dynamicity and anonymity are inescapable facets of uncertainty that distributed computing has to address in the near future.
Facing dynamicity Concerning dynamicity, my first results concern the election of an eventual leader [C185]. This work shows that a subset of the processes (a core) has to remain long enough in the system in order one of them can be elected. This entailed a research on timed quorums systems for large scale dynamic environments [C222]. From a different point of view, [C214] investigates the properties required from the dynamically changing underlying communication graph. Very recently, I addressed dynamic systems such as small worlds [C220], and sensors networks from which a geometric structure has to emerge without the help of external devices such as beacons [C232].
Facing anonymity in agreement problems On the anonymity side, together with my PhD student F. Bonnet, I very recently showed that anonymity doubles the price when one has to solve consensus [C242] (the number of rounds is multiplied by 2). We proved the corresponding lower bound, namely min(2f +2, 2t + 1), and extended our results to the k-set agreement problem.
Anonymous robots on partial grids In 2008, I started working (with R. Baldoni from La Sapienza, Roma) on anonymous robots that have to synchronize their moves to traverse an incomplete grid. Our work has two complementary facets. One concerns computability (which assumptions are necessary and sufficient in order the robots can move), while the other concerns the design of the algorithms each robot has to execute in order to traverse infinitely often the partial grid (without colliding on a vertex or an edge of the grid). Our preliminary results are described in [R104,C229]. A main difficulty comes from the fact that the robots cannot communicate among themselves, they can only see a part of the grid (defined by their current location and a fixed radius).
Software transactional memory I started becoming interested in software transactional memories (STM) in 2007. Basically, the problem consists in discharging the application programmer from the management of the underlying synchronization in multiprocess programs that access shared objects. From a practical point of view, STM systems are one of the most promising approach to take up challenge posed by the recent advance of multicore architectures and the deployment of multiprocessors as the mainstream computing platforms. From a theoretical point of view, STM systems give a new impetus that forces to rethink the way synchronization problems have to be solved (basically, synchronization is coming back, but it is not the same [C224]). My initial contributions concern the following points. While the fate of a transaction is to commit or abort, no current STM specification states situations where the STM system is forced to commit a transaction. In [C231], with my PhD student D. Imbs, I introduced a new property called obligation that specifies situations where a transaction cannot be aborted. A corresponding STM protocol is presented and formally proved correct. The second contribution is the definition of a general framework to state consistency conditions suited to STM systems [C236]. This framework not only encompasses serializability, strict serializability and opacity (that are “traditional” consistency conditions), but permits to define new meaningful consistency conditions. Among them the (new) virtual world condition is particularly interesting: it is less restrictive than than opacity, while requiring that (whatever the fact that it commits or aborts) any transaction always reads values from a consistent global state.
Wait-free synchronization As far as the renaissance of synchronization is concerned, D. Imbs and I designed a new wait-free algorithm for partial snapshot [C242]. This algorithm is more efficient than the algorithms previously proposed: (1) an update helps only the concurrent snapshot invocations that conflict with it, and (2) the values returned by a partial snapshot are “as fresh as possible”.
I started becoming interested in distributed computability in 2006. I worked on this topic mainly with E. Gafni, Rajsbaum and C. Travers (who, at that time, was one of my PhD students).
Committee decision and renaming My first (co-authored) contribution was the definition of the committee decision problem [C192] that was later generalized to the notion of simultaneous consensus task [C204].The question answered is the following: “What is the power of the task where processes are involvedin k simultaneous consensus instances and each is required to decide in only one of them (distinctprocesses deciding possibly in different instances)?” My second contribution is a wait-free k-Test&Set-based adaptive renaming algorithm [C200]. The paper shows that the new space name can be reduced from 2p-1 (where p is the number of participating processes) to 2-(p/k). This work encouraged us to investigate the relation linking Test&Set, adaptive agreement and set agreement. This investigation resulted in [C216,C223]. As Test&Set, adaptive agreement and set agreement are sub-consensus tasks (i.e., they are weaker than consensus), the holy grail is here to establish a hierarchy of subconsensus tasks (if any), similarly to the hierarchy of consensus tasks as defined by Herlihy. This line of research is fundamental if we want to understand the power of base synchronization primitives. In some sense, the research agenda is here to establish the “Mendeleiev’s table” of sub-consensus tasks. A very early step in that direction appears in [R106]. I also proposed and investigated (with G. Taubenfeld) the notion of timed register [C208]. A timed register generalizes the notion of an atomic register as follows: if a process invokes two consecutive operations on the same timed register which are a read followed by a write, then the write operation is executed only if it is invoked at most d time units after the read operation, where d is defined as part of the read operation. We show that a timed register is a universal object (i.e., an object from which any object defined by a sequential specification can be built despite asynchrony and failures)16. [C223] explores a new direction to solve the k-set agreement problem in a synchronous system of n processes. It considers that the system is enriched with base objects (denoted that allow solving the g-set agreement problem in a set of m processes (m < n). This work has several contributions. It first proposes a synchronous k-set agreement algorithm that benefits from such underlying base objects. This algorithm requires O((t*g) /(m*k)) rounds, more precisely, t/Delta +1 rounds where , Delta= m (k/l) + (k mod g). It also shows that this bound, that involves all the parameters that characterize both the problem (k) and its environment (t, m and g) is a lower bound. This work is then extended to the early deciding case. It presents a k-set agreement algorithm that directs the processes to decide and stop by round min((f/Delta)+2,(t/Delta)+1). These bounds generalize the bounds previously established for solving the k-set agreement problem in pure synchronous systems.
Formal models for distributed computing The Iterated Immediate Snapshot (IIS) model has been introduced by E. Borowsky and E. Gafni14F17. It is an asynchronous computation model where processes communicate through a sequence of one-shot immediate snapshot objects. It is known that this model is equivalent to the usual asynchronous read/write shared memory model, for wait-free task solvability. Its interest lies in the fact that its runs are more structured and easier to analyze than the runs in the shared memory model. As the IIS model and the shared memory model are equivalent for wait-free task solvability, a natural question is the following: “Are they still equivalent for wait-free task solvability, when they are enriched with the same failure detector?” Rajsbaum, Travers and I showed in [R100] that the answer to this question is “no.” At first glance, this answer can appear counter-intuitive. So, the next question is the following: “Given a shared memory model enriched with a failure detector, what is an equivalent IIS model?” [C226] shows that an elegant way of capturing the power of a failure detector and other partially synchronous systems in the IIS model is by restricting appropriately its set of runs, giving rise to the Iterated Restricted Immediate Snapshot model (IRIS). 8Herlihy M.P. andWing J.M., Linearizability: a Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12(3):463-492, 1990. 9Netzer, R.H.B. and Xu, J., Necessary and Sufficient Conditions for Consistent Global Snapshots, IEEE Transactions on Parallel and Distributed Systems, 6(2):165-169, 1995.
10Chandra T.D. and Toueg S., Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2):225-267, 1996. 11Gafni E., Round-by-round Fault Detectors: Unifying Synchrony and Asynchrony. Proc. 17th ACM Symposium on Principles of Distributed Computing (PODC’00), ACM Press, pp. 143-152, 1998.
12 Pease L., Shostak R. and Lamport L., Reaching Agreement in Presence of Faults. Journal of the ACM, 27(2):228-234, 1980. 13Fischer M.J., Lynch N.A. and Paterson M.S., Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2):374-382, 1985. 14Ben-Or M., Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols. Proc. 2nd ACM Symposium on Principles of Distributed Computing (PODC’83), pp. 27-30, 1983. Rabin M., Randomized Byzantine Generals. Proc. 24th IEEE Symposium on Foundations of Computer Science (FOCS’83), IEEE Computer Press, pp. 403-409, 1983. 15Herlihy M.P.,Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, 1991.
16 Herlihy M.P., Wait-Free Synchronization. ACM Trans. on Programming Languages and Systems, 13(1):124-149, 1991. 17Borowsky E. and Gafni E., Immediate Atomic Snapshots and Fast Renaming. Proc. 12th Principles of Distributed Computing (PODC’93), ACM Press, pp. 41-51, 1993.
|