SCIENTIFIC ACHIEVEMENTS
1. Scientific achievements: 1984-2000
I list and comment here only a a subset of my previous works. A reference of the type [Rx] refers to a peer-reviewed journal paper listed in the publication lists (sections 6.7-6.13) the end of this section, while a reference of the type [Cy] refers to a paper that appeared in a peer-reviewed conference.
Books
I consider that the books and the many surveys I have written (e.g., in journals [R31,R43,R47, R65, R68,R72,R85,R87] or conferences [C123,C149,C150,C164,C201,C229,C234]) are a part of my research contribution. Establishing new results is fundamental, but is only the front of the coin. The obverse of the coin consists in disseminating and transmitting them to the students and colleagues.
1.1 Early research
□ Very early research
My early work was on operating systems and abstract data types. Then, I started working of communication systems with Gregor von Bochmann [R5]. It is during this “warm up” research period that I became interested in distributed computing.
□ Failure-free distributed synchronization
My very first interest in distributed computing has been the mutual exclusion problem. I wrote a book on that topic and designed (with J.-M. Helary and N. Plouzeau) one of the very first algorithms for arbitrary networks [R11]. A few years later, I designed (with Helary and Mostefaoui) a very general token-based mutex algorithm in which the token moves on an abstract tree that can be dynamically modified by an adversary daemon [R22]. (Interestingly, each of the mutex algorithms that uses a token moving a tree corresponds to a particular behavior of the underlying daemon.)
In the same spirit, I designed an algorithm for the h-out of-k resource allocation problem [C30]. While working on that topic, I entered the domain of quorums, and (with M. Mizuno) I introduced a general method to define and compose quorums [C33]. I also investigated the notion of k-arbiter [R39] to address the h-out of-k resource allocation problem.
□ Detection of stable/unstable properties
My interest in the detection of stable properties started with my first PODC paper (1987) [C12] that presents a very general distributed detection algorithm for a large class of stable properties (properties on system global states). These properties are such that, once true, they remain true forever. Then, I addressed the case of algorithms targeted for specific stable properties, mainly distributed termination detection [C42] and distributed deadlock detection [R30].
Then, motivated by a project on distributed debugging, I started working on the detection of unstable properties. Here, the additional difficulty comes from the fact that such a property can be satisfied only intermittently. My main contributions to this topic are described in [R26,R28,R41,R48,C43,C46,C56,C66]. Among them, [R28] was one of the very first papers addressing properties defined on the many control flows present in a distributed execution, while [R41] introduced the notion of inevitable global state and defined an algorithm to detect them (such a state is a state that is seen by all the sequential observers of the corresponding distributed execution). Finally the work described in [R48] concerns the detection of conjunction of local predicates; it presents one of the most efficient algorithms proposed so far to detect on the fly such predicates.
□ Data consistency
Distributed computing involves distributed data. This part of my work has many facets. In [R51], Vijay Garg and I introduced the normality consistency condition. Its definition is based only on the local order of operations as perceived by each process and by each object. If each operation is on exactly one object, normality and linearizability are the same. Differently, when operations span several objects, normality is weaker than linearizability. (Linearizability is a classical correctness condition for concurrent objects, introduced and formalized by M. Herlihy and J. Wing in ACM TOPLAS, 12(3):463-492, 1990.)
I have also proposed several protocols to implements sequentially consistent memories [R74,C47,C54] and shown that sequential consistency can be seen as a for of lazy linerarizability [C148]. I also investigated (mainly with M. Ahamad from Georgia Tech) the notion of timed consistency for shared distributed objects. This work was published in the top conferences PODC and DISC [C97,C100] and in journals (e.g., [R71]).
1.2 Causality, checkpointing, virtual precedence and vector clocks
□ Causal order
An important part of my past work was on causality in message-passing systems and its applications. One of my early work on causality is a protocol to deliver messages according to the so-called causal order. This simple and elegant protocol (designed with A. Schiper and S. Toueg), that appeared in IPL [R17], is widely referenced and appears in several textbooks devoted to distributed computing.
□ Checkpointing
Then, my interest in causality moved to the checkpointing problem, and more specifically to communication-induced checkpointing (CIC). This checkpointing technique allows the application messages to piggyback control information, but does not allow the use of additional control messages. In this context, We (J.-M. Helary, A. Mostefaoui), and I) produced several results, among which the following ones.
• An important theoretical question is “Given an arbitrary set of local checkpoints, do these local checkpoints belong to the same consistent global checkpoint?” While this question has been answered by Netzer and Xu in the particular context of message passing systems, we answered it in a very general asynchronous computational model that encompasses shared memory systems and various message passing systems with reliable or unreliable and point-to-point or multicast or broadcast communication. This result has been published in Acta Informatica [R45].
• A very general definition of global checkpoint consistency that appeared in IEEE Transactions on Software Engineering [R49].
• A communication-induced snapshot algorithm that appeared in IEEE TPDS [R53]. This algorithm is very general and can be instantiated in many ways. It shows that consistent global states can be determined without the help of additional control messages (differently from the well-known Chandy and Lamport’s snapshot algorithm that uses additional control messages called markers).
• A family of algorithms that allows the processes to define independent local checkpoints in such a way that any local checkpoint is part of a consistent global checkpoint. This work appeared in Distributed Computing [R57].
• The impossibility to design scalar-based communication-induced checkpointing protocols that satisfy the Rollback-Dependency Trackability property. This work appeared in Information processing Letters [R61].
• A Characterization of the Rollback-Dependency Trackability property. This work appeared in Information and Computation [R62].
All this work participated in providing checkpointing with solid theoretical foundations.
□ Virtual precedence
In [R70,C87], Hélary, Mostefaoui, and I introduced and investigated the concept of virtual precedence. The problem is the following. An interval of a sequential process is a sequence of consecutive events of this process. The set of intervals defined on a distributed computation defines an abstraction of this distributed computation, and the traditional causality relation on events induces a relation on the set of intervals. The question is then: “Is the interval-based abstraction associated with a distributed computation consistent?” To answer this question, we introduced the Interval Consistency (IC) condition. Intuitively, this condition states that an interval-based abstraction of a distributed computation is consistent if its precedence relation does not contradict the sequentiality of each process. Interestingly, the IC condition can be operationally characterized in terms of timestamps (whose values belong to a lattice). The paper uses this characterization to design a versatile protocol that, given intervals defined by a daemon whose behavior is unpredictable, breaks them (in a non trivial manner) in order to produce an abstraction satisfying the IC condition.
□ Optimal implementation of vector clocks
The major parts of the previous protocols are based on vector clocks, whose size is equal to the number of processes. So, an important question is the following “Given a message m, which is the minimal subset of the entries of a vector clock that m has to piggyback in order to fully capture causality?” We answered this question by stating a necessary and sufficient condition, and showed how it can implemented. This is an important result as, for each message m, it states the minimal quantity of information m has to carry in order the vector clock system allows the processes to fully capture the causality relation [R57]. Moreover, we also designed the first (to our knowledge) distributed algorithm that computes on the fly the transitive reduction associated with the partial order defined by a distributed execution [R73,C144] (the transitive reduction captures exactly the minimal partial order associated with a distributed execution).
2. Main research contributions: 2000-2010
One of my first work in distributed agreement is a (co-authored) paper titled “From group communications to transactions in distributed systems” that appeared in a 1996 special issue of Communications of the ACM devoted to “group communication” [R36]. Then, I became interested in the consensus problem, its variants such as the k-set agreement, and the design of protocols implementing failure detectors.
2.1 Consensus
I designed several consensus algorithms in asynchronous message-passing systems or shared memory systems for the crash failure model [R54,R67,R69,R82,R83,R95,C110,C134] or the Byzantine failure model [R79,R89,R90].
More specifically, [C110] (DISC 1999) presents a very simple consensus algorithm that is generic in the sense that it can be instantiated with any failure detector of any failure detector class as defined by Chandra and Toueg. This algorithm was the first failure detector-based consensus algorithm that uses quorums in an explicit way to ensure the agreement property. Interestingly, the pattern on which this algorithm is based is very similar to the adopt/commit pattern proposed by Gafni. Many consensus algorithms proposed after ours, follows our pattern. Similarly, the leader-based consensus protocol described in [R67], that was one of the very first Omega-based protocol, is widely referenced in the literature.
I have been one of the very first researcher to investigate the situations where consensus can be solved in one communication step. The corresponding paper [C134] is also widely referenced. In the same spirit, a transformation from binary consensus to multi-valued consensus is presented in [R58].
I have revisited (with Y. Moses) the simultaneous consensus problem [C228] and shown that it cannot benefit from the condition-based approach. “Simultaneous” means that the processes that decide have to decide during the very same round. I have investigated the use of consensus to solve agreement problems such as total order multicast in overlapping groups [R60], atomic broadcast in crash/recovery systems whose aim is to implement quorum-based replication [R77], and how to save consensus executions when one has to implement atomic broadcast [C124].
2.2 Failure detectors
I have addressed several directions of research related to failure detectors.
□ Computing with failure detectors
In [R59,R76] (with co-authors) I investigated the Global Data Computation problem (also known under the name Interactive Consistency 12 ). It consists in providing each process with the same vector (with one entry per process) such that each entry is filled with the value supplied by the corresponding process if it has not crashed, and its value or ⊥ otherwise. The paper [R76] presents an algorithm, based on a perfect failure detector, that requires the processes to execute at most min(f +2, t+1) rounds (where t is the model upper bound on the number of processes that can crash, and f the actual number of crashes). This showed that solving that problem in an asynchronous system enriched with a perfectfailure detector is not more expensive than solving it in a synchronous system.
□ From one failure detector to another one
With A. Mostefaoui, I introduced the notion of failure detectors with limited scope accuracy [C111,C121] (they are failure detectors whose scope is limited to a subset of x ≤ n processes). [R81] presents a necessary and sufficient condition that allows transforming a failure detector with a limited scope accuracy into its non-limited scope counterpart. This condition states that the scope x of the failure detector has to be greater than t (the upper bound on the number of processes that can crash).
Other failure detector transformations are described in [R56] and [R94]. This last JPDC paper presents a transformation that is quiescent (after some finite time, the transformation does not require message to be exchanged).
□ On weakest failure detector classes
[R97] shows that there is no one-shot agreement problem for which the failure detector class <>P is the weakest that allows to solve it (the class <>P is the class of eventually perfect failure detectors: after some finite time they do suspect only the processes that have crashed).
The papers [R103,C198] study classes of failure detectors whose power can be added. The aim is here to define an “arithmetic” of failure detector classes.
□ Implementing failure detectors
Traditional implementations of failure detectors consider that the system satisfy additional synchronassumptions. In [C160,R93] I proposed (with Mostefaoui and Mourgaya) a totally new approach to implement failure detectors. That approach relies on the pattern of messages exchanged by the processes. Basically, this novel assumption requires that there is one (or several) process(es) whose answers to queries are among the n−t first answers received by the process that issued the query. I have then shown that this novel kind of assumption can be combined with timing assumptions associated with a subset of the channels [R92]. It is important to see that such an approach favors the assumption coverage.
In [R113] is presented a weak timing assumption that allows the construction of an eventual leader service (Omega) in an asynchronous shared memory system, while [C221] considers the case of a message- passing system. Unifying these two approaches is still an open problem.
2.3 The condition-based approach
The consensus problem is one of the most important problem in asynchronous systems prone to failures: it captures the difficulty of coordinating independent entities when these entities can fail. Each process pro- poses a value, and all the non-faulty processes have to agree on the same value that has to be one of the proposed values. Unfortunately, despite its very simple formulation, there is no deterministic algorithm that can solve the consensus problem as soon as (even only) one process can crash. This is the famous FLP impossibility result. Until 2000, mainly three approaches were proposed to circumvent this impossibility. One consists in looking for a non-deterministic solution: a process can draw random numbers. Another consists in enriching the system with a failure detector of an appropriate class. The last one, that concerns shared-memory systems, consists in enriching the system with operations stronger than the base read and write operations (e.g. a Compare&Swap() operation). This approach has given rise to the notion of consensus number and to Herlihy’s hierarchy.
In 2000, with S. Rajsbaum and A. Mostefaoui (during a visit of Sergio R. at IRISA) we started a collaboration that proved to be very fruitful, namely we introduced a novel way of circumventing the FLP impossibility result. This new approach does not consist in enriching the underlying system, but in restricting the set of input vectors (such a restricted set defines a condition). The main challenge is then to characterize the largest conditions that allows solving consensus in presence of up to t process crashes. This is the problem solved in [R78]. Then, [R80] establishes a hierarchy of conditions and presents corresponding efficient algorithms.
While the result in [R78] establishes the frontier from which a condition allows solving consensus despite asynchrony and up to t process crashes, we showed in [R91] that a condition can allow expediting consensus in a synchronous system. Said differently, [R91] extends to synchronous systems the hierarchy of conditions, namely, a condition that allows solving consensus despite t crashes in an asynchronous system, allows solving consensus optimally (in two rounds) in a synchronous systems: decidability in asynchronous system can be converted in efficiency in a synchronous system.
In a very interesting way, we have shown that there is a very strong connection between error-correcting
code and distributed agreement problems [R98] (intuitively, an input vector encodes a value that has to be decided by the processes). We have also investigated the combined power of conditions and failure detectors to solve agreement problems [R105].
Finally, with my PhD student F. Bonnet, I have extended the condition-based approach to address the k-set agreement problem [R110]. This provides us with a very general condition-based framework.
2.4 k-Set agreement in synchronous/asynchronous systems
The k-set agreement problem is a weakening of the consensus in the sense that up to k different values can be decided. I worked on this problem in failure prone synchronous and asynchronous systems.
□ Set agreement in synchronous systems
The main result in this type of systems is described in [R112], where the fault model is the general omission failure model. A new termination property is introduced. Let a good process be process that neither crashes nor commits receive omission failure. The new property, called strong termination, obliges the good processes to decide. A k-set algorithm is presented where every good process decides and halts by round min( f/k+2, k/t +1). This algorithm is clearly optimal.
In addition to the strong termination property, (as far as I know) this algorithm is the only early-deciding and stopping k-set agreement algorithm proposed so far for the general omission failure. A survey of synchronous set agreement is presented in [201].
□ Set agreement in asynchronous systems
Differently from synchronous systems, k-set agreement cannot be solved in asynchronous systems when k ≤ t. As already indicated, [C121] investigates failure detectors with limited scope accuracy to solve k-set agreement, and presented several protocols. Differently, [C131]
considers that the additional power is supplied by random numbers. [R105,C184] add the power of appropriate failure detector classes to conditions in order to solve k-set agreement. In the context of shared memory systems, the invited paper [C205] posed the problem of the weakest failure detector for solving the k-set agreement problem. At the rump session of PODC 2007, I posed the following conjecture at the rump session “Ω k is the weakest failure detector for k−set agreement in shared memory systems”. This conjecture has been positively answered by three independent groups of researchers (PODC 2009).
2.5 Distributed computability
I started becoming interested in distributed computability in 2006. I worked on this topic mainly with E. Gafni, S. Rajsbaum, and C. Travers (who, at that time, was one of my PhD students).
□ Committee decision and renaming
My first (co-authored) contribution was the definition of the committee decision problem [C192] that was later generalized to the notion of simultaneous consensus task [C204].The question answered is the following: “What is the power of the task where processes are involved in k simultaneous consensus instances and each is required to decide in only one of them (several processes deciding possibly in different instances)?”
My second contribution is a wait-free k-Test&Set-based adaptive renaming algorithm [C200]. The paper shows that the new space name can be reduced from 2p−1 (where p is the number of participating processes) to 2p−ceil(k/p) . This work encouraged us to investigate the relation linking Test&Set, adaptive agreement and set agreement. This investigation resulted in [C216,C223]. As Test&Set, adaptive agreement and set agreement are sub-consensus tasks (i.e., they are weaker than consensus), the holy grail is here to establish a hierarchy of subconsensus tasks (if any), similarly to the hierarchy of consensus tasks as defined by M. Herlihy. This line of research is fundamental if we want to understand the power of base synchronization primitives. In some sense, the research agenda is here to establish the “Mendeleiev’s table” of sub-consensus tasks. A very early step in that direction appears in [R106].
I also proposed and investigated (with G. Taubenfeld) the notion of timed register [C208]. A timed register generalizes the notion of an atomic register as follows: if a process invokes two consecutive operations on the same timed register which are a read followed by a write, then the write operation is executed only if it is invoked at most d time units after the read operation, where d is defined as part of the read operation. We show that a timed register is a universal object (i.e., an object from which any object defined by a sequential specification can be built despite asynchrony and failures) .
□ k-Set agreement
[C223] explores a new direction to solve the k-set agreement problem in a synchronous system of n processes. It considers that the system is enriched with base objects (denoted [m, x] SA objects) that allow solving the x-set agreement problem in a set of m processes (m < n). This work has several contributions. It first proposes a synchronous k-set agreement algorithm that benefits from such underlying base objects.
and presents associated lower bounds on the minimal number rof rounds that generalize the bounds previously established for solving the k-set agreement problem in pure synchronous systems.
2.6 Formal models for distributed computing
The Iterated Immediate Snapshot (IIS) model has been introduced by E. Borowsky and E. Gafni (PODC 1993). . It is an asynchronous computation model where processes communicate through a sequence of one-shot immediate snapshot objects. It is known that this model is equivalent to the usual asynchronous read/write shared memory model, for wait-free task solvability. Its interest lies in the fact that its runs are more structured and easier to analyze than the runs in the shared memory model. As the IIS model and the shared memory model are equivalent for wait-free task solvability, a natural question is the following: “Are they still equivalent for wait-free task solvability, when they are enriched with the same failure detector?” Rajsbaum, Travers, and I showed in [R100] that the answer to this question is “no.”
At first glance, this answer can appear counter-intuitive. So, the next question is the following: “Given a shared memory model enriched with a failure detector, what is an equivalent IIS model?” [C226] shows that an elegant way of capturing the power of a failure detector and other partially synchronous systems in the IIS model is by restricting appropriately its set of runs, giving rise to the Iterated Restricted Immediate Snapshot model (IRIS).
2.7 Software transactional memory
I started becoming interested in software transactional memories (STM) in 2007. Basically, the problem consists in discharging the application programmer from the management of the underlying synchronization in multiprocess programs that access shared objects. From a practical point of view, STM systems are one of the most promising approach to take up challenge posed by the recent advance of multicore architectures and the deployment of multiprocessors as the mainstream computing platforms. From a theoretical point of view, STM systems give a new impetus that forces to rethink the way synchronization problems have to be solved (basically, synchronization is coming back, but it is not the same [C224]).
My initial contributions concern the following points. While the fate of a transaction is to commit or abort, no current STM specification states situations where the STM system is forced to commit a transaction. In [C231], with my PhD student D. Imbs, I introduced a new property called obligation that specifies situations where a transaction cannot be aborted. A corresponding STM protocol is presented and formally proved correct.
The second contribution is the definition of a general framework to state consistency conditions suited to STM systems [C236]. This framework not only encompasses serializability, strict serializability and opacity (that are “traditional” consistency conditions), but permits to define new meaningful consistency conditions. Among them the (new) virtual world condition is particularly interesting: it is less restrictive than than opacity, while requiring that (whatever the fact that it commits or aborts) any transaction always reads its values from a consistent global state.
3 Research activity in the past five years: 2010-2014
3.1 Preamble
Whatever their granularity, today distributed applications are pervasive and benefit everyone (e.g., P2P, cloud computing, sensors networks, or social networks for “large grain” applications, and multicore for “small grain” applications). All these applications are becoming larger and larger and more and more distributed. The development of such platforms and their usage have somehow preceded their theoretical foundations. Up to now, their design principles look sometimes more like a “trick” than well-mastered basic principles. The explosion of the number of distributed applications and the number of “computing adversaries” such as scaling, misbehaviors (also characterized as malicious behaviour when referring to entities attempting to voluntarily or not hurt the system), dynamicity, etc., makes their basic principles more and more difficult to grasp. Traditional algorithms simply do not fit this challenging new setting and it is required to revisit the field.
Hence, research addressing distributed computing theory that can benefit future applications is more needed than ever. This is a great challenge for the computer science community and contitutes my main research motivation since 2010. As already said, of my leitmotifs is “When something works we have to know why it works, and when it does not work we have to know why it does not work”. In the recent past my research activity focused on the foundations of distributed and concurrent computing. The world is distributed, and consequently more and more applications are distributed. This provided me with a strong motivation to work in this reaserach area.
Some of the work that follows was funded by by a European Marie Curie Project (Transform) devoted to the theory of software trnasactional memory systems, and to the French ANR project DISPLEXITY, devoted to complexity and computabilty in distributed computing. I do not present them from «a pecific project» point of view, but in a more general scientific spirit.
3.2 A visible part of the iceberg: a few figures in the past five years
• Publications (with review): 29 articles in journals, and 48 papers in conferences.
• Four books: 2 Morgan & Claypool (251 p. & 165 p., 2010), 2 Springer (512 p. & 510 p., 2013).
• Four Best paper awards in top conferences:
(1) ACM Principles of Distributed Computing (PODC 2014, the very top conference in the
domain) [C301],
(2) Symposium on Stabilization, Safety and Security of Distributed Systems (SSS 2011) [C269],
(3) European Parallel Computing (Europar 2010) [253], and
(4) Symposium on Distributed Computing (DISC 2010) [C254] (this last paper –co-authored with
my PhD student F. Bonnet– obtained the “Best student paper”).
• Winner of “Innovation in Distributed Computing’ award (SIROCCO Prize) in 2015.
• Invited talks at top level international conferences: 5.
• Spring/Winter/Summer Schools and conference tutorials: 10.
• Main foreign co-authors (alphabetical order) during the period 2011-2014:
S. Arevalo (Universidad Politecnica de Madrid, Spain), R. Baldoni (La Sapienza, Roma, Italy),
J. Cao (Hong Kong Polytechnic University), A. Castaneda (UNAM, Mexico),
A. Fernandez (Institute IMDA, Madrid, Spain), V. Gramoli (University of Sydney, Australia),
M. Herlihy (Brown University, RI, USA), M. Larrea (University Basque Country, San Sebastian,
Spain), Y. Moses (Technion, Israël), S. Rajsbaum (UNAM, Mexico), G. Taubenfeld (Herlizya,
Israël) W. Wu (Sun Yat-Sen University, Guangzhou, China).
3.3 Symmetry breaking
□ Symmetry breaking (in the presence of failures) is a fundamental problem of distributed computing. More precisely, processes in a concurrent system need to coordinate using an underlying shared memory or a message-passing system in order to solve agreement tasks such as, for example, consensus or set agreement. However, coordination is often needed to break the symmetry of processes that are initially in the same state, for example, to get exclusive access to a shared resource, to get distinct names, or to elect a leader.
□ I (with A. Castaneda, D. Imbs, and S. Rajsbaum) introduced and studied the family of generalized symmetry breaking (GSB) tasks, that includes election, renaming and many other symmetry breaking tasks [C265]. Differently from agreement tasks, a GSB task is inputless, in the sense that processes do not propose values; the task only specifies the symmetry breaking requirement, independently of the initial state of the system (where processes differ only on their identifiers). Among various results characterizing the family of GSB tasks, we showed that perfect renaming is universal for all GSB tasks [C275]. We then studied the power of renaming with respect to k-set agreement. We showed that, in a system of n processes, perfect renaming is strictly stronger than (n−1)-set agreement, but not stronger than (n−2)-set agreement. Furthermore, (n+1)-renaming cannot solve even (n−1)-set agreement. As a consequence, there are cases where set agreement and renaming are incomparable when looking at their power to implement each other. We also showed that there is a large family of GSB tasks that are more powerful than (n−1)-set agreement [C287]. Some of these tasks are equivalent to n-renaming, while others lie strictly between n-renaming and (n+1)-renaming. Moreover, none of them can solve (n−2)-set agreement. Hence, the GSB tasks have a rich structure and are interesting in their own. The proofs of these results are based on combinatorial topology techniques and new ideas about different notions of non-determinism that can be associated with shared objects. Interestingly, this paper sheds a new light on the relations linking set agreement and symmetry breaking. All these results are pieced together in [R140]. We also showed that the notion of a process group allows the renaming space to be reduced according to the number of process groups [C284].
3.4 Failure detectors
Hybrid distributed system
□ With my PhD student D. Imbs, I introduced an asynchronous crash-prone hybrid system model, where the system is hybrid in the way the processes can communicate. On the one side, a process can send messages to any other process. On another side, the processes are partitioned into clusters and each cluster has its own read/write shared memory. In addition to the model, one of our contributions concerns the implementation of an atomic register in this system model. More precisely, we introduced a new failure detector (denoted MΣ) and showed that, when considering the information on failures needed to implement a register, this failure detector is the weakest. To that end, we presented an MΣ-based algorithm that builds a register in the considered hybrid system model and showed that it is possible to extract MΣ from any failure detector-based algorithm that implements a register in this model. We also (a) showed that MΣ is strictly weaker than Σ (which is the weakest failure detector to implement a register in a classical message-passing system) and (b) presented a necessary and sufficient condition to implement MΣ in a hybrid asynchronous communication system. These result are described in [R134] and [C269] (which obtained the Best Paper Award at SSS 2011).
Iterated distributed model
□The basic distributed asynchronous read/write computation model is made up of n asynchronous processes which communicate by reading and writing atomic registers only. The distributed asynchronous iterated model is a more constrained model in which the processes execute an infinite number of rounds and communicate at each round with a new object called immediate snapshot object. Moreover, in both models, up to n − 1 processes may crash in an unexpected way. When considering computability issues, two main results are associated with the previous models. The first states that they are computationally equivalent for decision tasks. The second states that they are no longer equivalent when both are enriched with the same failure detector.
□ With my PhD student J. Steiner, I showed how to capture failure detectors in each model so that both models become computationally equivalent [C280]. To this end, I introduced the notion of a “strongly correct” process which appears particularly well-suited to the iterated model, and designed simulations that prove the computational equivalence when both models are enriched with the same failure detector. I also extended these simulations to the case where the wait-freedom requirement is replaced by the notion of t-resilience. (The important new idea is here the notion of strongly correct processes. Those are the processes that “see” each other infinitely often. If, after some time, a process is late, it sees the values written by the strongly correct processes but the values it writes are not seen by them.)
3.5 Anonymous systems
□ Due to the multiplicity of loci of control, a main issue distributed systems have to cope with lies in the uncertainty on the system state created by the adversaries that are asynchrony, failures, dynamicity, mobility, etc. Considering message-passing systems, I addressed (with my PhD student F. Bonnet) the uncertainty created by the net effect of asynchrony and process crash failures in systems where the processes are anonymous (i.e., processes have no identity and locally execute the very same algorithm). Trivially, agreement problems such as consensus, that cannot be solved in non-anonymous asynchronous systems prone to process failures, cannot be solved either if the system is anonymous. So, we investigated failure detectors that allow processes to circumvent this impossibility. In [R121] we introduced a failure detector class denoted ψ, that gives to each process an upper bound on the number of processes that are currently alive (in a non-anonymous system, the classes ψ and P - the class of perfect failure detectors- are equivalent). We designed a simple ψ-based consensus algorithm where the processes decide in 2t + 1 asynchronous rounds (where t is an upper bound on the number of faulty processes), and showed shows that 2t+1 is a lower bound for consensus in the anonymous systems equipped with ψ. We then addressed early-decision, and presented and proved an early-deciding algorithm where the processes decide in min(2f+2, 2t+1) asynchronous rounds (where f is the actual number of process failures). This leads to think that anonymity doubles the cost (wrt synchronous systems) and it is conjectured that min(2f+2, 2t+1) is the corresponding lower bound.
□ We then continued our work of anonymous distributed systems and presented four failure detectors (denoted AP, AP, AΩ, and AΣ) and show that they are the “identity-free” counterparts of perfect failure detectors, eventual leader failure detectors and quorum failure detectors, respectively. AΣ is new and showing that AΣ and Σ have the same computability power in a non-anonymous system is not trivial. We also showed that the notion of failure detector reduction is related to the computation model. Then, we presented and proved correct a uniform anonymous consensus algorithm based on the failure detector pair (AΩ, AΣ) (“uniform” means here that not only processes have no identity, but no process is aware of the total number of processes). This new algorithm is not a simple “straightforward extension” of an algorithm designed for non-anonymous systems. To benefit from AΣ, it uses a novel message exchange pattern where each phase of every round is made up of sub-rounds in which appropriate control information is exchanged. Finally, we introduced the notions of failure detector hierarchy, and weakest failure detector, for anonymous consensus, and the implementation of identity-free failure detectors in anonymous systems. This work was published in [R133, C254].
□ I also studied the computability power of homonymous systems in [C277]. In these systems, several processes can have the same name. If all processes have distinct names we are in a classical dsitributed system, while we are in an anonymous system if all the processes have the same name. I mainly studied the solvability of agreement problems in such a context.
3.6 Synchronous vs asynchronous systems: towards an equivalence
□ A message adversary is a daemon that suppresses messages in round-based message-passing synchronous systems in which no process crashes. This notion has first been introduced by N.Santoro and P. Widmayer (STACS, 1989). A property imposed on a message adversary defines a subset of messages that cannot be eliminated by the adversary. It has recently been shown that, when a message adversary is constrained by ;a property denoted TOUR (for tournament), the corresponding synchronous system and the asynchronous crash-prone read/write system have the same computability power for task solvability.
□ With my PhD student J. Stainer, I investigated in [C290] new message adversary properties (denoted SOURCE and QUORUM), and showed that the synchronous round-based systems whose adversaries are constrained by these properties are characterizations of classical asynchronous crash-prone systems (1) in which processes communicate through atomic read/write registers or point-to-point message-passing, and (2) enriched with failure detectors such as the eventual leader Ω and the quorum failure detector Σ. Hence these properties characterize maximal adversaries, in the sense that they define strongest message adversaries equating classical asynchronous crash-prone systems. They consequently provide strong relations linking round-based synchrony weakened by message adversaries with asynchrony restricted with failure detectors. This not only enriches our understanding of the synchrony/asynchrony spectrum, but also allows for the establishment of a meaningful hierarchy of property-constrained message adversaries.
3.7 Asynchronous systems with Byzantine processes
□ Since 2013, I started to work again on Byzantine failures. This work (with A. Mostéfaoui and H. Moumen) presented in [C301] introduces a new round-based asynchronous consensus algorithm that copes with up to t Byzantine processes. This algorithm has the following noteworthy optimality properties: (a) it requires only t<n/3 (where n is the total number of processes); (b) it does not assume a computationally-limited adversary (hence it is not signature-based); (c) the expected number of rounds to decide is four; (d) each round is composed of at most three communication steps and involves O(n²) messages; and (e) a message is composed of a round number plus a single bit. Noe of the previous algorithms solving binary Byzantine consensus has all these properties; they either use signatures, or have an O(n³) message complexity, or require t<n/4.
To attain this optimality goal, the proposed consensus algorithm relies on a common coin as defined by Rabin, and a new extremely simple and powerful broadcast abstraction suited to binary values. The main target when designing this algorithm was to obtain a simple and efficient and algorithm. This was motivated by the fact that, among the first-class properties, simplicity –albeit sometimes under-estimated– is a major one. This work obtained the “Best paper” award at PODC 2014.
3.8 Concurrent data structures
□ An atomic snapshot object is an object that can be concurrently accessed by asynchronous processes prone to crash. It is a fundamental object of concurrent programming in the presence of failures. It is made of m components (base atomic registers) and is defined by two operations: an update operation that allows a process to atomically assign a new value to a component, and a snapshot operation that atomically reads and returns the values of all the components. In [R127] I proposed an algorithm implementing a partial snapshot object, i.e., an object where the snapshot operation that can take any subset of the components as input parameter, and atomically reads and returns the values of this subset of components. This algorithm is based on new notions called help-locality and freshness. Help-locality requires that an update operation helps only the concurrent partial snapshot operations that read the component it writes. When an update of a component r helps a partial snapshot, freshness requires that the update provides the partial snapshot with a value of the component r that is at least as recent as the value it writes into that component. (No snapshot algorithm proposed so far satisfies these properties). The algorithm is wait-free, linearizable and satisfies the previous efficiency properties. Interestingly, the principle that underlies the proposed algorithm is different from the one used so far, namely, it is based on the “write first, and help later” strategy. An improvement of the previous algorithm, based on LL/SC atomic registers, is also presented, which decreases the number of base registers from O(n²) to O(n).
□ With T. Crain (Marie Curie PhD student) we investigated efficient implementations of concurrent data structures such as binary trees, skip-lists, etc. The corresponding results are described in [C273,C288,C291].
3.9 Software transactional memory
□ The aim of a Software Transactional Memory (STM) system is to discharge the programmer from the ex-plicit management of synchronization issues. The programmer’s job resides in the design of multiprocess programs in which processes are made up of transactions, each transaction being an atomic execution unit that accesses concurrent objects. The important point is that the programmer has to focus her/his efforts only on the parts of code which have to be atomic execution units without worrying on the way the corresponding synchronization has to be realized.
□ After having introduced the virtual world consistency (VWC) condition [R128], I have shown that the three properties read invisibility, permissiveness and opacity are incompatible, while read invisibility, permissiveness and VWC are comptatible [C271]. While opacity requires that all the transactions (be them aborted or committed) appear as being totally ordered, VWC is weaker in as it only requires that an aborted transaction be ordered with respect to committed transactions only. This allows more transactions to be committed.
□ Among other results I also designed (with my PhD students D. Imbs and T. Crain) a universal construction for transaction-based multiprocess systems [R135]. This construction is such that (1) every invocation of a transaction is executed exactly once and (2) the notion of commit/abort of a transaction remains unknown to the programmer. This system, which imposes restriction neither on the design of processes nor on their concurrency pattern, can be seen as a step towards the design of a deterministic universal construction to execute transaction-based multiprocess programs on top of a multiprocessor. Interestingly, the proposed construction is lock-free (in the sense that it uses no lock).
3.10 Miscellaneous
□ During the past four years I also worked on other topics. I presents here only two of them. An important concept is the concept of recursivity. I investigated recursivity in asynchronous distributed systems where communication is through atomic read/write registers, and any number of processes can commit crash failures [R139,C274]. In such a context and differently from sequential and parallel recursion, the conceptual novelty lies in the fact that the aim of the recursion parameter is to allow each participating process to learn the number of processes that it sees as participating to the task computation.
□ A second important work is a joint work with Y. Moses [R130]. This work addresses the condition based simultaneous consensus problem in synchronous message-passing systems.
4. Looking at the future
□ Research constitutes one of the main «raisons d’ être» of the Universities. It is an obligation for all professors and an absolutely necessary activity to maintain their lectures vivid and up-to-date. I do think that, as a university professor (in computer science), my research activity has to concentrate on understanding computing phenomena, introducing computing-related concepts and clarifying notions that are relevant to computing. That is why I always strove to design algorithms that are as generic and simple as possible. Being generic and simple, they are not bound to specific contexts and consequently capture the essence of the problem they solve (the complementary facet to capture their essence being the determination of the lower bounds associated with the problems they solve).
□ I also think that genericity and simplicity (that go with beauty and elegance) are first-class citizen criteria when designing solutions to computing problems.
4.1 A big picture : At the frontiers of distributed computing
□ As already indicated, from a theoretical point of view, the aim of distributed computing is to answer the question: “Which problems can be solved by a set of cooperating entities, and, if the answer is yes, which are the best algorithms for the corresponding problem (best according to some complexity measures)?” Since more than twenty years, this constitutes my research program, which focuses on fundamental issues, namely investigate the limits of distributed computing in presence of adversaries such that asynchrony, failures, anonymity, dynamicity, etc. Lots of results are known for asynchronous systems prone to failures (e.g., the fundamental problem that is consensus has concentrated lots of efforts and its study has provided computer scientists with a deep knowledge on what constitutes a part of the essence of distributed computing). Despite these great advances, lot of work remains to be done. As an example, despite their interest in real applications, dynamicity and anonymity are not yet well understood. My research programme looks in this direction and focuses on the following domains: distributed computability in asynchronous read/write shared memory systems, distributed computability in asynchronous message-passing systems, relations between the two previous distributed computing models, and concurrent objects. It is of great importance to develop a theory of distributed computing that provides us with concepts and paradigms that help us understand the possibilities and limitations of distributed systems. Such a knowledge is a necessary pre-requisite if one wants to master future (non-trivial) distributed applications.
□ Today distributed applications are pervasive, some very successful (e.g., Internet, P2P, social networks, cloud computing), and benefit everyone, but the design and the implementation of a lot of them rely more on «tricks» than on sane scientific foundations.
Keywords
Distributed computability, Distributed computing, Asynchronous/synchronous system, Message- passing system, Shared memory system, Fault-tolerance, Concurrent object, Synchronization.
4.2 Distributed computability
Be the communication medium a shared memory of a message-passing system, the aim of distributed computability is to answer the question “what can be computed in a distributed system?” I present below a few fundamental distributed computability problems in which I am interested.
From decision problems to the ranking of sub-consensus tasks
□ I became acquainted with this topic when I tried to establish a connection between the adaptive renaming problem and both the k-set agreement and the (weaker) k-test&set problem [C200,C216]. These works showed that the new space name can be reduced from 2p − 1 (where p is the number of participating processes) to 2p−floor(k/p) if we have underlying k-test&set objects, and to p+k−1 if we have underlying k-set agreement objects. □ These results encouraged me to to investigate the relation linking Test&set, adaptive agreement and set agreement. This investigation resulted in [C216]. This “warm-up” research period showed me the richness and the profoundness of the topic. As Test&set, adaptive agreement and set agreement are sub-consensus tasks (i.e., they are weaker than consensus), the “Holy grail” quest is here to establish a hierarchy of sub-consensus tasks (if any), similarly to the hierarchy of consensus tasks as defined by Herlihy. This line of research is fundamental if we want to understand the power of base synchronization primitives. In some sense, we can say that while Herlihy has established the “Mendeleiev’s table” of consensus tasks, the research agenda is here to establish the corresponding table of sub-consensus tasks.
□ The recent paper to appear in SIAM JC [R140] is a promising step in this direction, that I want to continue investigating.
Distributed universal construction
□ A notion of a universal construction suited to distributed computing has been introduced by M. Herlihy in his celebrated paper on wait-free synchronization (ACM TOPLAS 1981). A universal construction is an algorithm that can be used to wait-free implement any object defined by a sequential specification. Herlihy’s paper shows that the basic system model, which supports only atomic read/write registers, has to be enriched with consensus objects to allow the design of universal constructions. The generalized notion of a k-universal construction has been recently introduced by Gafni and Guerraoui (CONCUR 2011). A k-universal construction is an algorithm that can be used to simultaneously implement k objects (instead of just one object), with the guarantee that at least one of the k constructed objects progresses forever. While Herlihy’s universal construction relies on atomic registers and consensus objects, a k-universal construction relies on atomic registers and k-simultaneous consensus objects (which are wait-free equivalent to k-set agreement objects in the read/write system model).
□ I intend to work on distributed universal constructions, and I already started thinking to build a very general universal construction with the following properties (not satisfied by previous universal constructions). (1) Among the k objects that are constructed, at least x objects (and not just one) are guaranteed to progress forever; (2) The progress condition for processes is wait-freedom, which means that each correct process executes an infinite number of operations on each object that progresses forever; (3) If any of the k constructed objects stops progressing, all its copies (one at each process) stop in the same state; (4) The proposed construction is contention-aware, in the sense that it uses only read/write registers in the absence of contention; (5) It has to be generous with respect to the obstruction-freedom progress condition, which means that each process is able to complete any one of its pending operations on the k objects if all the other processes hold still long enough. Such a construction, that I call (k,x)-universal, should be based on a simple extension of k-simultaneous consensus objects that I (with co-authors) introduced in [R117].
A lasting open problem:k-set agreement in message-passing systems
□ The weakest failure detector for message-passing k-set agreement Assuming each process proposes a value, the k-set agreement requires that each non-faulty process decides a value such that a decided value is a proposed value, and at most k different values are decided. This problem, which generalizes consensus, is impossible to solve in asynchronous crash-prone systems. While the weakest failure detector for solving the k-set agreement problem in crash-prone asynchronous read/write shared memory systems is known, for message-passing systems the weakest failure detectors are known only for the extreme cases k=1 and k =n−1.
□ The important remaining problem is then finding the weakest failure detector for any value of k. I think that answering this question is related to the minimal consistency properties a shared memory has to satisfy in order the k-set agreement problem can be solved (these properties being weaker than the classical properties associated with read/write operations). I started investigating this difficult question, but up to now I have only partial answers [R122,C270,C282]. Hence, despite the efforts of the community, this is an important and lasting consistency problem that belongs to my research programme. My hope is that solving this problem will give us a much clearer view of the relation between read/write shared memory systems and message-passing systems.
4.3 Fault-tolerance in the presence of Byzantine processes
A Byzantine process is a process that behaves arbitrarily. This type of process failures (be it intentional or not) can no longer be ignored in message-passing systems. This has motivated myrecent effort on this topic (e.g., [C303] SIROCCO 2014, and [C301] Best Paper at PODC 2014). I intend to continue to work in this important topic in the next years. More precisely, I want to address the following issues.
Efficient multivalued Byzantine consensus
□ The paper [C301] describes a randomized signature-free algorithm that solves binary consensus with an expected cost of O(n²) messages. “Binary” means that only two different values can be proposed. Differently, any number of different values can be proposed to “multivalued” consensus. It exists algorithms that solves multivalued consensus on top binary consensus, but the message cost of these reduction algorithms is O(n³). I have the intuition that the multivalued consensus problem can be solved with O(n²) expected messages, and a constant number of expected rounds. Such an algorithm would provide a «definitive» answer to the problem.
□ Another open problem that I want to address is the following. Let us consider the family of deterministic Byzantine consensus algorithms (i.e., algorithms which are not randomized algorithms). In such a context a fundamental question is the following: “Which is the weakest synchrony assumption needed to solve consensus in a message-passing system prone to Byzantine process failures?” This is a difficult problem on which I have some ideas and that I want to address.
Computability in the presence of Byzantine failures
□ Borowsky-Gafni’s (BG) simulation is a very powerful reduction algorithm designed for asynchronous read/write crash-prone systems, namely, it allows a set of (t + 1) asynchronous sequential processes to wait-free simulate (i.e., despite the crash of up to t of them) an arbitrary number n of processes under the assumption that at most t of them crash. This shows that, in read/write crash-prone systems, t-resilience of decision tasks can be fully characterized in terms of wait-freedom. Said another way, the BG simulation shows that, in read/write systems, a crucial parameter is not the number n of processes, but the upper bound t on the number of processes that may crash in a run.
□ I intend to design BG-like simulations in the context of asynchronous message-passing systems (which has never been considered before). This will be done in two directions. The first will consider that processes may fail by crashing. Assuming t < min(n',n/2), the aim is to simulate a system of n' processes where up to t may crash, on top of a basic system of n processes where up to t may crash. The second simulation will concern the case where processes may commit Byzantine failures (up to now the BG simulation considered only process crahs failures). Assuming t < min(n',n/3), the aim is here to simulate a system of n' processes where up to t may be Byzantine, on top of a basic system of n processes where up to t may be Byzantine. Moreover, these asynchronous message-passing simulations have to be direct in the sense that they are not allowed to simulate a shared memory on top of which a suited read/write BG simulation would be used.
□These constraints are motivated by the fact that they help better understand the deep nature and the difference of crash failures and Byzantine failures in asynchronous message-passing systems.
4.4 Relating theory and practice
A Holy Grail
Is there a “Grand Unified Model”? Synchronous systems and asynchronous systems are the two endpoints of the synchrony axis. With the new adversaries that are anonymity, dynamicity, mobility, etc., finding a distributed computing model that is both realistic and abstract enough (to be tractable) is a real challenge, which will maybe remain an inaccessible scientific “holy grail”. Like physicists who are looking for a ”Grand Unified Theory” that would encompass all the fundamental concepts of physics, a (much more modest but still) very challenging task consists in looking for a “Grand Unified Model” for distributed computing. Albeit answering this question seems to much ambitious, it remains at the horizon as a Leibnitz’s dream. I started working very recently on this topic ([C290] is a very partial anwser to this fundamental issue) and I inted to continue.
Mathematics of distributed computing
□ After discussion with S. Rajsbaum (UNAM Mexico) and M. Herlihy (Brown University, RI, USA), I started becoming interested in topology as the part of mathematics that might provide us with underlying foundations of distributed computing (this approach is described in their very recent book).
□ I consequently worked with them and we wrote an introductory survey [R137]. We then introduced and investigated thhe notion of solo executions. A process runs solo when it computes its local output without receiving any information from other processes, either because they crashed or they are too slow. While in wait-free shared-memory models at most one process may run solo in an execution, any number of processes may have to run solo in an asynchronous wait-free message-passing model. We introduced a family of round-based wait-free models, called the d-solo models, 1 ≤ d ≤ n, where up to d processes may run solo, and gave a characterization of the colorless tasks that can be solved in each d-solo model [C297]. These results establish for the first time a hierarchy of wait-free models that, while weaker than the basic read/write model, are nevertheless strong enough to solve non-trivial tasks.
□ I plan to continue investigating the topology approach to give sane foundations to distributed computing. Maybe this will help answering (partially) important questions raised in the previous section.
Engineering of distributed computing
□ Practitioners and engineers have proposed a number of reusable frameworks and services to implement specific distributed services (from Remote Procedure Calls with Java RMI or SOAP-RPC, to JGroups for group communication, and Apache Zookeeper for primary backup replication). Unfortunately, many of these efforts lack a sound grounding in distributed computation theory (with the notable exceptions of JGroups and Zookeeper), and only provide punctual and partial solutions for a narrow range of services.
□ From my point of view, this is because we still lack a generic framework that is able to unify the large body of fundamental knowledge on distributed computation that has been acquired over the last 20 years. A central issue of distributed computing consists consequently in bridging this gap, by developing a systematic model of distributed computation that organises the functionalities of a distributed computing system into reusable modular constructs. These constructs should be composable via well-defined mechanisms that maintain sound theoretical guarantees on the resulting system.
□ In relation with my previous research topics, I intend to spend some time and efforts also in this direction, which is crucial for distributed computing engineering. Sound distributed computing engineering is related to the foundations of distributed computing. |