Uniform Cloud and Edge Stream Processing for Fast Big Data Analytics

Team and supervisors
Department / Team: 
Team Web Site: 
https://team.inria.fr/kerdata/
PhD Director
Alexandru Costan
Co-director(s), co-supervisor(s)
Gabriel Antoniu
Contact(s)
Name: Alexandru Costan
Email address: alexandru.costan@irisa.fr
Phone number: +33 2 99 84 75 66
PhD subject
Abstract

Context

The recent spectacular rise of the Internet of Things (IoT) and the resulting growth of the data deluge have motivated the emergence of Edge computing [1] as a means to move processing from centralized Clouds towards decentralized processing units close to the data sources. The key idea is to leverage computing and storage resources at the “edge” of the network, i.e., near the places where data is produced (e.g., sensors, routers). These resources can be used to filter and pre-process data or to perform (simple) local computations (for instance, a home assistant may perform a first lexical analysis before requesting a translation from the Cloud).
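As an illustration of such Edge-side pre-processing, the following minimal Python sketch collapses raw sensor readings into per-window summaries, so that only the summaries would need to be forwarded to the Cloud. The names `WindowSummary` and `summarize_windows`, as well as the fixed-size count window, are assumptions made for this example and do not correspond to any specific engine.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class WindowSummary:
    """Compact record sent upstream instead of the raw readings (hypothetical)."""
    count: int
    minimum: float
    maximum: float
    mean: float


def summarize_windows(readings: Iterable[float], window_size: int) -> Iterator[WindowSummary]:
    """Collapse every `window_size` consecutive readings into one summary."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    window: List[float] = []
    for value in readings:
        window.append(value)
        if len(window) == window_size:
            yield WindowSummary(len(window), min(window), max(window),
                                sum(window) / len(window))
            window = []
    if window:
        # Flush the last, possibly incomplete, window.
        yield WindowSummary(len(window), min(window), max(window),
                            sum(window) / len(window))


# Example: five raw readings become three summaries.
for summary in summarize_windows([1.0, 2.0, 3.0, 4.0, 5.0], window_size=2):
    print(summary)
```

With a window of size n, the number of records crossing the network drops by roughly a factor of n, at the price of losing the individual readings.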

 

This scheme has led to new challenges regarding how to distribute processing across Cloud-based, Edge-based or hybrid Cloud/Edge-based infrastructures. State-of-the-art approaches advocate either “100% Cloud” or “100% Edge” solutions. In the former case, a plethora of Big Data stream processing engines such as Apache Spark [2] and Apache Flink [3] have emerged for data analytics and persistence, using the Edge devices only as proxies that forward data to the Cloud. In the latter case, Edge devices are used to take local decisions and deliver on the real-time promise of the analytics, improving the reactivity and “freshness” of the results. Several Edge analytics engines have emerged lately (e.g., Apache Edgent [4], Apache MiNiFi [5]), enabling basic local stream processing on low-performance IoT devices. Neither approach is uniformly more efficient than the other: intuitively, their relative efficiency depends on many parameters, including network technology, hardware characteristics, volume of data, computing power, processing framework configuration and application requirements, to cite a few.

Problem statement

Hybrid approaches, combining both Edge and Cloud processing, require deploying two different processing engines, one for each platform. This involves two different programming models, synchronization overheads between the two engines and empirical splits of the processing workflow schedule across the two infrastructures, which eventually lead to sub-optimal performance.

Thesis goal

This PhD thesis aims to propose a unified stream processing model for both Edge and Cloud platforms. In particular, the thesis will seek answers to the following research question: how much can one improve (or degrade) the performance of an application by performing computation closer to the data sources rather than keeping it in the Cloud? To this end, the PhD thesis will first devise a methodology to understand the performance trade-offs of Edge-Cloud executions, and then design a unified processing model capable of exploiting the semantics of both platforms. The model will be implemented and experimentally evaluated with representative real-life stream processing use-cases executed on hybrid Edge-Cloud testbeds. The high-level goal of this thesis is to enable the usage of a large spectrum of Big Data analytics techniques at extreme scales, to support fast decision making in real time.
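To make the research question more concrete, the following Python sketch compares the end-to-end latency of one processing step under two placements, using a deliberately simplified first-order model (transfer time plus compute time plus one network round trip). The model, the function names and all numeric values are assumptions chosen for illustration; they are not measurements and not the methodology the thesis will develop.

```python
def cloud_latency(input_bytes: float, ops: float, bandwidth_bps: float,
                  rtt_s: float, cloud_ops_per_s: float) -> float:
    """Ship the raw input to the Cloud, then compute there."""
    transfer_s = input_bytes * 8 / bandwidth_bps
    compute_s = ops / cloud_ops_per_s
    return rtt_s + transfer_s + compute_s


def edge_latency(output_bytes: float, ops: float, bandwidth_bps: float,
                 rtt_s: float, edge_ops_per_s: float) -> float:
    """Compute on the (slower) Edge device, then ship only the result."""
    compute_s = ops / edge_ops_per_s
    transfer_s = output_bytes * 8 / bandwidth_bps
    return compute_s + rtt_s + transfer_s


# Hypothetical scenario: 1 MB of input reduced to 10 kB of output,
# an 8 Mbit/s uplink, and an Edge device 50x slower than the Cloud.
cloud = cloud_latency(input_bytes=1_000_000, ops=1e8, bandwidth_bps=8_000_000,
                      rtt_s=0.05, cloud_ops_per_s=1e10)
edge = edge_latency(output_bytes=10_000, ops=1e8, bandwidth_bps=8_000_000,
                    rtt_s=0.05, edge_ops_per_s=2e8)
print(f"cloud: {cloud:.2f} s, edge: {edge:.2f} s")
```

In this scenario the Edge placement wins because the uplink is the bottleneck; with a faster network or a heavier computation the balance shifts towards the Cloud, which is precisely the kind of trade-off the methodology is meant to characterize.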

Target use-case

The unified processing model will be evaluated using the requirements of a real-life production application, provided by our partners from University Politehnica of Bucharest. The unified model will be used as the processing engine of the MonALISA [6] monitoring system of the ALICE experiment at CERN [7]. ALICE (A Large Ion Collider Experiment) is one of the four LHC (Large Hadron Collider) experiments run at CERN (European Organization for Nuclear Research). ALICE collects data at a rate of up to 4 Petabytes per run and produces more than 10^9 data files per year. Tens of thousands of CPUs are required to process and analyze them. More than 350 MonALISA services are running at sites around the world, collecting information about ALICE computing facilities, local- and wide-area network traffic, and the state and progress of the many thousands of concurrently running jobs. Currently, all the monitoring and alerting logic is implemented in the Cloud, which incurs high latency. With the proof-of-concept envisioned by this thesis, the goal is to enable Edge/Cloud processing of monitoring data and thus faster alerts and decision making, as soon as the data is collected.

Enabling technologies

 

In the process of designing the unified Edge-Cloud data processing framework, we will leverage in particular techniques for data processing already investigated by the participating teams as proof-of-concept software, validated in real-life environments:

  • The KerA [8] approach for Cloud-based low-latency storage for stream processing (currently under development at Inria, in collaboration with Universidad Politécnica de Madrid, in the framework of a contractual partnership between Inria and Huawei Munich). By eliminating storage redundancies between data ingestion and storage, preliminary experiments with KerA successfully demonstrated its capability to increase throughput for stream processing.
  • The Planner [9] middleware for cost-efficient execution plans placement for uniform stream analytics on Edge and Cloud. Planner automatically selects which parts of the execution graph will be executed at the Edge in order to minimize the network cost. Real-world micro-benchmarks show that Planner reduces the network usage by 40% and the makespan (end-to-end processing time) by 15% compared to state-of-the-art approaches.
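
The idea of selecting which part of an execution graph runs at the Edge in order to reduce network cost can be illustrated with a toy Python sketch. It is not Planner's actual algorithm: it assumes a linear pipeline, known output data rates, and a per-operator flag saying whether the Edge device can run it. The names `Operator` and `best_cut` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Operator:
    name: str
    output_rate: float   # bytes/s emitted by the operator (assumed known)
    edge_capable: bool   # whether an Edge device can run it (assumption)


def best_cut(source_rate: float, pipeline: Sequence[Operator]) -> int:
    """Return k such that the first k operators run at the Edge.

    The cut is placed where the data rate crossing the network is lowest,
    among prefixes made only of Edge-capable operators. k = 0 means that
    the raw source stream is sent to the Cloud unchanged.
    """
    best_k, best_rate = 0, source_rate
    for k, op in enumerate(pipeline, start=1):
        if not op.edge_capable:
            break  # every later operator must also run in the Cloud
        if op.output_rate < best_rate:
            best_k, best_rate = k, op.output_rate
    return best_k


pipeline = [
    Operator("parse", 900.0, True),
    Operator("filter", 100.0, True),
    Operator("join", 50.0, False),      # needs Cloud-side state
    Operator("aggregate", 10.0, True),
]
k = best_cut(source_rate=1000.0, pipeline=pipeline)
print("run at the Edge:", [op.name for op in pipeline[:k]])
```

A realistic placement would also have to account for general (non-linear) graphs, Edge resource limits and processing time, which this sketch ignores.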
Bibliography

[1] M. Satyanarayanan, “The emergence of edge computing,” Computer, 2017.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. “Spark: cluster computing with working sets”. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 2010.

[3] P. Carbone, S. Ewen, G. Fóra, S. Haridi, S. Richter, and K. Tzoumas. “State management in Apache Flink: consistent stateful distributed stream processing”. Proc. VLDB Endow. 10, 12, 2017, 1718-1729.

[4] Apache Edgent, http://edgent.apache.org, accessed on January 2019.

[5] Apache MiNiFi, https://nifi.apache.org/minifi/, accessed on January 2019.

[6] I. Legrand, R. Voicu, C. Cirstoiu, C. Grigoras, L. Betev, and A. Costan. “Monitoring and control of large systems with MonALISA”. Communications of the ACM 52, 9, September, 2009, 49-55.

[7] The ALICE LHC Experiment, https://home.cern/science/experiments/alice, accessed on January 2019.

[8] O.C. Marcu, A. Costan, G. Antoniu, M. Pérez-Hernández, B. Nicolae, et al. “KerA: Scalable Data Ingestion for Stream Processing”. ICDCS 2018 - 38th IEEE International Conference on Distributed Computing Systems, Vienna, Austria, pp. 1480-1485, 2018.

[9] L. Prosperi, A. Costan, P. Silva, and G. Antoniu. “Planner: Cost-efficient Execution Plans Placement for Uniform Stream Analytics on Edge and Cloud”. In WORKS 2018: 13th Workflows in Support of Large-Scale Science Workshop, held in conjunction with the IEEE/ACM SC18 conference, 2018.

Work start date: 
October 1st, 2019
Keywords: 
Big Data, stream processing, cloud computing, edge computing
Place: 
IRISA - Campus universitaire de Beaulieu, Rennes