
Elastic In Transit Data Analysis for High Performance Computing Applications

Team and supervisors
Department / Team: KerData
Team Web Site: https://team.inria.fr/kerdata/
PhD Director: Gabriel Antoniu
Co-director(s), co-supervisor(s): Matthieu Dorier
Contact(s)
  • Gabriel Antoniu, gabriel.antoniu@inria.fr, +33 2 99 84 72 44
  • Matthieu Dorier, mdorier@anl.gov
PhD subject
Abstract

1. Description

Supercomputers are expected to reach Exascale by 2021. With millions of cores grouped in massively multi-core nodes, such machines are capable of running scientific simulations at scales and speeds never achieved before, benefiting domains such as biology, astrophysics, and computational fluid dynamics. These simulations, however, pose important data management challenges: they typically produce very large amounts of data (on the order of petabytes) that have to be processed to extract scientific insight. Storing this data for later, “offline” analysis becomes infeasible at such scales. Hence, many computational scientists have moved to “in situ” and “in transit” analysis strategies, in which analysis tasks are executed on the same nodes as the simulation or on remote nodes, respectively. These techniques usually rely on a library or middleware that interfaces the simulation with the analysis codes.

To enable in situ / in transit analysis, different software technologies such as Damaris [1,2], Decaf [3], ADIOS [4], and FlowVR [5] have been developed. In all of these technologies, a subset of the available resources (e.g., cores or nodes) is allocated to data analysis tasks. This allocation is completely static: the simulation cannot attach to or detach from analysis processes while it is running, and the analysis resources cannot be provisioned dynamically. Such elasticity is crucial in cases where the in situ / in transit analysis (e.g., visualization) is only needed at specific times (e.g., during working hours) and should be turned off otherwise (e.g., at night). To date, no existing in situ / in transit analysis middleware supports such elasticity.
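
To make the targeted capability concrete, the C++ sketch below shows what an elastic interface could look like from the simulation's point of view: analysis resources are attached during working hours and released at night while the simulation keeps running. This is purely illustrative; no existing middleware exposes such an interface, and all names (ElasticAnalysisPool, attach, detach, publish) are invented for this sketch.

    // Purely hypothetical sketch: no existing in situ / in transit middleware
    // exposes such an elastic interface today; all names below are invented.
    #include <cstddef>
    #include <cstdio>
    #include <ctime>

    struct ElasticAnalysisPool {
        // Ask the middleware to provision 'n' remote analysis nodes.
        void attach(int n) { std::printf("requesting %d analysis nodes\n", n); }
        // Release all currently attached analysis nodes.
        void detach() { std::printf("releasing analysis nodes\n"); }
        // Hand the current iteration's data to whatever analysis resources exist.
        void publish(const double* field, std::size_t count, int iteration) {
            (void)field;  // a real middleware would ship or expose this buffer
            std::printf("iteration %d: published %zu values\n", iteration, count);
        }
    };

    // Placeholder policy: visualization is only wanted during working hours.
    static bool working_hours() {
        std::time_t t = std::time(nullptr);
        std::tm local = *std::localtime(&t);
        return local.tm_hour >= 8 && local.tm_hour < 20;
    }

    int main() {
        ElasticAnalysisPool pool;
        double field[1024] = {};
        bool attached = false;
        for (int it = 0; it < 100; ++it) {
            // ... simulation time step updates 'field' ...
            if (working_hours() && !attached)      { pool.attach(4);  attached = true;  }
            else if (!working_hours() && attached) { pool.detach();   attached = false; }
            pool.publish(field, 1024, it);  // analyzed only if resources are attached
        }
        return 0;
    }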

In addition, current state-of-the-art in situ / in transit technologies are mainly built on the Message Passing Interface (MPI). Although MPI is used by most world-class supercomputing applications, it lacks features such as dynamic load balancing and support for heterogeneous environments, which are essential for in situ / in transit libraries and middleware in many cases. For example, many supercomputers are equipped with dedicated, GPU-based visualization clusters connected to the main computing resources via a high-speed network, while ordinary network connections are used between computing nodes. Such modern architectures cannot currently be used for in transit visualization, even though an in transit middleware could, in principle, support such a heterogeneous environment.

This PhD thesis aims to enable elastic in situ / in transit analysis. From a practical perspective, research will be conducted to explore how to support such features. For experimental purposes, these features will be implemented and evaluated on top of Damaris, a middleware for scalable I/O (input/output) and in situ / in transit analysis and visualization of HPC simulations, developed by the KerData team at Inria: https://project.inria.fr/damaris/. Damaris allows dedicating some cores of a multi-core node (or some nodes of a cluster) to data processing tasks, and enables visualization frameworks to connect to and interact with running simulations. Damaris is used by several academic and industry partners, including Total, which uses it for in situ visualization of its geophysics simulations.
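
To make the dedicated-core model concrete, the sketch below shows how a typical MPI simulation is instrumented with Damaris. It follows the calls described in the public Damaris documentation as I understand them (damaris_initialize, damaris_start, damaris_write, damaris_end_iteration, ...); the header name, exact signatures, configuration file, and variable name ("config.xml", "my/field") are placeholders to be checked against the current release.

    #include <mpi.h>
    #include <Damaris.h>   // header name as distributed with Damaris (to be checked)

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // The XML file declares variables, memory layouts, and which cores or
        // nodes are dedicated to analysis; "config.xml" is a placeholder here.
        damaris_initialize("config.xml", MPI_COMM_WORLD);

        int is_client = 0;
        damaris_start(&is_client);  // dedicated cores enter the Damaris server loop here
        if (is_client) {
            MPI_Comm comm;
            damaris_client_comm_get(&comm);  // communicator over simulation cores only

            double field[1024] = {};
            for (int it = 0; it < 100; ++it) {
                // ... one simulation time step over 'comm' updates 'field' ...
                damaris_write("my/field", field);  // expose the data to dedicated cores
                damaris_end_iteration();           // analysis/visualization plugins may run
            }
            damaris_stop();
        }

        damaris_finalize();
        MPI_Finalize();
        return 0;
    }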

Enabling elastic in situ / in transit analysis within the Damaris framework will involve research on (1) how to efficiently migrate and reorganize data when the amount of resources dedicated to analysis tasks changes, (2) how to do so in a way that is transparent to both the running simulation and the analysis programs, and (3) how to make efficient use of RDMA (Remote Direct Memory Access) for such elasticity.
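
Regarding point (1), the core difficulty is deciding which already-buffered data blocks must move when analysis processes join or leave. The C++ fragment below is only an illustrative sketch, not a committed design: it recomputes block ownership with a naive modulo policy and returns the resulting migration plan; a real solution would also have to preserve locality and perform the actual transfers with RDMA.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical descriptor: which analysis rank currently owns which data block.
    struct Block { std::size_t id; std::size_t owner; };

    // Recompute ownership for a new number of analysis ranks and return the
    // list of blocks that actually need to move (the migration plan).
    std::vector<Block> rebalance(std::vector<Block>& blocks, std::size_t new_rank_count) {
        std::vector<Block> to_move;
        for (auto& b : blocks) {
            std::size_t new_owner = b.id % new_rank_count;  // naive placement policy
            if (new_owner != b.owner) {
                to_move.push_back({b.id, new_owner});       // would trigger an RDMA transfer
                b.owner = new_owner;
            }
        }
        return to_move;
    }

    int main() {
        std::vector<Block> blocks;
        for (std::size_t i = 0; i < 16; ++i) blocks.push_back({i, i % 8});  // 8 analysis ranks
        auto plan = rebalance(blocks, 4);  // the analysis pool shrinks to 4 ranks
        std::printf("%zu of %zu blocks must migrate\n", plan.size(), blocks.size());
        return 0;
    }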

As a refinement, we also plan to design a mechanism for incrementally migrating running stream tasks from the in situ processing backend to the in transit one without stopping the query execution.

At a technical level, this work will also involve replacing the current MPI-based communication mechanism in Damaris with an RPC-based mechanism, using software libraries developed at Argonne National Laboratory (ANL).

2. Target Use Cases

The proposed solution will be evaluated with real-life simulations. In particular, we envision a collaboration with the Japanese Aerospace Exploration Agency (JAXA). In this context, the candidate would do an internship at JAXA to augment an existing CFD simulation with elastic in situ / in transit visualization on a heterogeneous machine (to be confirmed). Other application environments can be explored with Argonne National Laboratory (USA), with which the KerData team has a long-running collaboration and where the PhD student could also make research visits.

3. Enabling Techniques

In the process of enabling elasticity in Damaris, we will leverage the following tools:

3.1 Damaris

Damaris is a middleware for scalable, asynchronous I/O and in situ / in transit visualization and processing, developed at Inria. Damaris has already demonstrated its scalability up to 16,000 cores on some of the top supercomputers of the Top500 list, including Titan, Jaguar, and Kraken. Development is currently in progress, within a contractual framework between Total and Inria, to use Damaris for in situ visualization of extreme-scale simulations at Total.

3.2 Mercury

Mercury is a Remote Procedure Call (RPC) and Remote Direct Memory Access (RDMA) library developed by Argonne National Laboratory and the HDF Group. It is at the core of multiple DOE projects at Argonne. Mercury enables high-speed, low-latency RPCs and data transfers over a wide range of network fabrics (TCP, InfiniBand, Cray GNI, etc.). Since current MPI implementations are not flexible in heterogeneous environments, Mercury will be used in this PhD thesis to replace the communication layer of Damaris.
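
As an illustration of the RPC style that would replace Damaris's MPI-based communication, the sketch below sets up a minimal RPC server with Margo, an Argonne library that layers a simple RPC API over Mercury and Argobots; whether Damaris would use Mercury directly or through such a wrapper is a design choice to be explored. The sketch follows the style of the public Margo "hello" server examples as I recall them; the RPC name is a placeholder and all signatures should be checked against the current release.

    #include <margo.h>
    #include <cstdio>

    static void hello_ult(hg_handle_t handle);
    DECLARE_MARGO_RPC_HANDLER(hello_ult)

    static void hello_ult(hg_handle_t handle) {
        // A Damaris-like service would receive a data descriptor here and pull
        // the actual payload over RDMA (e.g., with margo_bulk_transfer).
        std::printf("received RPC\n");
        margo_destroy(handle);
    }
    DEFINE_MARGO_RPC_HANDLER(hello_ult)

    int main() {
        // "tcp" selects the network fabric; Mercury also supports, e.g., InfiniBand verbs.
        margo_instance_id mid = margo_init("tcp", MARGO_SERVER_MODE, 0, -1);
        if (mid == MARGO_INSTANCE_NULL) return 1;

        hg_id_t id = MARGO_REGISTER(mid, "hello", void, void, hello_ult);
        margo_registered_disable_response(mid, id, HG_TRUE);  // one-way RPC, no response

        margo_wait_for_finalize(mid);  // serve RPCs until the instance is finalized
        return 0;
    }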

3.3 Argobots

Argobots is a threading/tasking framework developed at Argonne. It enables efficient use of massively multicore architectures, targeting Exascale supercomputers. It will be used to enable coroutine-style analysis plugins in Damaris.
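
To illustrate the coroutine style mentioned above, the sketch below runs several "analysis plugin" functions as Argobots user-level threads that cooperatively yield to one another on the main execution stream. It follows the public Argobots C API as I understand it (ABT_init, ABT_thread_create, ABT_thread_yield, ABT_thread_free, ...); it is illustrative only and not Damaris code.

    #include <abt.h>
    #include <cstdio>

    // Each "plugin" runs as an Argobots user-level thread (ULT) and yields between
    // processing steps, so many plugins can interleave on one execution stream.
    static void analysis_plugin(void* arg) {
        int id = *static_cast<int*>(arg);
        for (int step = 0; step < 3; ++step) {
            std::printf("plugin %d: processing step %d\n", id, step);
            ABT_thread_yield();  // cooperatively hand the core to another plugin
        }
    }

    int main(int argc, char** argv) {
        ABT_init(argc, argv);

        ABT_xstream xstream;
        ABT_pool pool;
        ABT_xstream_self(&xstream);
        ABT_xstream_get_main_pools(xstream, 1, &pool);

        const int n = 4;
        ABT_thread threads[n];
        int ids[n];
        for (int i = 0; i < n; ++i) {
            ids[i] = i;
            ABT_thread_create(pool, analysis_plugin, &ids[i], ABT_THREAD_ATTR_NULL, &threads[i]);
        }
        for (int i = 0; i < n; ++i) ABT_thread_free(&threads[i]);  // joins and releases each ULT

        ABT_finalize();
        return 0;
    }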

The thesis will be mainly hosted by the KerData team at Inria Rennes Bretagne Atlantique. It will include collaborations with Argonne National Laboratory (which provides some of the tools for RPC, RDMA, and threading that we intend to use) and possibly with JAXA (which could provide a use case).

4. Mobility

The PhD position is based in Rennes, at the Inria/IRISA center. To be eligible for some funding schemes, mobility may be a requirement (either before starting an international master's in France, or between the master's thesis and the PhD thesis). Candidates could be expected to be hosted for 3-month internships at other partners, e.g., Argonne National Laboratory (USA) or JAXA (to be confirmed/discussed).

5. Requirements of the candidate

  • An excellent Master's degree in computer science or equivalent
  • Strong knowledge of parallel and distributed systems
  • Knowledge of storage and (parallel/distributed) file systems
  • Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
  • Strong programming skills, in particular in C/C++ (including, if possible, C++14), and at least one scripting language (e.g. Python, Ruby)
  • Strong software design skills (knowledge of design patterns in C/C++)
  • Working experience in the areas of Big Data management, Cloud computing, HPC, is an advantage
  • Very good communication skills in oral and written English
  • Open-mindedness, strong integration skills and team spirit

6. How to Apply?

To apply, please email a cover letter, CV, contact addresses of at least two professional references, and copies of degree certificates to Dr. Gabriel Antoniu and Dr. Matthieu Dorier. Incomplete applications will not be considered or answered.

Bibliography

[1] M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf. “Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O”, In Proc. CLUSTER – IEEE International Conference on Cluster Computing, Sep 2012, Beijing, China. URL: https://hal.inria.fr/hal-00715252

[2] M. Dorier, R. Sisneros, T. Peterka, G. Antoniu, D. Semeraro, “Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework”, Proc. LDAV – IEEE Symposium on Large-Scale Data Analysis and Visualization, Oct 2013, Atlanta, USA. URL: https://hal.inria.fr/hal-00859603

[3] M. Dreher, T. Peterka, “Decaf: Decoupled Dataflows for In Situ High-Performance Workflows”, Technical Report, United States, doi:10.2172/1372113, https://www.osti.gov/servlets/purl/1372113/, 2017.

[4] Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, M. Parashar, N. Samatova, K. Schwan, A. Shoshani, M. Wolf, K. Wu, W. Yu, “Hello ADIOS: The Challenges and Lessons of Developing Leadership Class I/O Frameworks”, Concurrency and Computation: Practice and Experience, vol. 26, no. 7, pp. 1453-1473, 2013.

[5] M. Dreher, B. Raffin, “A Flexible Framework for Asynchronous In Situ and In Transit Analytics for Scientific Simulations”, Proc. CCGRID – ACM/IEEE International Symposium on Cluster, Cloud and Grid Computing, Chicago, IL, 2014.

Work start date: 
October 2019
Keywords: 
parallel/distributed computing, HPC, data analytics, in situ processing, Big Data, HPC/Big Data convergence
Place: 
IRISA - Campus universitaire de Beaulieu, Rennes