Bridging Supercomputers and Clouds at the Exascale Era Through Elastic Storage

Published on Wed 20/01/2021 - 16:01
Researchers and supervisors
François Tessier
Thesis director
Gabriel Antoniu
Thesis description
Weather forecasting is one of many areas where accuracy requirements can no longer be met by simulations running on a supercomputer alone. Production workflows such as the ECMWF weather forecasting workflow, which provides data to national institutions like Météo-France, now tend to mobilize every component of what is known as the digital continuum, along with every computational technique that can improve their predictions. As a result, data generation and computation are carried out at the Edge, on supercomputers and in the Cloud, and rely not only on traditional simulation but also on advanced machine learning algorithms and stream processing techniques. This evolution of scientific applications towards large-scale complex workflows has contributed to what has come to be known as a "data deluge". Ever-increasing amounts of data are read and written, whether to produce more accurate results or to feed new types of algorithms from artificial intelligence and data analytics. On supercomputers, which provide the traditional computing infrastructure for scientific workloads, this shift from a compute-centric to a data-centric paradigm has exposed significant data movement issues, in particular regarding I/O on storage systems.
Indeed, even though HPC systems are increasingly powerful, I/O bandwidth has declined in relative terms. Over the past ten years, the ratio of I/O bandwidth to computing power of the top three supercomputers has been divided by 9.6, while in some scientific computing centers the volume of data stored has been multiplied by 41 [1]. The design of the machines themselves accentuates this gap: while HPC systems commonly provide exclusive and dynamic access to compute nodes through a batch scheduler, storage resources are usually global and shared by concurrent applications, which leads to congestion and performance variability [2,3]. To mitigate this congestion, new tiers of memory and storage have been added to recently deployed large-scale architectures, increasing their complexity. These new tiers can take the form of node-local SSDs, burst buffers or dedicated storage nodes with network-attached storage technologies, to name a few. Harnessing this additional storage capacity is an active research topic, but little has been done on how to provision it efficiently [4].
However, while high-performance computing (HPC) systems were for years the predominant means of meeting the requirements of large-scale scientific workflows, some components have now moved away from supercomputers to Cloud-type infrastructures [5]. This migration has been mainly motivated by the Cloud's ability to perform data analysis tasks efficiently. From an I/O and storage perspective, Cloud computing is very different from on-premises supercomputers: direct access to resources is extremely limited because of the very high level of abstraction. Instead, users have access to various storage systems, potentially geographically distributed, that are built on top of these resources. Another major difference is that, unlike on HPC systems, cloud storage, network and computing resources are elastic and can be allocated on demand [6]. Finally, while the cost of using a supercomputer is, from the user's point of view, essentially expressed in node-hours deducted from a grant, access to the Cloud follows a pay-as-you-go model that must be taken into account, as data movements in particular are costly.
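To make the contrast between the two cost models concrete, here is a minimal Python sketch. The function names, parameters and the three-term cloud bill (compute, storage, outbound data movement) are illustrative assumptions; no real provider's pricing or allocation policy is implied.

```python
def hpc_cost_node_hours(nodes: int, wall_hours: float) -> float:
    """Grant consumption on a supercomputer, expressed in node-hours."""
    return nodes * wall_hours


def cloud_cost(
    instance_hours: float,
    hourly_rate: float,        # hypothetical price per instance-hour
    stored_gb_months: float,
    storage_rate: float,       # hypothetical price per GB-month stored
    egress_gb: float,
    egress_rate: float,        # hypothetical price per GB moved out
) -> float:
    """Pay-as-you-go bill: compute + storage + data movement."""
    compute = instance_hours * hourly_rate
    storage = stored_gb_months * storage_rate
    movement = egress_gb * egress_rate
    return compute + storage + movement


if __name__ == "__main__":
    # Placeholder figures, chosen only to exercise the functions.
    print(hpc_cost_node_hours(nodes=128, wall_hours=2.5), "node-hours")
    print(cloud_cost(10, 2.0, 100, 0.5, 500, 0.5), "currency units")
```

The point of the sketch is structural: the HPC side has a single term, whereas the cloud side has a data movement term that grows with every transfer between the two infrastructures.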
Dealing with this high degree of heterogeneity, spread across two worlds with very different philosophies, is therefore a real challenge for scientific workflows and applications. This PhD thesis aims to address this issue from the perspective of resource provisioning. Through intelligent scheduling algorithms, we want to enable workflows to seamlessly use elastic storage systems [7] on hybrid infrastructures combining HPC systems and the Cloud. Multiple criteria can be taken into account beyond performance alone, such as financial cost or energy. These algorithms will need to rely on a resource abstraction model that also needs to be devised. Collaborations (e.g. with Argonne National Laboratory, USA) may bring artificial intelligence techniques, for example reinforcement learning, to the envisioned scheduling algorithms. In general, there will be a strong emphasis on international collaborations during this PhD thesis.
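As an illustration of what multi-criteria provisioning over an abstract resource model could look like, here is a minimal Python sketch. It is not the method the thesis will develop: the `StorageOption` class, its fields, the weighted-sum score and the example figures are all hypothetical, and a weighted sum is only one of many ways to combine criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    """Hypothetical abstraction of a provisionable storage resource."""
    name: str
    bandwidth_gb_s: float  # aggregate bandwidth, in GB/s
    cost_per_tb: float     # financial cost per TB (0 for grant-funded tiers)
    energy_per_tb: float   # energy per TB, arbitrary units


def score(option, data_tb, weights):
    """Weighted sum of transfer time, cost and energy; lower is better.

    The weights also absorb unit differences between the three criteria.
    """
    w_time, w_cost, w_energy = weights
    transfer_time_s = data_tb * 1000.0 / option.bandwidth_gb_s
    return (w_time * transfer_time_s
            + w_cost * option.cost_per_tb * data_tb
            + w_energy * option.energy_per_tb * data_tb)


def select_storage(options, data_tb, weights):
    """Return the option with the lowest weighted score."""
    return min(options, key=lambda o: score(o, data_tb, weights))


if __name__ == "__main__":
    # Placeholder figures, not measurements.
    options = [
        StorageOption("burst_buffer", 100.0, 0.0, 5.0),
        StorageOption("cloud_object_store", 10.0, 20.0, 2.0),
    ]
    best = select_storage(options, data_tb=10.0, weights=(1.0, 0.0, 0.0))
    print(best.name)
```

Changing the weights shifts the decision: with these placeholder values, a time-only weighting favors the faster tier, while an energy-only weighting favors the other one.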
The PhD position is mainly based in Rennes, at IRISA/Inria within the KerData research team. The selected candidate will have the opportunity to join a very dynamic group in a stimulating work environment with many active national, European and international collaborations, as part of cutting-edge international projects in the areas of Exascale Computing, Cloud Computing, Big Data and Artificial Intelligence. The candidate is also expected to spend 3 to 6 months on internships abroad to strengthen the international visibility of their work and to benefit from the expertise of other researchers in the field.
Candidate requirements
- An excellent Master's degree in computer science or equivalent
- Strong knowledge of distributed systems
- Knowledge of storage and (distributed) file systems
- Ability and motivation to conduct high-quality research, including publishing the results in relevant venues
- Strong programming skills (Python, C/C++)
- Work experience in Big Data management, Cloud computing or HPC is an advantage
- Very good communication skills in oral and written English
- Open-mindedness, strong integration skills and team spirit

How to apply?

Send an email with a cover letter, a CV, the contact details of at least two references (internship supervisor, teacher in a related field, …) and copies of degree certificates to Dr. François Tessier and Dr. Gabriel Antoniu. Incomplete applications will not be considered or answered.

Start date
as soon as possible
Inria Rennes - Bretagne Atlantique, France
[1] G. K. Lockwood, D. Hazen, Q. Koziol, R. S. Canon, K. Antypas, and J. Balewski. "Storage 2020: A Vision for the Future of HPC Storage". Report LBNL-2001072. Lawrence Berkeley National Laboratory, 2017.
[2] O. Yildiz, M. Dorier, S. Ibrahim, R. Ross, and G. Antoniu. "On the Root Causes of Cross-Application I/O Interference in HPC Storage Systems". In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2016, pp. 750–759.
[3] F. Tessier and V. Vishwanath. "Reproducibility and Variability of I/O Performance on BG/Q: Lessons Learned from a Data Aggregation Algorithm". United States, 2017. doi:10.2172/1414287.
[4] F. Tessier, M. Martinasso, M. Chesi, M. Klein, and M. Gila. "Dynamic Provisioning of Storage Resources: A Case Study with Burst Buffers". In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2020). New Orleans, United States, May 2020.
[5] G. Antoniu et al. "ETP4HPC’s SRA 4: Strategic Research Agenda for High-Performance Computing in Europe". 2020.
[6] P. Ruiu, G. Caragnano, and L. Graglia. "Automatic Dynamic Allocation of Cloud Storage for Scientific Applications". In: 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems. 2015, pp. 209–216.
[7] N. Cheriere. "Towards Malleable Distributed Storage Systems: From Models to Practice". PhD thesis. École normale supérieure de Rennes, Nov. 2019.
Keywords: HPC/Cloud convergence, storage, scheduling, abstraction