Federated queries for Precision Medicine

Team and supervisors
Department / Team: 
Team Web Site: 
http://www.irisa.fr/dyliss
PhD Director
Olivier Dameron
Co-director(s), co-supervisor(s)
Contact(s)
Name: Olivier Dameron
Email address: olivier.dameron@univ-rennes1.fr
Phone number: 02 99 84 74 46
PhD subject
Abstract

Precision medicine aims to tailor medical treatments to the individual characteristics of each patient [1]. It typically consists of selecting appropriate and optimal therapies based on a combination of the patient’s genetic, molecular, cellular and phenotypic profile [2]. Reconciling the complexity of diseases with patient-specific data requires an integrated approach [3] and remains a major open challenge for data science [4]. There are currently more than 1500 life science reference databases [5], which are complementary and necessary for scientific activity [6]. However, most of these databases have been designed independently, have heterogeneous schemas and rely on technologies that do not support interoperability [7].

Over the last decade, Semantic Web technologies have established a relevant framework for addressing both the interoperability and the scalability issues [8, 9]. This led to the emergence of the Linked Data initiative [10], which makes it possible to combine data from multiple RDF repositories [11-15]. Currently, life science is the largest and densest subdomain of the Linked Open Data (LOD) cloud, with the ongoing conversion of its reference databases into triplestores [16-18] and the creation of comprehensive dataset repositories such as BioPortal [19]. This is a cornerstone for enabling cancer genomics discovery at the petabyte scale [20]. The SPARQL query language offers unified access to these RDF datasets [21], which makes it easier to retrieve information from these complementary datasets [22, 23].

Federated SPARQL queries facilitate the combination of information from multiple triplestores by providing a unified view of their datasets [24-26]. However, performance is currently a major open challenge for SPARQL query engines [27, 28]. Reconciling (1) the need for rich queries, (2) the capability to combine datasets, (3) the volume and complexity of data, and (4) acceptable response time is therefore a major IT challenge. The life sciences are the ideal domain for developing a general breakthrough, as they have both extensive, highly connected datasets and strong application needs.
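As an illustration of what a federated query looks like, the following minimal Python sketch assembles a SPARQL 1.1 query with one SERVICE block per endpoint. The endpoint URLs, the `ex:` prefix and the predicates are hypothetical placeholders, not resources named in this subject, and the sketch only builds the query text without sending it to any server.

```python
from typing import Dict, List

# Hypothetical vocabulary prefix, used only for illustration.
PREFIX = "PREFIX ex: <http://example.org/vocab#>"


def build_federated_query(variables: List[str],
                          patterns_by_endpoint: Dict[str, List[str]]) -> str:
    """Compose a SPARQL 1.1 query with one SERVICE block per endpoint."""
    blocks = []
    for endpoint, patterns in patterns_by_endpoint.items():
        # Each triple pattern is terminated by " ." inside its SERVICE block.
        body = "\n".join(f"    {pattern} ." for pattern in patterns)
        blocks.append(f"  SERVICE <{endpoint}> {{\n{body}\n  }}")
    select = " ".join(f"?{name}" for name in variables)
    where = "\n".join(blocks)
    return f"{PREFIX}\nSELECT {select} WHERE {{\n{where}\n}}"


# Two hypothetical endpoints joined on the shared variable ?disease.
query = build_federated_query(
    ["gene", "drug"],
    {
        "https://genes.example.org/sparql": ["?gene ex:associatedWith ?disease"],
        "https://drugs.example.org/sparql": ["?drug ex:treats ?disease"],
    },
)
print(query)
```

The engine evaluating such a query has to send each SERVICE block to its endpoint and join the returned fragments on the shared variable, which is where the response-time issue mentioned above arises.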

This PhD thesis aims to deliver an autonomous federated SPARQL query engine capable of handling complex life science queries.

We hypothesize that computing indexes for the endpoints improves query processing time, both by avoiding unnecessary subqueries and by reducing the costly join operations needed to reconcile the result fragments returned by the endpoints. Preliminary work gave encouraging results and suggests that the more complex the query, the more this approach pays off.
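The first part of this hypothesis, avoiding unnecessary subqueries, can be sketched in Python as follows. The sketch assumes a very simple index that records, for each endpoint, the set of predicates it contains; the actual content of the indexes is not specified in this subject, and the endpoint names and predicates below are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed index shape: endpoint name -> predicates present in that endpoint.
EndpointIndex = Dict[str, Set[str]]
TriplePattern = Tuple[str, str, str]  # (subject, predicate, object)


def select_sources(index: EndpointIndex,
                   patterns: List[TriplePattern]) -> Dict[TriplePattern, List[str]]:
    """Map each triple pattern to the endpoints that may answer it."""
    plan: Dict[TriplePattern, List[str]] = {}
    for pattern in patterns:
        predicate = pattern[1]
        if predicate.startswith("?"):
            # A variable predicate cannot be pruned with this index.
            plan[pattern] = sorted(index)
        else:
            plan[pattern] = sorted(
                endpoint for endpoint, predicates in index.items()
                if predicate in predicates
            )
    return plan


def count_subqueries(plan: Dict[TriplePattern, List[str]]) -> int:
    """Number of (pattern, endpoint) subqueries the plan would send."""
    return sum(len(endpoints) for endpoints in plan.values())


# Hypothetical endpoints and predicates.
index: EndpointIndex = {
    "genes": {"ex:associatedWith", "ex:locatedOn"},
    "drugs": {"ex:treats", "ex:targets"},
    "pathways": {"ex:partOf"},
}
patterns: List[TriplePattern] = [
    ("?gene", "ex:associatedWith", "?disease"),
    ("?drug", "ex:treats", "?disease"),
]

plan = select_sources(index, patterns)
naive = len(patterns) * len(index)  # every pattern sent to every endpoint
pruned = count_subqueries(plan)     # only endpoints the index points to
print(f"subqueries without index: {naive}, with index: {pruned}")
```

In this toy setting the index reduces the number of subqueries from one per pattern and endpoint to one per pattern. Fewer returned fragments also means fewer candidate rows to join, which is the second benefit the hypothesis refers to.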

The PhD thesis builds on the INRIA FederatedQueryScaler project (2017–2018) and will benefit from the MoDaL project supported by Biogenouest and the “RDF datahub for precision medicine” project supported by CominLabs (both starting in 2019), which will provide relevant scenarios and datasets.

Bibliography

[1] Euan A Ashley et al. Clinical assessment incorporating a personal genome. Lancet (London, England), 375(9725):1525–1535, 2010.

[2] Atul J Butte. It takes a genome to understand a village: Population scale precision medicine. Proceedings of the National Academy of Sciences of the United States of America, 113(44):12344–12346, 2016.

[3] Xiaoping Liu, Yuetong Wang, Hongbin Ji, Kazuyuki Aihara, and Luonan Chen. Personalized characterization of diseases using sample-specific networks. Nucleic acids research, 44(22):e164, 2016.

[4] Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M Hoffman, Wei Xie, Gail L Rosen, Benjamin J Lengerich, Johnny Israeli, Jack Lanchantin, Stephen Woloszynek, Anne E Carpenter, Avanti Shrikumar, Jinbo Xu, Evan M Cofer, Christopher A Lavender, Srinivas C Turaga, Amr M Alexandari, Zhiyong Lu, David J Harris, Dave DeCaprio, Yanjun Qi, Anshul Kundaje, Yifan Peng, Laura K Wiley, Marwin H S Segler, Simina M Boca, S Joshua Swamidass, Austin Huang, Anthony Gitter, and Casey S Greene. Opportunities and obstacles for deep learning in biology and medicine. Journal of the Royal Society, Interface, 15(141), 2018. In press.

[5] Michael Y Galperin, Daniel J Rigden, and Xosé M Fernández-Suárez. The 2015 nucleic acids research database issue and molecular biology database collection. Nucleic acids research, 43(Database issue):D1–D5, 2015.

[6] Nicola Cannata, Emanuela Merelli, and Russ B. Altman. Time to organize the bioinformatics resourceome. PLoS Computational Biology, 1(7):0531–0533, 2005.

[7] David Gomez-Cabrero, Imad Abugessaisa, Dieter Maier, Andrew Teschendorff, Matthias Merkenschlager, Andreas Gisel, Esteban Ballestar, Erik Bongcam-Rudloff, Ana Conesa, and Jesper Tegnér. Data integration in the era of omics: current and future challenges. BMC systems biology, 8 Suppl 2:I1, 2014.

[8] Nigel Shadbolt, Wendy Hall, and Tim Berners-Lee. The semantic web revisited. IEEE Intelligent Systems, pages 96–101, 2006.

[9] Tim Berners-Lee, Wendy Hall, James A. Hendler, Kieron O’Hara, Nigel Shadbolt, and Daniel J. Weitzner. A framework for web science. Foundations and Trends in Web Science, 1(1):1–130, 2007.

[10] Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data–the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.

[11] Susie Stephens, David LaVigna, Mike DiLascio, and Joanne Luciano. Aggregation of bioinformatics data using semantic web technology. Journal of Web Semantics, 4(3), 2006.

[12] Satya S. Sahoo, Olivier Bodenreider, Kelly Zeng, and Amit Sheth. An experiment in integrating large biomedical knowledge resources with RDF: Application to associating genotype and phenotype information. In Proceedings of the WWW2007 Workshop on Health Care and Life Sciences Data Integration for the Semantic Web, 2007.

[13] Evelyn Camon, Michele Magrane, Daniel Barrell, Vivian Lee, Emily Dimmer, John Maslen, David Binns, Nicola Harte, Rodrigo Lopez, and Rolf Apweiler. The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology. Nucleic acids research, 32(Database issue):D262–D266, 2004.

[14] David P Hill, Nico Adams, Mike Bada, Colin Batchelor, Tanya Z Berardini, Heiko Dietze, Harold J Drabkin, Marcus Ennis, Rebecca E Foulger, Midori A Harris, Janna Hastings, Namrata S Kale, Paula de Matos, Christopher J Mungall, Gareth Owen, Paola Roncaglia, Christoph Steinbeck, Steve Turner, and Jane Lomax. Dovetailing biology and chemistry: integrating the Gene Ontology with the ChEBI chemical ontology. BMC genomics, 14:513, 2013.

[15] Kevin M Livingston, Michael Bada, William A Baumgartner, and Lawrence E Hunter. Kabob: ontology-based semantic integration of biomedical databases. BMC bioinformatics, 16:126, 2015.

[16] Nicole Redaschi and Consortium UniProt. UniProt in RDF: Tackling data integration and distributed annotation with the semantic web. In 3rd International Biocuration Conference, 2009. Available from Nature Precedings.

[17] Simon Jupp, James Malone, Jerven Bolleman, Marco Brandizi, Mark Davies, Leyla Garcia, Anna Gaulton, Sebastien Gehant, Camille Laibe, Nicole Redaschi, Sarala M Wimalaratne, Maria Martin, Ewan Birney, and Andrew M Jenkinson. The ebi rdf platform: linked open data for the life sciences. Bioinformatics (Oxford, England), 30(9):1338–1339, 2014.

[18] Neil Swainston, Janna Hastings, Adriano Dekker, Venkatesh Muthukrishnan, John May, Christoph Steinbeck, and Pedro Mendes. libchebi: an api for accessing the chebi database. Journal of cheminformatics, 8:11, 2016.

[19] Patricia L Whetzel, Natalya F Noy, Nigam H Shah, Paul R Alexander, Csongor Nyulas, Tania Tudorache, and Mark A Musen. BioPortal: enhanced functionality via new web services from the national center for biomedical ontology to access and use ontologies in software applications. Nucleic acids research, 39(Web Server issue):W541–W545, 2011.

[20] Jovan Cejovic, Jelena Radenkovic, Vladimir Mladenovic, Adam Stanojevic, Milica Miletic, Stevan Radanovic, Dragan Bajcic, Dragan Djordjevic, Filip Jelic, Milos Nesic, Jessica Lau, Patrick Grady, Nick Groves-Kirkby, Deniz Kural, and Brandi Davis-Dusenbery. Using semantic web technologies to enable cancer genomics discovery at petabyte scale. Cancer informatics, 17:1176935118774787, 2018.

[21] Manuel Salvadores, Matthew Horridge, Paul R Alexander, Ray W Fergerson, Mark A Musen, and Natalya F Noy. Using SPARQL to query Bioportal ontologies and metadata. In Proceedings of the International Semantic Web Conference ISWC 2012, volume 7650 of Lecture Notes in Computer Science, pages 180–195, 2012.

[22] Huajun Chen, Tong Yu, and Jake Y Chen. Semantic web meets integrative biology: a survey. Briefings in bioinformatics, 14(1):109–125, 2012.

[23] Hirokazu Chiba, Hiroyo Nishide, and Ikuo Uchiyama. Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data. PloS one, 10(4):e0122802, 2015.

[24] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to semantic web query federation in the life sciences. BMC bioinformatics, 10 Suppl 10:S10, 2009.

[25] Marija Djokic-Petrovic, Vladimir Cvjetkovic, Jeremy Yang, Marko Zivanovic, and David J Wild. PIBAS FedSPARQL: a web-based platform for integration and exploration of bioinformatics datasets. Journal of biomedical semantics, 8(1):42, 2017.

[26] Thierry Lombardot, Anne Morgat, Kristian B Axelsen, Lucila Aimo, Nevila Hyka-Nouspikel, Anne Niknejad, Alex Ignatchenko, Ioannis Xenarios, Elisabeth Coudert, Nicole Redaschi, and Alan Bridge. Updates in rhea: Sparqling biochemical reaction data. Nucleic acids research, 2018. In press.

[27] Ali Hasnain, Qaiser Mehmood, Syeda Sana E Zainab, Muhammad Saleem, Claude Warren, Durre Zehra, Stefan Decker, and Dietrich Rebholz-Schuhmann. BioFed: federated query processing over life sciences linked open data. Journal of biomedical semantics, 8(1):13, 2017.

[28] Yasar Khan, Muhammad Saleem, Muntazir Mehdi, Aidan Hogan, Qaiser Mehmood, Dietrich Rebholz-Schuhmann, and Ratnesh Sahay. SAFE: SPARQL federation over rdf data cubes with access control. Journal of biomedical semantics, 8(1):5, 2017.

 

Work start date: 
October 2019
Keywords: 
Semantic Web, federated queries, linked open data, RDF, SPARQL, precision medicine
Place: 
IRISA - Campus universitaire de Beaulieu, Rennes