Learning grammars with long-distance correlations on proteins

Team and supervisors
Department / Team: 
PhD Director
Jacques Nicolas
Co-director(s), co-supervisor(s)
François Coste
Contact(s)
Name: François Coste
Email address: francois.coste@inria.fr
Phone number: (33|2) 99 847 491
PhD subject
Abstract

Proteins, which play a major role in nearly every cellular process, have always been a central focus in biology. Thanks to advances in sequencing technologies, the number of available protein sequences is rapidly increasing, but their functional characterization remains a major challenge. To assist classical in vivo or in vitro experimental approaches, computational methods that predict the function(s) of sequences in silico have been developed and are now routinely used to annotate newly sequenced genomes, typically with profile hidden Markov models (pHMMs). Despite these advances, the function of a large number of proteins is still unknown or not known precisely enough, as reported for instance by [1]: “About 16 and 30% of proteins are unannotated in bacteria and yeast genomes. In eukaryotes, over 40% of the proteins encoded by genomes is reported to lack functional annotation” or by [2]: “Even in Arabidopsis thaliana, only approximately 40% of enzyme- and transporter-encoding genes have credible functional annotations, and this number is even lower in nonmodel plants”.

One of the main limitations of models such as the pHMMs used by state-of-the-art methods is that the likelihood of an amino acid at one position does not depend on the amino acids at other positions, although it is well known that some positions that are distant in the sequence can be in contact in the 3D structure of the protein and co-evolve (see for instance [3]). To overcome this limitation, we propose in this thesis to study how and when grammatical models capturing long-distance correlations can be inferred by machine learning programs to improve the functional annotation of protein sequences. This study will build on the expertise of our team in learning grammars that model protein families [4], notably a first successful tool that learns automata, modelling essentially local correlations, from protein sequences [5, 6], and recent, surprisingly good preliminary results obtained with a simple approach that learns context-free grammars capturing more distant, but non-crossing, correlations [7, 8]. The goal will be to design an efficient and characterizable algorithm that learns accurate grammars, ideally modelling crossing correlations, on the basis of statistical evidence in available protein sequences (for instance by direct coupling analysis [9]) and possibly in available 3D structures (for instance in the fragments in contact introduced in [10]).
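
To make the distinction between non-crossing and crossing correlations concrete, here is a minimal Python sketch. It is only an illustration and is not taken from the tools cited above: the function name `crossing_pairs` and the example positions are hypothetical. Each correlation is represented as a pair of sequence positions, and two pairs (i, j) and (k, l) are said to cross when i < k < j < l.

```python
from itertools import combinations


def crossing_pairs(pairs):
    """Return every couple of correlated-position pairs that cross each other.

    Two pairs (i, j) and (k, l), with i < j and k < l, cross when
    i < k < j < l: they interleave instead of being nested or disjoint.
    """
    # Put each pair in increasing order, then sort the pairs themselves.
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        # After sorting, i <= k, so only one interleaving order needs checking.
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


# Hypothetical correlated positions (0-based indices in a sequence).
nested = [(2, 20), (5, 15), (7, 9)]   # nested like brackets: no crossing
interleaved = [(2, 15), (5, 20)]      # 2 < 5 < 15 < 20: one crossing

print(crossing_pairs(nested))
print(crossing_pairs(interleaved))
```

Pairs of the first kind nest like matching brackets, which is the kind of dependency that context-free grammars represent naturally. Pairs of the second kind interleave, which is why modelling them is stated above as a more ambitious goal.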

Bibliography

[1] K. H. Dhanyalakshmi, Mahantesha B. N. Naika, R. S. Sajeevan, Oommen K. Mathew, K. Mohamed Sha, Ramanathan Sowdhamini, and Karaba N. Nataraja. An approach to function annotation for proteins of unknown function (PUFs) in the transcriptome of Indian mulberry. PLOS ONE, 2016.

[2] Thomas D. Niehaus, Antje M.K. Thamm, Valérie de Crécy-Lagard, and Andrew D. Hanson. Proteins of unknown biochemical function: A persistent problem and a roadmap to help overcome it. Plant Physiology, 2015.

[3] Mingcong Wang, Maxim V. Kapralov, and Maria Anisimova. Coevolution of amino acid residues in the key photosynthetic enzyme Rubisco. BMC Evolutionary Biology, 2011.

[4] François Coste. Learning the language of biological sequences. In Topics in Grammatical Inference. Springer-Verlag, 2016.

[5] François Coste and Goulven Kerbellec. A similar fragments merging approach to learn automata on proteins. In 16th European Conference on Machine Learning, 2005.

[6] Goulven Kerbellec. Learning automata modelling families of protein sequences. PhD thesis, Université Rennes 1, June 2008.

[7] François Coste, Gaëlle Garet, and Jacques Nicolas. A bottom-up efficient algorithm learning substitutable languages from positive examples. In 12th International Conference on Grammatical Inference, 2014.

[8] Gaëlle Garet. Classification and characterization of enzymatic families with formal methods. PhD thesis, Université Rennes 1, December 2014.

[9] Faruck Morcos, Terence Hwa, José N. Onuchic, and Martin Weigt. Direct coupling analysis for protein contact prediction. Springer New York, 2014.

[10] Clovis Galiez. Structural fragments: comparison, predictability from the sequence and application to the identification of viral structural proteins. PhD thesis, Université Rennes 1, December 2015.

Work start date: 
As soon as possible
Keywords: 
Machine learning, protein sequences, weighted grammars, statistical evidence, long distance correlations.
Place: 
IRISA - Campus universitaire de Beaulieu, Rennes