Learning grammars on genomic sequences
Internship subject:

David Searls has long advocated a linguistic approach to modeling genomic sequences [1]. Such models are sometimes designed by hand by experts. In our team, we study how to design them automatically by machine learning, and we have proposed a successful approach for learning automata on protein sequences [2,3]. The subject of the internship is to study how this approach can be extended to learn more expressive grammars [4,5,6], which make it easier to model long-distance correlations. The proposed algorithm will be implemented and tested on real genomic datasets.

Keywords: Machine learning, Bioinformatics, Formal Grammars

Duration: 6 months

Prerequisites: Master's studies in computer science or equivalent. This is a research subject: applicants should be able to continue with a PhD thesis after the internship.

