|
|
|
|
|
Optimized Algorithms on Parallel Specialized Architectures
ReMiX project: Reconfigurable memory for indexing genomic banks Participants: Dominique Lavenier, Pierre Peterlongo, Gilles Georges, Julien Jacques, Stéphane Rubini.
BLAST has steadily become the reference software for exploring genomic banks. This type of algorithm and many other algorithms such as PATTERNHUNTER or CHAOS proceed in two steps: first they seek for anchors, then they extend them into alignments. The load balancing between this two tasks depends on the quality of the anchors.
More generally, the problem of mining genomic banks is either bounded by the data access (the time for scanning all the banks) or the computation time (the time to detect good anchors). We address this problem following two complementary ways: (1) speeding-up the anchor detection using reconfigurable hardware; (2) speeding-up the data access using indexing techniques ambedded with FLASH technology.
Compared to the previous RDISK project, the ReMiX project goes one step further by addressing the data access problem. The idea, here, is not to duplicate disk accesses, but to propose a hardware mechanism allowing fast random accesses to Gigabytes of data. In that way, indexing techniques accessing only a fraction of the bank become highly efficient.
In the ReMIX architecture, hard drives (RDISK project) are replaced by FLASH memories whose access time are 2 or 3 orders of magnitude shorter. In the same way, data bandwidth is increased by accessing simultaneously a large number of FLASH memories. As in the RDISK project, data are processed on-the-fly by reconfigurable hardware directly connected to the memory. Note that the reconfigurable index memory does not fit in the addressing space of the processor but it is indirectly accessed by specific queries. The whole system hold 512 Gbytes of FLASH memory.
Two main genomic applications have been implemented, illustrating the potential of the ReMIX concept. In the first application, the FLASH memory contains the index of large DNA banks, allowing fast retreival compared to traditionnal approaches (speed up ranging from 20 to 50). The second application is related to intensive comparison of a large set of proteins against the human genome. The two datasets are fully indexed and compared using the FLASH memory as a temporary storage support. Again, high performances have been exhibited: plugging only one ReMIX board on a standard PC decreases the computation time by about 50.
In 2007, we have addressed the problem of optimizing data indexing of huge banks into the ReMIX memory. This has been performed through an Arc INRIA (called FLASH). Our research has been oriented in direction of subset seeds, based on the work of L. Noé and G. Kucherov (LIFL). A subset seed is composed by non-consecutive flexible characters. This kind of seed leads to better results than standard seeds but their design is much more difficult. Thus, in order to quickly compute subset seeds or a set of subset seeds characteristics, we have built large bank of exhaustive alignments by intensive computing of Smith-Waterman alignments. This corpus, with the use of the YASS program developed in the Sequoia team (LIFL), guided us for constructing several efficient sets of subset seeds, having a sensibility slightly better that the one used by Blastp. Those seeds were implemented on the ReMIX machine. The tests performed allowed us to notice a 16X speed-up factor. This speed-up is obtained thanks to a 13 speed-up hardware factor, while the pure algorithmic approach leads to a 24% additional speed-up.
Parallelization of bioinformatics computation Participants: Dominique Lavenier, Van Hoa Nguyen, Guillaume Rizk, Anthony Assi, Alexandre Cornu, Gilles Georges, Julien Jacques, Ricardo Ascensio, Xianchun Ye.
This section details the different hardware we are currently investigating for supporting parallelization of various bioinformatics computations. We successively present FPGA, GPU, multicore and cluster approaches.
FPGA. With the increasing amount of available complete genomes, the need for inter or intra genome comparison is now a reality and supercomputer manufacturers now propose to include dedicated accelerator boards in their machines. Through the ANR PARA project, in cooperation with the BULL R&D team (Les Clayes sous bois), we are currently designing a reconfigurable accelerator tightly interconnected to their system.
Even if FPGA component offer consequent processing power, one challenge remains to have these resources fed from the main memory at a very high speed. The PCIexpress interface, and especially the second PCIe generation, achieves this goal by providing an agregated bandwidth of 10 Gbytes/sec. In 2007, we have developped the concept of "reconfigurable interface" for automatically generating high performance PCIexpress channels between reconfigurable operators and host processor. From a high description language, based on smart transfer actors, a generator will deliver a synthetizable VHDL code able to achieve high speed communication trough DMA channels. In the context of the PARA project, a parallel operator made of 512 pico-processors dedicated to protein sequence comparison will be connected to this "reconfigurable interface".
GPUs. We have adapted the seed heuristic algorithm on one of the latest available graphics board (NVIDIA GeForce 8800 GTX). This adaptation is not straightforward. It needs to exhibit important low-level parallelism suitable to multithread execution. Starting from a matrix implementation on GPU (using the CUDA programming model), the seed-heuristics code has been modified to fit with the execution model of the GPU chip (a mutiprocessor SIMD architecture). Performances show a speed-up ranging from 5 to 10. These first experiments are very promising and a PhD thesis has been started on this topic (G. Rizk).
Multicore chip. For the last 2-3 years, processor performance growth have been limited due to the difficulty of steadily increasing the clock frequency. Manufacturers and worldwild R&D team already envision multicore processors of hundred (or more) processors. These new architectures will really be efficient for high performance computation if code is targeted for these hardware structures.
Having these objectives in mind, we have revisited several bioinformatics codes in order to make them suitable for an efficient implementation on these architectures. In october 2007, we have started a collaboration with ICT (Institute for Computing Technology), Beijing, China, for implementing a parallel code for intensive genomic sequence comparison on the Godson-T multicore chip developed in this Institute.
Cluster. In the Europeen ACGT project we actively participate to the development of a simulator for modelling cancer tumor development. This activity, referred as "In Silico Oncology" is a complex and multiscale combination of sciences and technologies in order to simulate malignant tumour growth and normal tissue response to therapeutic modalities at all levels of biocomplexity. The aim is to better understand cancer and related phenomena and to optimize therapeutic interventions by performing in silico (on the computer) experiments based on the individual data (clinical, imaging, histopathologic, molecular) of the patient.
We have in charge parallelizing the OncoSimulator code on the ACGT grid, and more specifically on nodes having huge computational resources (cluster) either for providing a better interactivity (shorter response time) or for providing a better accuracy of the model in a reasonable amount of time.
Efficient algorithms based on classical optimization technics Participants: Rumen Andonov, Nicola Yanev.
This research axis aims to design efficient algorithms based on classical optimization technics (like Dynamic Programing (DP), Branch&Bounds (B&B), Lagrangian Relaxation (LR), diverse Heuristics etc) for solving well known and largely studied optimization problem : the knapsack problem. Our last paper presents a preprocessing procedure for the 0-1 multidimensional knapsack problem. First, a non-increasing sequence of upper bounds is generated by solving LP-relaxations. Then, a non-decreasing sequence of lower bounds is built using dynamic programming. The comparison of the two sequences allows either to prove that the best feasible solution obtained is optimal, or to fix a subset of variables to their optimal values. In addition, a heuristic solution is obtained. Computational experiments with a set of large-scale instances show the efficiency of our reduction scheme. Particularly, it is shown that our approach allows to reduce the CPU time of the leading commercial software CPLEX of ILOG.
|
|