Dynamic Binary Translation for Coarse-Grain Reconfigurable Architectures

Team and supervisors
Team Web Site: 
https://team.inria.fr/cairn/
PhD Director
Steven Derrien
Co-director(s), co-supervisor(s)
Contact(s)
Name: Steven Derrien
Email address: steven.derrien@irisa.fr
Phone number: 0299847460
PhD subject
Abstract

With the end of Dennard scaling, heterogeneous multi-cores (e.g., mixing embedded CPUs and DSPs) have proven to be an attractive way to reach better trade-offs between performance and energy. However, such heterogeneity also has drawbacks: (i) programming is more challenging, and (ii) dynamic workload balancing (using task migration) is much less flexible.

To address these shortcomings, hardware vendors propose to hide architectural heterogeneity from programmers and runtimes behind a homogeneous programming model, by using the same ISA for all cores. This is the case of the ARM big.LITTLE architecture [5], which combines within a single platform high-performance out-of-order (OoO) cores (big) with simpler, lower-power in-order cores (LITTLE). Thanks to binary compatibility between cores, programming and runtime management are greatly simplified.

Although the combined use of heterogeneity and Dynamic Voltage and Frequency Scaling (DVFS) enables subtle performance/energy trade-offs, such a platform does not really take advantage of the diversity encountered in workloads. For example, modern embedded application workloads consist of many hotspots, which range from control-dominated kernels to compute-intensive ones. Whereas OoO cores are a perfect match for the former, the latter would be better targets for hardware acceleration using Coarse-Grain Reconfigurable Architectures (CGRAs).

The goal of this PhD thesis is to propose a new kind of heterogeneous multi-core, in which the OoO/in-order heterogeneity is enriched with VLIW cores (as in NVIDIA's Denver [1]) and CGRA accelerators. Such specialized cores help process compute-intensive workloads for a significantly lower energy budget than their OoO counterparts. However, they break the single-ISA property: in both VLIW and CGRA architectures, Instruction-Level Parallelism (ILP) must be explicit in the binary code.
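To illustrate what "explicit ILP" means for a VLIW target, here is a minimal sketch of a greedy scheduler that packs a sequential, RISC-like instruction stream into fixed-width bundles. It is only an illustration under strong assumptions: straight-line code, register dependences only (no memory or branch modelling), single-cycle latencies, and reads happening before writes within a bundle. The names `Instr` and `bundle` are hypothetical and are not taken from any existing toolchain.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, used only for display
    dst: Optional[str]      # destination register, or None
    srcs: Tuple[str, ...]   # source registers


def bundle(instrs: List[Instr], width: int = 4) -> List[List[Instr]]:
    """Greedily place each instruction in the earliest legal, non-full bundle."""
    bundles: List[List[Instr]] = []
    last_write = {}  # register -> bundle index of its latest writer
    last_read = {}   # register -> highest bundle index that reads it
    for ins in instrs:
        earliest = 0
        for r in ins.srcs:
            # RAW: a value is readable one bundle after it was written.
            earliest = max(earliest, last_write.get(r, -1) + 1)
        if ins.dst is not None:
            # WAW: strictly after the previous writer of the same register.
            earliest = max(earliest, last_write.get(ins.dst, -1) + 1)
            # WAR: not before the last reader (same bundle is allowed).
            earliest = max(earliest, last_read.get(ins.dst, 0))
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1  # skip bundles whose issue slots are all taken
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(ins)
        for r in ins.srcs:
            last_read[r] = max(last_read.get(r, 0), slot)
        if ins.dst is not None:
            last_write[ins.dst] = slot
    return bundles


prog = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("mul", "r4", ("r5", "r6")),
    Instr("sub", "r7", ("r1", "r4")),
    Instr("add", "r8", ("r2", "r5")),
]
for i, b in enumerate(bundle(prog, width=2)):
    print(i, [ins.op for ins in b])
```

With a 2-wide target, the two independent operations share the first bundle and the dependent `sub` is pushed to the second one. This grouping is precisely the information that a single-ISA binary does not carry and that a translator has to recover.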

We propose to alleviate this issue by resorting to Dynamic Binary Translation (DBT). In our context, DBT is used to translate from a host ISA (RISC-like) to a guest CGRA accelerator. This approach has already been investigated in the context of VLIW, but applying DBT to CGRAs poses even greater challenges due to the lack of an instruction set, the need for a place-and-route stage, and the possibilities offered by dynamic reconfiguration. For example, existing scheduling/placement/routing algorithms may need to run for tens of minutes to find an optimized solution [10]. Because of this, only simple heuristics can be used [2, 8, 13] when compilation has to be performed online.
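As a rough idea of what a "simple heuristic" for online placement could look like, the sketch below greedily maps the nodes of a small dataflow graph onto a 2D grid of processing elements (PEs). It is not the method of [2, 8, 13]. It assumes a purely spatial mapping (one operation per PE, no time multiplexing), nodes given in topological order, and routing cost approximated by Manhattan distance. The function name `greedy_place` and the graph encoding are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

PE = Tuple[int, int]  # (row, column) of a processing element


def greedy_place(dfg: Dict[str, List[str]], rows: int, cols: int):
    """Place each node on the free PE closest to its already-placed predecessors.

    dfg maps each node to its predecessors; keys must be in topological order.
    Returns (placement, total Manhattan wire length).
    """
    free = set(product(range(rows), range(cols)))
    place: Dict[str, PE] = {}
    total = 0
    for node, preds in dfg.items():
        if not free:
            raise ValueError("more operations than processing elements")

        def cost(pe: PE) -> int:
            # Sum of distances from this PE to every predecessor's PE.
            return sum(abs(pe[0] - place[p][0]) + abs(pe[1] - place[p][1])
                       for p in preds)

        best = min(sorted(free), key=cost)  # sorted() makes ties deterministic
        free.remove(best)
        place[node] = best
        total += cost(best)
    return place, total


# c = f(a, b); d = g(c), mapped on a 2x2 array.
dfg = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
placement, wirelength = greedy_place(dfg, rows=2, cols=2)
print(placement, wirelength)
```

A single pass like this runs in time proportional to the number of nodes times the number of PEs. That makes it cheap enough for run-time use, but it offers no guarantee on the quality of the mapping, which is the trade-off described above.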

More specifically, this PhD will address the problem of efficient dynamic compilation for VLIW and CGRA [9] cores. The idea is to propose a meet-in-the-middle solution that considers trade-offs at both the software (i.e., compiler) and hardware levels to offer a practical solution in a DBT context. The work will build on the Hybrid-DBT [11] toolchain, which supports run-time translation of RISC-V binaries on a VLIW guest.

Bibliography

[1] D. Boggs, G. Brown, N. Tuck, and K. S. Venkatraman, "Denver: Nvidia’s First 64-bit ARM Processor", IEEE Micro, 2015.

[2] M. Brandalero and A. C. S. Beck, "A Mechanism for Energy-efficient Reuse of Decoding and Scheduling of x86 Instruction Streams", in DATE’17, Lausanne, 2017.

[3] J. C. Dehnert et al., "The Transmeta Code Morphing™ Software: Using Speculation, Recovery, and Adaptive Retranslation to Address Real-Life Challenges", in CGO’03.

[4] J. A. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Elsevier, 2005.

[5] P. Greenhalgh, "Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7", ARM White Paper, 2011.

[6] A. Grudnitsky, L. Bauer, and J. Henkel, "Efficient Partial Online Synthesis of Special Instructions for Reconfigurable Processors", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 2, pp. 594-607, Feb. 2017.

[7] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 5th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[8] W. Kim, Y. Choi, and H. Park, "Fast modulo scheduler utilizing patternized routes for coarse-grained reconfigurable architectures", ACM Transactions on Architecture and Code Optimization (TACO), vol. 10, no. 4, p. 58, 2013.

[9] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins, "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix", in Field Programmable Logic and Application, 2003, pp. 61-70.

[10] H. Park, K. Fan, S. A. Mahlke, T. Oh, H. Kim, and H. Kim, "Edge-centric modulo scheduling for coarse-grained reconfigurable architectures", in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, 2008.

[11] S. Rokicki, E. Rohou, and S. Derrien, "Hardware-Accelerated Dynamic Binary Translation", in Design, Automation Test in Europe Conference Exhibition (DATE), 2017.

[12] S. Rokicki, E. Rohou, and S. Derrien, "Dynamic Binary Translation for Heterogeneous Multi-Cores".

[13] M. A. Watkins, T. Nowatzki, and A. Carno, "Software Transparent Dynamic Binary Translation for Coarse-Grain Reconfigurable Architectures", in HPCA, 2016.

Work start date: 
1 October 2018
Keywords: 
heterogeneous multi-cores; VLIW; CGRA accelerator; Dynamic Binary Translation; energy efficiency
Place: 
IRISA - Campus universitaire de Beaulieu, Rennes