Vous êtes ici

Thread convergence prediction for SIMT architectures

Equipe et encadrants
Département / Equipe: 
DépartementEquipe
Site Web Equipe: 
https://team.inria.fr/pacap/
Directeur de thèse
André Seznec
Co-directeur(s), co-encadrant(s)
Sylvain Collange
Contact(s)
NomAdresse e-mailTéléphone
Sylvain Collange
sylvain.collange@inria.fr
+33 2 99 84 71 05
Sujet de thèse
Descriptif

Graphics Processing Units (GPUs) are no longer special-purpose units dedicated to image processing only. They are used as accelerators for many application domains including most of high-performance computing, but also big-data analytics. Modern GPUs are essentially highly-parallel programmable computing engines.

At runtime the GPU groups threads of SPMD programs and synchronizes them to execute the same instruction at the same time. Such a group of threads is referred to as a warp. This execution model, referred to as Single-Instruction, Multiple-Thread (SIMT) by NVIDIA, enables the use of energy-efficient SIMD execution units by factoring out control logic such as instruction fetch and decode pipeline stages for a whole warp. SIMT execution is the key enabler for the energy efficiency of GPUs.

As threads within a warp may follow different directions through conditional branches in the program, the warp must follow each taken path in turn, while disabling individual threads that do not participate. Following divergence, current GPU architectures attempt to restore convergence at the earliest program point following static annotations in the binary. However, this policy has been shown to be suboptimal in many cases, in which later convergence improves performance [4]. In fact, optimal convergence points depend on dynamic program behavior, so static decisions are unable to capture them.

The goal of this thesis is to design predictors that enable the microarchitecture to infer dynamic code behavior and place convergence points appropriately. Convergence predictors have analogies with branch predictors [5] and control independence predictors [3] studied in superscalar processor architecture, but they present one additional challenge: the thread runaway problem. Although a branch misprediction will be identified and repaired locally, a wrong thread scheduling decision may go unnoticed and delay convergence by thousands of instructions. To address the thread runaway problem, we plan to explore promise-based speculation and recovery strategies. When no information is available, we follow the traditional conservative earliest-convergence scheduling policy.
Once the predictor has enough information to make a more aggressive prediction, it generates assumptions about the prediction. The microarchitecture then keeps checking dynamically whether the assumptions actually hold true in the near future. If assumptions turn out to be wrong, the prediction will be reconsidered by changing back priorities to conservative. Such promise-based speculation policies can address the thread runaway problem by fixing a bound on the worst-case performance degradation of an aggressive scheduling policy against the conservative baseline.

Accurate thread convergence policies will enable dynamic vectorization to adapt to application characteristics dynamically. They will both improve performance and simplify programming of many-core architectures by alleviating the need for advanced code tuning by expert programmers.

Bibliographie

[1] Nicolas Brunie, Sylvain Collange, and Gregory Diamos. Simultaneous Branch and Warp Interweaving for Sustained GPU Performance. In 39th Annual International Symposium on Computer Architecture (ISCA), pages 49 – 60, Portland, OR, United States, 2012.
[2] Sajith, Kalathingal, Sylvain Collange, Bharath N. Swamy, and André Seznec. Dynamic inter-thread vectorization architecture: extracting DLP from TLP. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 18-25. IEEE, 2016.
[3] Jamison D. Collins, Dean M. Tullsen, and Hong Wang. Control flow optimization via dynamic re-convergence prediction. In IEEE/ACM International Symposium on Microarchitecture, pages 129–140. IEEE Computer Society, 2004.
[4] Wilson WL Fung, Ivan Sham, George Yuan, and Tor M Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Transactions on Architecture and Code Optimization (TACO), 6(2):7, 2009.
[5] André Seznec and Pierre Michaud. A case for (partially) TAgged GEometric history length branch prediction. Journal of Instruction Level Parallelism, 8:1–23, 2006.

Début des travaux: 
09/2017
Mots clés: 
Computer architecture, GPU, microprocessor, SIMT, divergence
Lieu: 
IRISA - Campus universitaire de Beaulieu, Rennes