anim les activités scientifiques  


formation par la recherche / formation doctorale / enseignement, stages / sujets de thèses


Sujet de thèse proposé à l'Irisa pour la rentrée 2001-2002


anim Scheduling mechanisms for wide-issue dynamic order execution superscalar processors

Localisation : Irisa, Rennes

Equipe(s) : Caps

Responsable : André Seznec / Pierre Michaud (tél. direct : 02 99 84 73 36, email :

The path for performance on superscalar processors should combine instruction level parallelism with high clock frequency. However, constructors are facing obstacles for implementing both features. For instance, a very high clock frequency is enabled through very deep pipeline on Intel Pentium 4 while Alpha EV8 will feature a very wide instruction level parallelism and SMT parallelism with a less aggressive clock. Among the mechanismthat may prevent a very high clock frequency is the scheduling logic (wake-up logic+ selection logic). The scheduling logic determines which instructions have become fireable in the instruction window. The complexity of the scheduling logic increases with the number of instructions that can be issued on each cycle, but also with the depth of the window of instructions waiting for execution. More precisely the hardware complexity of the wake-up logic is directly proportionnal to the number of inputs and outputs of an instruction as well as with the number of instructions in the window: an entry in the instruction window must check the availability of all its entries, the availability of every result must be forwarded through the whole instruction window. It is critical to perform the loop (wake-up logic-selection logic) in a single cycle since this is required to enable direct chaining of a dependent instruction after a one-cycle latency instruction.

We propose to investigate new solutions for reducing the complexity of the scheduling logic for wide-issue dynamic execution superscalar processors.

Two orthogonal directions will be explored. First, current implementations of selection logic assume instructions (or micro-operations) with 2 operands and 1 result. One way to reduce the complexity of the wake-up logic would be to use a larger granularity for firing instructions. A group of dependent instructions might share a single entry in the "instruction group window". The entries would have to be enlarged to support groups of instructions with a set of N operands and M results, but the number of entries would be reduced. A group of instructions would become fireable as a whole. The first advantage of such a technic would be the possibility to eliminate some intermediate result write and read in the register file (and also the propagation of its availability to the wake-up logic). Therefore it might reduce the overall complexity of the scheduling logic. The second advantage of this technic is to releave part of the stress on the loop (wake-up logic-selection logic) since some of the direct chaining of dependent instruction after a one-cycle latency instruction can be hiden in instruction groups. On the other hand, instructions would have to be grouped to efficiently use the "instruction group window". It should be noticed that a trace (or decoded) cache would be the right place to handle such grouping, moreover IA32 ISA should be a good ISA for naturally allow this grouping. Using an "instruction group window" also raises the issue of instruction validation: instructions have to be validated in groups. The study will have to check whether and when this approach is valid. Particularly, the size of the groups of instructions and the number of input operands and results will have to be determined. Strategies for building instruction groups at decode time (trace build time ) will be explored.

Second, we have recently proposed a prescheduling technic [1] to reduce the width of the instruction window that should be searched for fireable instructions. It consists of computing an approximate execution cycle for each instruction and to send instructions to reservation stations based on this predicted schedule: instructions do not compete in the selection logic before they are likely to be fireable. While preliminary results are encouraging [1], some parameters are still to be investigated. Directions for further researches include the use of cache behavior for better prescheduling (including taking in account hit/miss prediction) , using the effective scheduling for computing future prescheduling ( a la trace cache) and applying prescheduling on a clustered architecture. Moreover prescheduling might also be appplied to groups of instructions (as those that could be built above) instead of individual instructions. This might allow to reach better prescheduling heuristics than greedy execution.

Finally, combinations of the techniques described above will have to studied to explore the best tradeoffs between instruction window size, hardware complexity, critical path.

[1] P. Michaud, A. Seznec, "Data-flow Prescheduling for Large Instruction Windows in Out-of-order Processors", Proceedings of HPCA-7, Jan. 2001



dernière mise à jour : 12.03.2001

-- english version --- --- ©copyright --