Senior Researcher. Intel MRL IA32 Architecture Unit.
Intel Microcomputer Research Labs
5350 NE Elam Young Parkway
Hillsboro, OR 97124-6461, USA
phone: (503) 696-3857
fax: (503) 696-1442
Étude du parallélisme monolithique : cas du multiflot simultané
Doctoral Dissertation (in French) June 1997 : file.ps.gz
Several million transistors can already be integrated on a single
circuit, and the internal clock frequencies of microprocessors
are increasing steadily. The gap between the internal clock (on the chip)
and the external clock (on the motherboard) keeps growing, which makes
memory access times very long relative to the processor cycle. To exploit these
technological trends, several forms of parallelism will have to be developed and
integrated on the chip. Among the techniques that exploit instruction-level
parallelism and thread-level parallelism, simultaneous multithreading (SMT)
appears to be one of the most promising.
An SMT microprocessor executes several instruction streams simultaneously
in a shared superscalar pipeline, so that the pipeline is used to its
fullest. The simultaneous execution of several threads, however, imposes
new constraints at the architectural level, which are important to examine.
The work presented in this thesis shows that branch prediction
tables can be shared by different threads, whether the workload consists
of independent applications or of a single parallel program. However, having
a private return address stack per thread greatly improves prediction
accuracy. The memory hierarchy proved to be a far more critical subject.
The parameters of the memory hierarchy determine the number of threads
beyond which multithreading is no longer cost-effective. For the
best performance, it is particularly important to have associative first
level caches and small block sizes. However, contention on the second
level cache should limit the interest of multithreading to a few threads.
Lastly, we show that with only 4 threads, an architecture featuring simultaneous
multithreading can rely on simple in-order execution. The performance
gain brought by out-of-order execution is too small to justify
the implementation of complex mechanisms.
Branch prediction and simultaneous multithreading
Branch prediction strategies for superscalar architectures now achieve
more than 90% accuracy. We explore how the simultaneous use of prediction
tables by several threads affects branch prediction accuracy.
In particular, we try to characterize whether threads benefit
from sharing large prediction structures, both for multiprogrammed workloads
and for parallel applications. We also examine the usefulness of
providing one private Return Address Stack per active thread.
S. Hily, A. Seznec, "Branch Prediction and Simultaneous Multithreading",
25 pages, IRISA Report No 997, March 1996. Short paper appeared in PACT'96.
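The organization studied above (prediction tables shared by all threads, plus one private Return Address Stack per thread) can be illustrated with a minimal sketch. It is not the simulator used in the report: the class names, the 2-bit counter scheme, the table size, the stack depth and the event trace are all assumptions chosen for illustration.

```python
from collections import defaultdict


class SharedBimodalPredictor:
    """Table of 2-bit saturating counters shared by all threads (assumed scheme)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [1] * entries  # values 0..3, start at "weakly not taken"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


class ReturnAddressStack:
    """Small fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def push(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # drop the oldest entry when full
        self.stack.append(return_pc)

    def pop(self):
        return self.stack.pop() if self.stack else None


def count_correct_returns(events, private):
    """Count correctly predicted returns for a trace of (thread, kind, address).

    For a "call", address is the return address pushed on the stack.
    For a "ret", address is the actual return target.
    With private=False, every thread uses the same stack.
    """
    stacks = defaultdict(ReturnAddressStack)
    correct = 0
    for tid, kind, addr in events:
        ras = stacks[tid if private else 0]
        if kind == "call":
            ras.push(addr)
        elif ras.pop() == addr:
            correct += 1
    return correct


# Hypothetical trace: two threads whose calls and returns interleave.
events = [
    (0, "call", 0x100),
    (1, "call", 0x200),
    (0, "ret", 0x100),
    (1, "ret", 0x200),
]
print(count_correct_returns(events, private=True))
print(count_correct_returns(events, private=False))
```

In this toy trace, the interleaving of calls from two threads corrupts a single shared stack, whereas private stacks keep each thread's call history intact. The direction predictor, by contrast, is indexed only by the branch address, so all threads read and train the same counters.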
Memory hierarchy and simultaneous multithreading
Simultaneous multithreading (SMT) is an interesting way of maximizing performance
by enhancing processor utilization. We investigate issues involving the
behavior of the memory hierarchy with SMT. First, we show that ignoring
L2 cache contention leads to a strong overestimate of the performance one
can expect and may lead to incorrect conclusions. We then explore the impact
of various memory hierarchy parameters. We show that the number of supported
threads has to be set according to the cache size, that the L1 caches
have to be associative, and that small blocks have to be used. Consequently, the
hardware constraints on the design of memory hierarchies should limit the interest
of SMT to a few threads.
S. Hily, A. Seznec, "Standard Memory Hierarchy Does Not Fit Simultaneous
Multithreading", Proceedings of MTEAC'98 Workshop (in conjunction with
HPCA 4), Feb. 1998.
A longer version is available as "Contention on 2nd Level Cache May Limit
The Effectiveness of Simultaneous Multithreading", 22 pages, IRISA Report
No 1086, Feb. 1997.
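The interaction between thread count and L1 associativity described above can be illustrated with a toy cache model. This is only a sketch under stated assumptions, not the simulation methodology of the paper: the cache size, block size, LRU replacement policy and the access pattern (each thread repeatedly touching one block, with all blocks mapping to the same set) are hypothetical choices made to expose conflict misses.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, size_bytes, block_bytes, ways):
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = size_bytes // (block_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Access one address; return True on a hit, False on a miss."""
        block = address // self.block_bytes
        lines = self.sets[block % self.num_sets]
        self.accesses += 1
        if block in lines:
            lines.move_to_end(block)  # mark as most recently used
            return True
        self.misses += 1
        if len(lines) == self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block] = True
        return False


def interleaved_misses(ways, threads=2, rounds=100):
    """Misses when threads take turns touching blocks that share a set."""
    cache = SetAssociativeCache(size_bytes=1024, block_bytes=32, ways=ways)
    for _ in range(rounds):
        for tid in range(threads):
            # Hypothetical layout: thread bases are 64 KB apart, so every
            # thread's block maps to set 0 of this 1 KB cache.
            cache.access(tid * 0x10000)
    return cache.misses


for ways, threads in [(1, 2), (2, 2), (2, 4), (4, 4)]:
    print(ways, threads, interleaved_misses(ways, threads))
```

In this model, a direct-mapped cache misses on every access as soon as two threads alias to the same set, while a cache with at least as many ways as threads only takes the initial cold misses. Real workloads are far less adversarial, so the sketch shows the mechanism rather than the magnitude of the effect.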
In-order and out-of-order SMT models
Simultaneous multithreading (SMT) is a promising approach to deliver high
throughput from superscalar pipelines. In this paper, we show that when
executing 4 threads on an SMT processor, out-of-order execution yields only
small performance benefits over in-order execution. Thus, for application
domains where throughput is more important than peak performance
on a single application, SMT combined with in-order execution may be a
more cost-effective alternative than very aggressive out-of-order
superscalar processors or SMT with out-of-order execution.
S. Hily, A. Seznec, "Out-Of-Order Execution May Not Be Cost-Effective on
Processors Featuring Simultaneous Multithreading", IRISA Report No 1179,
March 1998.
All my teaching was for computer science students at the "Institut de
Formation Supérieure en Informatique et Communication" IFSIC/Université
1996-1997: ATER. Attaché Temporaire d'Enseignement et de Recherche.
Equiv. to visiting assistant professor (6 contact hours/week)
Computer Architecture (DIIC2 : ARA1, ARA2)
Compilation techniques (DIIC2 : CPL1; DIIC3 : CPL2)
Algorithms (DESSDC : ALG)
PROJET R.I.S.C. Master thesis (1991)
Design and implementation of a MIPSX-like microprocessor.
LE COPROCESSEUR OPAC : algorithmes compute-bound, architecture,
DEA (Diplôme d'Études Approfondies) en Informatique (1992).
Study of the OPAC operator, a coprocessor targeted to execute efficiently
Évaluation de cache. Rapport DRET-INRIA No 93.082 (1993)
Study and development of a software tool to extract automatically memory