Hardware volatile states in a modern microprocessor



Execution time of a short sequence of instructions and hardware volatile states

Parameters influencing the execution time of a sequence of instructions

In microprocessor systems built around modern superscalar processors, the precise number of cycles needed to execute a (very short) sequence of instructions depends on many internal states of hardware components inside the microprocessor, as well as on events external to the process.
 

Let us consider a simple sequence of instructions:
1.  read of the hardware clock counter
2.  conditional branch
3.  a load
4.  read of the hardware clock counter
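The shape of this sequence can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `time.perf_counter_ns` stands in for the hardware clock counter, interpreter overhead dominates what is measured, and the names `time_short_sequence` and `collect_samples` are hypothetical. It shows the structure of the measurement (clock read, branch, load, clock read), not cycle-accurate results.

```python
import time

def time_short_sequence(data, index):
    """Bracket a conditional branch and a load between two clock reads."""
    start = time.perf_counter_ns()   # 1. read the clock (stand-in for the cycle counter)
    if index & 1:                    # 2. conditional branch
        offset = 1
    else:
        offset = 0
    value = data[index ^ offset]     # 3. a load
    end = time.perf_counter_ns()     # 4. read the clock again
    return end - start, value

def collect_samples(n=1000):
    """Repeat the measurement with varying indices and keep the elapsed times."""
    data = list(range(4096))
    return [time_short_sequence(data, (i * 37) % len(data))[0] for i in range(n)]

samples = collect_samples()
print("min:", min(samples), "max:", max(samples), "distinct:", len(set(samples)))
```

Repeated runs typically report different minimum and maximum values, which is the kind of variation the rest of this section attributes to hardware state.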

The number of cycles for executing this short sequence will depend on:

   1. Correct or incorrect branch prediction (both direction and target)
   2. Hit or miss of the instructions in the instruction TLB
   3. Hit or miss of the instructions in the instruction cache
   4. Hit or miss of the data load in the data cache
   5. Hit or miss of the data load in the data TLB
   6. In case of a miss in one of those caches, hit or miss in the L2 cache

In addition to these binary states (present/absent or correct/wrong), the execution time of a sequence also depends on the precise status of every instruction in every stage of the execution pipeline. This status is very complex. For instance, on an in-order execution processor such as the Sun UltraSparc II, up to 4 instructions may be executed per cycle and the pipeline features 9 stages. On out-of-order execution processors such as the Compaq Alpha 21264 or the Intel Pentium 4, the status is even more complex, since more instructions can be in flight in the pipeline at the same time (up to 80 instructions on the Alpha 21264, more than 100 on the Pentium 4), and each instruction has its own status: register renaming, waiting for execution, and so on.

Moreover, modern superscalar processors feature numerous buffers that aim at optimizing performance, for instance write buffers, victim buffers and prefetch buffers. The response time of the memory hierarchy when servicing a miss depends on the status of all these buffers. In addition, the response time of the memory on an L2 cache miss depends on any conflicting event on the memory system or on the system bus.
 
 

Operating system invocations and unmonitorable hardware states

Any operating system invocation modifies the contents of the instruction and data caches, the translation buffers (TLBs), the L2 cache and the branch prediction structures.

For a Sun workstation featuring an UltraSparc II and running Solaris, we report here estimates of the minimum number of blocks or entries that are displaced from the data and instruction L1 caches, the L2 cache and the instruction and data TLBs by a single operating system interruption. Intuitively, this represents a lower bound on the perturbation introduced by the interruption. We also report the "minimum" cumulated perturbation over 100 consecutive interruptions. These numbers are reported for a non-loaded machine (no other heavy process running), since on a loaded machine more blocks will be evicted on average.
 

L1 data caches

The UltraSparc L1 data cache is 16 Kbytes and direct-mapped. It features 512 32-byte cache sectors. A miss fetches only 16 bytes; the second 16-byte block is fetched only on demand. The state of a sector location is therefore represented by the physical address of the data sector mapped onto it and the presence or absence of each of the two halves of the sector.
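The sector state described above can be modelled with a short Python sketch. It assumes, for simplicity, that the sector index is taken directly from the address passed in; the names `locate` and `DirectMappedSectorCache` are hypothetical and are not part of any measurement code.

```python
CACHE_BYTES = 16 * 1024                      # 16 Kbytes, direct-mapped
SECTOR_BYTES = 32                            # one cache sector
HALF_BYTES = 16                              # amount fetched on a miss
NUM_SECTORS = CACHE_BYTES // SECTOR_BYTES    # 512 sectors

def locate(addr):
    """Split an address into (sector index, half of the sector, tag)."""
    sector = (addr // SECTOR_BYTES) % NUM_SECTORS
    half = (addr % SECTOR_BYTES) // HALF_BYTES
    tag = addr // CACHE_BYTES
    return sector, half, tag

class DirectMappedSectorCache:
    """Per-sector state: the tag mapped onto it and the presence of each half."""

    def __init__(self):
        self.sectors = {}  # sector index -> (tag, [half 0 present, half 1 present])

    def access(self, addr):
        sector, half, tag = locate(addr)
        entry = self.sectors.get(sector)
        if entry is not None and entry[0] == tag:
            was_present = entry[1][half]
            entry[1][half] = True            # the missing half is fetched on demand
            return "hit" if was_present else "half-miss"
        # Different tag (or empty sector): the previous contents are displaced.
        self.sectors[sector] = (tag, [half == 0, half == 1])
        return "miss"
```

In this model, two addresses that are 16 Kbytes apart map onto the same sector, so touching one displaces the other; that is the kind of displacement counted in the figures reported here.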

On a non-loaded machine, most operating system calls touch about 80-200 data cache sectors (with a peak around 100-110 cache sectors), while, depending on the runs, 1-10% of the operating system calls displace almost all the blocks from the cache. For 100 consecutive interruptions, the number of displaced blocks always exceeded 11,500 in our experiments.

L1 instruction cache and the conditional branch predictor

The 16-Kbyte instruction cache on the UltraSparc is 2-way set-associative and features 32-byte cache blocks. On the UltraSparc, the branch predictor is incorporated in the I-cache: a 2-bit counter is associated with every pair of instructions, and a prediction of the address of the next 4-instruction block is associated with every 4-instruction group.

The state of a cache set can be represented by the ordered set of the addresses of the instruction blocks mapped onto it and the associated branch prediction information. An operating system call will flush part of the I-cache, and will therefore also flush part of the branch prediction information.
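The amount of prediction state tied to the I-cache follows from the geometry given above. The sketch below derives it in Python, assuming 4-byte instructions (SPARC instructions are a fixed 32 bits wide); the derived totals are arithmetic on the description above, not documented figures, and the function name `predictor_state_lost` is hypothetical.

```python
ICACHE_BYTES = 16 * 1024
BLOCK_BYTES = 32
WAYS = 2
INSTRUCTION_BYTES = 4        # assumption: fixed 32-bit SPARC instructions

blocks = ICACHE_BYTES // BLOCK_BYTES                       # 512 blocks
sets = blocks // WAYS                                      # 256 sets of 2 blocks
instructions_per_block = BLOCK_BYTES // INSTRUCTION_BYTES  # 8 instructions
counters_per_block = instructions_per_block // 2           # one 2-bit counter per pair
next_fields_per_block = instructions_per_block // 4        # one next-address field per group of 4

def predictor_state_lost(blocks_displaced):
    """Prediction state flushed along with a given number of I-cache blocks."""
    return {
        "two_bit_counters": blocks_displaced * counters_per_block,
        "next_address_predictions": blocks_displaced * next_fields_per_block,
    }

print(predictor_state_lost(250))
```

Under these assumptions, each displaced 32-byte block carries four 2-bit counters and two next-address predictions with it.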

We measured that, on a non-loaded UltraSparc machine, most operating system calls displace around 250 32-byte blocks of instructions, while 100 consecutive operating system calls displace at least 30,000 blocks.

TLBs

The UltraSparc II features a data TLB and an instruction TLB. Both TLBs have 64 entries, are fully associative and use a Not Last Used replacement policy. The global state of a TLB can be represented by the set of the addresses of the pages it maps and the state of the logic that implements the replacement policy.

We measured experimentally that, on a non-loaded machine, every operating system invocation displaces a significant number of data TLB entries (at least 16, and 52 on average), but only a few instruction TLB entries (6 on average). For 100 consecutive operating system invocations, the minimum cumulated number of displaced entries always exceeded 4,500 for the data TLB, but only 600 for the instruction TLB.

L2 caches

The UltraSparc II processor is used in conjunction with a 1-Mbyte L2 cache featuring 64-byte blocks.

In the vast majority of cases, an operating system invocation displaced between 850 and 950 blocks. The minimum cumulated number of displaced blocks for 100 operating system invocations always exceeded 95,000.
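To put the reported counts in proportion to the size of each structure, the following Python sketch divides them by the capacities given above. The per-invocation values for the L1 data cache (105) and the L2 cache (900) are midpoints of the reported ranges, chosen here purely for illustration; the dictionary `STRUCTURES` and the function `displacement_summary` are hypothetical names.

```python
# name -> (capacity in blocks or entries,
#          typical number displaced per invocation,
#          reported minimum displaced over 100 invocations)
STRUCTURES = {
    "L1 data cache":        (512, 105, 11_500),   # 105: midpoint of the 100-110 peak
    "L1 instruction cache": (512, 250, 30_000),
    "data TLB":             (64, 52, 4_500),
    "instruction TLB":      (64, 6, 600),
    "L2 cache":             (1024 * 1024 // 64, 900, 95_000),  # 900: midpoint of 850-950
}

def displacement_summary(structures):
    """Return, per structure, (fraction displaced per invocation,
    cumulated displacement over 100 invocations as a multiple of capacity)."""
    return {
        name: (per_call / size, per_100_calls / size)
        for name, (size, per_call, per_100_calls) in structures.items()
    }

for name, (fraction, turnovers) in displacement_summary(STRUCTURES).items():
    print(f"{name:22s} {fraction:6.1%} per invocation, "
          f"{turnovers:5.1f}x capacity over 100 invocations")
```

Read this way, a single invocation displaces a large share of the data TLB and roughly half of the instruction cache, but only a small share of the L2 cache, whose capacity is much larger.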

Summary:
On a Sun workstation featuring an UltraSparc II and running Solaris, the five storage structures considered here lose a significant amount of volatile, non-architectural hardware information on each operating system invocation.

While the numbers presented here are only valid for this platform, the same conclusion holds for other processors and other operating systems for PCs and workstations. Moreover, other processors (e.g., Alpha 21264, Pentium III) feature more complex branch prediction mechanisms that are even more affected by the operating system than those of the UltraSparc II.
 
