PhD overview



Research topic


Framework

My Ph.D. work focuses on memory layout optimization techniques for data and instructions, aimed at improving the performance of the memory hierarchy.

With current technological trends, CPU speeds increase faster than memory speeds. Caching hierarchies are inserted between the CPU and memory to compensate for low memory speeds. As the speed gap widens, caching must become more and more efficient.

Architects build ever more complex caching hardware to minimize cache misses and maximize bus bandwidth: multi-level cache hierarchies, non-blocking caches, and new DRAM interfaces (EDO-DRAM, SDRAM, RAMBUS...), among other solutions. The problem is likely to worsen, because DRAM storage will inherently remain slow: as program memory consumption grows by 50% to 100% every year, DRAM architects must prioritize storage capacity over speed.

It has been shown that cache space is greatly under-exploited. This means that the effort spent to build larger, more efficient caches is partly wasted. To correct this problem, we must adapt the program's memory access patterns to fit the memory hierarchy. Two solutions are possible: code transformations to change the access pattern, and data layout techniques to reorganize data in memory.

This problem is exacerbated in shared-memory machines. Such machines typically run scientific applications, so locality optimization techniques have been studied for numerical arrays. Loop transformation techniques, which adapt the memory reference pattern to the machine's memory organization, have been studied extensively. Data layout techniques, on the other hand, have received little attention, because their scope is quite limited: array memory layout constraints are very stringent.

General applications most often use heterogeneous data types, so numerical array optimizations are of little use. Such data is not necessarily organized in arrays, so code transformations developed for arrays cannot be applied, and memory accesses are more irregular. On the other hand, layout constraints are much weaker, so data layout techniques have much greater potential to improve locality. That is why we study data layout techniques to improve the cache performance of general applications, a topic that has so far received little research attention.


Advantages of software cache optimization

By modifying the addresses of data in memory, we implicitly control the replacement behavior of the cache. Since the cache is driven by data addresses, no special cache-control instructions are needed: there is no code bloat, and the layout solutions can be used on current processors.

Memory blocks can be reorganized in memory to reduce conflict misses in direct-mapped caches: blocks frequently used simultaneously can be mapped to distinct cache lines. The difference between data layout and a fully-associative cache is that while a fully-associative cache dynamically selects blocks to replace using fragmentary information on past references, data layout uses more comprehensive, but static, information about the program to determine which line a block must go to.
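As an illustrative sketch (the cache parameters and addresses below are assumptions, not taken from the thesis experiments), a tiny direct-mapped cache model shows how relocating one block turns systematic conflicts into hits:

```python
# Hypothetical sketch: a minimal direct-mapped cache model showing how
# changing a block's address removes conflict misses.

LINE_SIZE = 32      # bytes per cache line (assumed)
NUM_SETS = 1024     # 32 KB direct-mapped cache (assumed)

def count_misses(trace):
    """Simulate a direct-mapped cache over a byte-address trace."""
    tags = {}
    misses = 0
    for addr in trace:
        s = (addr // LINE_SIZE) % NUM_SETS   # set the block maps to
        tag = addr // (LINE_SIZE * NUM_SETS)
        if tags.get(s) != tag:               # not resident: a miss
            misses += 1
            tags[s] = tag
    return misses

# Two blocks used alternately; 0x8000 = 32 KB apart, so both map to
# the same set and evict each other on every access.
a, b = 0x00000, 0x08000
conflicting = [a, b] * 100

# Relocating b by one line makes the two blocks map to distinct sets.
relocated = [a, b + LINE_SIZE] * 100

print(count_misses(conflicting))  # 200: every access misses
print(count_misses(relocated))    # 2: only the two cold misses remain
```

The static relocation plays the role the replacement policy would otherwise have to play dynamically.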

Caches access memory in blocks. Data layout, on the other hand, can redistribute words between different memory blocks. By reorganizing words in memory, layout techniques can reduce capacity misses.

To perform data layout, it is necessary to capture the program's locality properties (i.e. understand how the program uses the data in memory). Layout optimizations must also take into consideration data layout constraints generated by the compiler (segregation of data allocated in the stack, the heap or statically) and the language (declaration of arrays, structures, classes or scalars).


Miss classification

The most widely known cache miss classification is M. D. Hill's, called the 3C's miss classification. Sugumar refined it by splitting conflict misses into replacement and mapping misses for set-associative caches. These classifications were developed to analyze hardware performance; we refine them to suit software layout optimizations, splitting capacity misses into fundamental and distribution misses.

We change the way cold and capacity misses are computed from Hill's classification: cold misses are measured with an infinite cache, and capacity misses with a fully-associative cache using the Min replacement policy.

The full classification is shown below. We use the following abbreviations: T is the number of misses of the real cache; Samin and Famin are the misses of set-associative and fully-associative caches using the Min replacement policy; Famin1 is Famin measured with a one-byte line; B is the number of distinct bytes referenced; LS is the line size; and R is the total number of references.

Cold misses: minimum number of misses possible with an infinite cache.
    Miss ratio: (B / LS) / R

Capacity misses: minimum number of misses possible with a perfect cache (a fully-associative cache with Min policy).
    Fundamental misses: minimum number of misses when data layout is optimal.
        Miss ratio: ((Famin1 / LS) - (B / LS)) / R
    Distribution misses: extra capacity misses due to sub-optimal data layout.
        Miss ratio: (Famin - (Famin1 / LS)) / R

Conflict misses: extra misses caused by a sub-optimal utilization of cache space.
    Mapping misses: extra misses caused by poor block layout making blocks interfere in the same set.
        Miss ratio: (Samin - Famin) / R
    Replacement misses: extra misses caused by a sub-optimal hardware replacement policy.
        Miss ratio: (T - Samin) / R
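As a minimal sketch, the miss classes above can be computed from simulator counts as follows. The function and variable names, and the example numbers, are illustrative assumptions, not measured data:

```python
# Hypothetical sketch: deriving the refined miss classes from
# trace-driven simulation counts.

def classify(T, Samin, Famin, Famin1, B, LS, R):
    """Return the miss ratio of each class.
    T      - misses of the real cache
    Samin  - misses of a set-associative cache with Min replacement
    Famin  - misses of a fully-associative cache with Min replacement
    Famin1 - Famin measured with a one-byte line (optimal-layout bound)
    B      - number of distinct bytes referenced
    LS     - line size in bytes
    R      - total number of references
    """
    cold = (B / LS) / R
    fundamental = (Famin1 / LS - B / LS) / R
    distribution = (Famin - Famin1 / LS) / R
    mapping = (Samin - Famin) / R
    replacement = (T - Samin) / R
    # Sanity check: the five classes partition the total miss ratio.
    total = cold + fundamental + distribution + mapping + replacement
    assert abs(total - T / R) < 1e-12
    return {"cold": cold, "fundamental": fundamental,
            "distribution": distribution, "mapping": mapping,
            "replacement": replacement}

# Illustrative (made-up) counts for a 32-byte line:
ratios = classify(T=50_000, Samin=40_000, Famin=30_000,
                  Famin1=640_000, B=320_000, LS=32, R=1_000_000)
print(ratios["cold"])  # 0.01
```

Each class isolates one cause of misses, so comparing the terms directly shows where layout optimization can pay off.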

The figure below shows how misses are distributed for the SpecInt'95 Go benchmark on a 32K cache. Hill's classification "overestimates" capacity misses (a); this is corrected by using the Min policy (b). Sugumar's classification is shown in (c), and our classification in (d). Besides the lowest 0.39%, all miss classes can be influenced by data layout, whereas hardware solutions can only improve conflict misses, leaving the bottom 1.89% of misses. Another sample analysis shows the miss ratio as a function of line size. More results can be seen in 3D on my Mathematica/LiveGraphics3D page (uses Java).

[Figure: miss classification for the Go benchmark on a 32K cache (18KB)]

Large programs tend to generate many conflict misses and many distribution misses; data layout techniques can reduce both. Replacement misses are reduced indirectly: with fewer capacity misses, the replacement policy is solicited less often, so it has fewer chances to make mistakes. Block layout can also prevent all types of conflict misses.

People working on loop transformations developed a different miss classification, because the 3C classification was not suited to software optimizations (the capacity miss ratio is used as a reference for the minimum attainable miss ratio, which no longer holds under software optimization). However, their classification is only valid for arrays referenced within loop nests. Ours provides a single miss classification valid for both software and hardware cache optimization analysis.


Measuring locality

To evaluate locality between references to distinct data elements, we developed a correlation metric.

In a nutshell, we evaluate the correlation between two data elements by counting the number of references to these elements made while they are used together, minus the number of references made while they are used separately. Two elements are used together if both are referenced within a specified time window.

A high correlation indicates that these elements must fit together in the cache without conflicting (either stored in the same memory block or in memory blocks mapping to different cache lines).
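The metric above can be sketched as follows; the window size and the exact bookkeeping are assumptions for illustration, not the thesis definition:

```python
# Hypothetical sketch of the window-based correlation metric: references
# made while the other element was seen recently count as "together",
# all other references count as "separate".

def correlation(trace, x, y, window=8):
    """Correlation of elements x and y over a reference trace."""
    last = {}                      # most recent reference time per element
    together = separate = 0
    for t, elem in enumerate(trace):
        if elem not in (x, y):
            continue
        other = y if elem == x else x
        if other in last and t - last[other] <= window:
            together += 1          # the other element was used recently
        else:
            separate += 1
        last[elem] = t
    return together - separate

trace = ["a", "b", "a", "b", "c", "c", "a", "b"]
print(correlation(trace, "a", "b", window=2))  # 2 (4 together - 2 separate)
```

A large positive value flags a pair worth co-locating; a negative value flags elements that gain nothing from sharing a block.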


Scalar data layout

Data elements are indivisible and referenced by the CPU as a whole; they can be bytes, words, doubles... We have developed a data layout heuristic capable of selecting an efficient memory layout for data elements on direct-mapped caches, assuming no compiler layout constraints.

[Figure: performance of scalar layout (10KB)]

The above figure shows the miss ratios obtained by changing the addresses referenced in program traces using the scalar heuristic. Although the results show a noticeable cache miss reduction compared to the original layout, the technique is insufficient to optimize real applications: the heuristic only works for scalars free of layout constraints (which is why we used program traces), and few scalars are usually referenced in memory, because most can be assigned to registers. We must therefore propose layout techniques for all data types encountered in applications.
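For illustration only, here is a hypothetical greedy packing in the spirit of such a heuristic. It is not the thesis heuristic itself, and the correlation values and block capacity are made up:

```python
# Hypothetical sketch: greedily pack scalars into memory blocks so that
# the most correlated pairs end up sharing a block.

def greedy_layout(elements, corr, block_capacity=4):
    """corr maps frozenset({x, y}) -> correlation value.
    Returns a list of blocks (each block is a list of elements)."""
    block_of = {}
    blocks = []
    # Consider pairs from most to least correlated.
    for pair in sorted(corr, key=corr.get, reverse=True):
        x, y = tuple(pair)
        if x in block_of and y in block_of:
            continue                          # both already placed
        if x in block_of or y in block_of:
            placed, new = (x, y) if x in block_of else (y, x)
            b = block_of[placed]
            if len(blocks[b]) < block_capacity:
                blocks[b].append(new)         # join the partner's block
                block_of[new] = b
            else:
                block_of[new] = len(blocks)   # partner's block is full
                blocks.append([new])
        else:
            block_of[x] = block_of[y] = len(blocks)
            blocks.append([x, y])             # open a fresh block
    for e in elements:                        # leftover, uncorrelated scalars
        if e not in block_of:
            block_of[e] = len(blocks)
            blocks.append([e])
    return blocks

corr = {frozenset({"i", "j"}): 10, frozenset({"i", "sum"}): 7,
        frozenset({"tmp", "flag"}): 1}
print(greedy_layout(["i", "j", "sum", "tmp", "flag"], corr))
```

Here "i", "j" and "sum" end up in one block and "tmp" and "flag" in another, so correlated scalars share a cache line instead of competing for one.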


Array layout

Data layout techniques for arrays are quite limited and have already been investigated: the compiler must always be able to compute the address of an array cell from its indexes, so cell layout constraints are very tight.

Two layout solutions are commonly available: leading size padding (LSP), which changes an array's leading dimension, and inter-array padding (IAP), which shifts array base addresses.

[Figure: interferences depending on leading dimension]

Padding is often used to fix ping-pong effects that occur with arrays whose dimensions are powers of two (see the figure above). LSP can be used to perform a coarse optimization: when the leading size is set to differ from a power of two, memory references are randomized both within an array and between arrays (array sizes change, so the base addresses of contiguously allocated arrays change too). IAP can take advantage of power-of-two dimensions to eliminate cross-interference: base addresses are shifted so that, instead of systematically conflicting, references to distinct addresses systematically hit distinct cache lines. By combining IAP and LSP intelligently, we can also eliminate self-interference.
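A minimal simulation illustrates the effect (the cache and array parameters are assumptions): walking a column of a matrix whose leading dimension is a power of two maps every row to the same cache set, while a padded leading dimension spreads the rows across sets:

```python
# Hypothetical sketch: leading size padding (LSP) on a direct-mapped
# cache, walking one matrix column twice.

LINE_SIZE, NUM_SETS = 32, 1024    # 32 KB direct-mapped cache (assumed)
ELEM = 8                          # 8-byte doubles (assumed)

def two_pass_column_misses(leading_dim, rows=64):
    """Misses while walking a column of `rows` elements twice."""
    tags = {}
    misses = 0
    for _ in range(2):            # second pass tests temporal reuse
        for i in range(rows):
            addr = i * leading_dim * ELEM
            s = (addr // LINE_SIZE) % NUM_SETS
            tag = addr // (LINE_SIZE * NUM_SETS)
            if tags.get(s) != tag:
                misses += 1
                tags[s] = tag
    return misses

# 4096 doubles per row = exactly the cache size: the whole column
# self-interferes in a single set, so the second pass misses too.
print(two_pass_column_misses(4096))  # 128
# Padding the leading dimension by 4 doubles (one line) spreads the
# rows over distinct sets, and the second pass hits everywhere.
print(two_pass_column_misses(4100))  # 64
```

The padded layout wastes a few bytes per row but halves the misses of this access pattern.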

Finer padding optimization can also be used to spread cache misses more uniformly over the loop nest execution. This is useful for non-blocking caches, to prevent saturation of the pending-miss management mechanism. It is also possible to interlace arrays of the same dimension that interfere within a loop nest. This is rarely done, because the layout is static while arrays are likely to have different access patterns in different loop nests.


Data structure layout

General applications often store data in data structures instead of arrays. Structures can be organized as linked lists, trees, or arrays of structures. The fields within a structure do not all have the same reference pattern: some fields are used frequently, while others are referenced only on rare occasions. We propose layout techniques to group frequently used fields together.
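A back-of-the-envelope sketch illustrates why grouping hot fields of consecutive instances helps (the field sizes and line size are assumptions):

```python
# Hypothetical sketch: counting the cache lines a traversal touches
# when only the "hot" fields of each structure instance are read.

LINE_SIZE = 32           # bytes per cache line (assumed)
HOT, COLD = 8, 56        # bytes of hot vs cold fields per instance (assumed)

def lines_touched(addresses):
    """Number of distinct cache lines covering the given byte addresses."""
    return len({a // LINE_SIZE for a in addresses})

n = 100
# Standard layout: each instance stores hot and cold fields contiguously,
# so every instance starts HOT + COLD bytes after the previous one.
standard = [i * (HOT + COLD) for i in range(n)]
# Interleaved layout: the hot fields of all instances are packed
# back to back, with the cold fields stored elsewhere.
interleaved = [i * HOT for i in range(n)]

print(lines_touched(standard))     # 100: one line per instance
print(lines_touched(interleaved))  # 25: four hot fields per line
```

With hot fields packed together, each fetched line carries four useful instances instead of one instance plus its cold padding.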

To support instance interleaving, we developed an optimized allocation library. Standard allocation libraries have often been found to have poor locality properties; ours was designed for good locality. Furthermore, it often consumes less memory than standard allocation libraries and generates far fewer system calls.

[Figure: speedups with structure layout (17KB)]

In our experiments, shown above, we found that data structure layout can provide noticeable speedups for large applications that make heavy usage of data structures, especially if their performance is hindered by the memory hierarchy. We are also investigating the combination of data structure layout and prefetching.


Instruction layout

Many hardware solutions are being investigated to obtain an uninterrupted flow of instructions, which is necessary to increase superscalar CPU width. Although the instruction stream has much better locality properties than the data stream, software instruction layout has been shown to be very effective, often yielding speedups of 10% or more.

The numerous studies on instruction layout propose techniques that can be summed up as shown below:

Many studies have been done on instruction layout alone. We therefore investigate coupling instruction layout with software instruction prefetch, to see whether such an approach can eliminate I-cache misses. To achieve this, we are building a tool that generates modified assembly code: we use Salto to parse and regenerate the assembly code of programs, and routines that remove fluff code and insert prefetch instructions are built on top of Salto's routines. All these optimizations are done statically, without any help from profiling, using only inter- and intra-procedural analysis.


Publications

The following references can be downloaded from the download page.

Accurate Data Layout Into Blocks May Boost Cache Performance.
Dan Truong, Francois Bodin and André Seznec
Workshop on Interaction between Compilers and Computer Architecture (Interract-2), San Antonio, Texas, Feb. 1, 1997

Improving Cache Behavior of Dynamically Allocated Data Structures.
Dan Truong, Francois Bodin and André Seznec
International Conference on Parallel Architectures and Compilation Techniques (PACT'98), Paris, France, Oct. 12-18 1998.

Considerations on Dynamically Allocated Data Structure Layout Optimization.
Dan Truong.
First Workshop on Profile and Feedback-Directed Compilation (PFDC-1), Paris, France, Oct. 12 1998.

Also available:

Ialloc, a dynamic allocation library for data structures written in C. It is often faster than malloc, and it facilitates data layout optimizations. The PACT'98 and PFDC-1 optimizations are performed using Ialloc and compared to the standard malloc libraries.
last modified July 10, 1998