What Every Programmer Should Know About Memory by Ulrich Drepper - 3

Bypassing Cache

It is possible to bypass the cache when reading or writing. This is useful when data written to memory will not be used again soon, e.g. in a logging library. In such cases it is best to write directly to main memory, or to disk (using memory mapped files), without displacing useful data from the caches.

Such writes are known as Non-Temporal Access (NTA). Intel Streaming SIMD Extensions 2 (SSE2) Intrinsics provide this functionality.
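
As a sketch (not code from Drepper's text), the SSE2 streaming-store intrinsics can be used like this; the function name, buffer and fill value are illustrative, and the buffer is assumed to live in ordinary write-back memory:

  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stddef.h>

  /* Fill a log buffer with non-temporal stores so the written data
     bypasses the caches instead of evicting useful cache lines. */
  void fill_log(int *buf, size_t n, int value)
  {
    for (size_t i = 0; i < n; ++i)
      _mm_stream_si32(&buf[i], value);  /* non-temporal (NTA) store */
    _mm_sfence();  /* order the streaming stores before later accesses */
  }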

What are Intrinsics?

Intrinsics are assembly-coded functions that let you use C/C++ function calls and variables in place of assembly instructions. They are expanded inline, eliminating function-call overhead. While providing the same benefit as inline assembly, intrinsics improve code readability, assist instruction scheduling, and reduce the amount of debugging needed. They also provide access to instructions that cannot be generated using the standard constructs of the C and C++ languages.

Todo - read more about Intrinsics at https://software.intel.com/en-us/node/523351

Optimizing L1 Data Cache Access

Drepper compares several implementations of matrix multiplication. The first is the standard approach: iterate over each row of the first matrix and multiply each element with the elements of the corresponding column of the second matrix. The second approach first transposes the second matrix and then iterates over each row of the first matrix, multiplying each element with the elements in the rows of the transposed matrix. This approach requires additional memory for the transposed matrix.

Since iterating over rows is sequential memory access, the L1d hit rate is much higher, making the second implementation roughly 76% faster than the first.
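
A minimal sketch of the transposed approach (the function signature and the names N, res, tmp, mul1, mul2 are illustrative, not Drepper's exact code; res is assumed to be zero-initialized):

  #include <stddef.h>

  void matmul_transposed(size_t N, double res[N][N], double tmp[N][N],
                         double mul1[N][N], double mul2[N][N])
  {
    /* tmp = transpose of mul2, so the innermost loop below walks both
       operands row-wise, i.e. sequentially in memory. */
    for (size_t i = 0; i < N; ++i)
      for (size_t j = 0; j < N; ++j)
        tmp[i][j] = mul2[j][i];

    for (size_t i = 0; i < N; ++i)
      for (size_t j = 0; j < N; ++j)
        for (size_t k = 0; k < N; ++k)
          res[i][j] += mul1[i][k] * tmp[j][k];
  }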

The third implementation does not transpose the second matrix; instead it works on sub-matrices sized to whole cache lines, so entire cache lines are consumed rather than individual elements. This gives a further 6.1% improvement over the second implementation and does not require any additional memory.

Since cache line sizes vary from processor to processor, Drepper shows how to use getconf to retrieve the cache line size at build time and turn it into a compile-time constant.

To list all system configuration variables -
getconf -a

To retrieve the L1d cache line size and pass it in as the build-time constant CLS -
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) ...
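
A simplified sketch of the cache-line-blocked version, assuming CLS is defined at build time as above and N is a multiple of SM (Drepper's actual code is further optimized with restricted pointers and pointer arithmetic):

  #include <stddef.h>

  #define SM (CLS / sizeof(double))   /* matrix elements per cache line */

  void matmul_blocked(size_t N, double res[N][N],
                      double mul1[N][N], double mul2[N][N])
  {
    /* Work on SM x SM sub-matrices so each loaded cache line is used
       completely before it is evicted. */
    for (size_t i = 0; i < N; i += SM)
      for (size_t j = 0; j < N; j += SM)
        for (size_t k = 0; k < N; k += SM)
          for (size_t i2 = i; i2 < i + SM; ++i2)
            for (size_t k2 = k; k2 < k + SM; ++k2)
              for (size_t j2 = j; j2 < j + SM; ++j2)
                res[i2][j2] += mul1[i2][k2] * mul2[k2][j2];
  }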

A fourth approach using SIMD (Single Instruction, Multiple Data) instructions vectorizes the matrix multiplication and improves the performance by 10 times compared to implementation #1.

The intent is to use all the data in a cache line before it gets evicted, thus saving on cache misses.

When data structures are large enough to span multiple cache lines, developers need to be cognizant of padding: design and order the elements within the structure so that there are no holes and the structure fits into the smallest possible number of cache lines.

Note - Move the most frequently used elements to the top of the structure. If possible, access the elements of the data structure in the order of their definition.

Each native/fundamental data type has its own alignment requirement. For structures, the largest alignment requirement among the members determines the alignment requirement of the structure. Memory returned by malloc() is aligned only for the most demanding standard type; a different alignment (e.g. to match the cache line size) can be obtained by calling posix_memalign(), or the structure can be decorated with __attribute__((aligned(64))) in its declaration.

When allocating arrays of structures, if the individual elements are aligned to the cache line size, the cost of sequential access (array iteration) is much lower than with unaligned memory.
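
A minimal sketch of both options (the struct fields and the 64-byte cache line size are assumptions for illustration):

  #include <stdlib.h>

  /* Every object of this type starts on a 64-byte boundary. */
  struct record {
    double value;
    long   id;
  } __attribute__((aligned(64)));

  /* Alternatively, request cache-line-aligned memory explicitly. */
  struct record *alloc_records(size_t n)
  {
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(struct record)) != 0)
      return NULL;   /* allocation failed */
    return p;
  }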

Drepper discusses an interesting structure -
struct order
{
  double price;
  bool paid;
  char buyerName[10];
  long buyerId;
};
If a frequently running job calculates the outstanding payments by summing up the price field of many order structures, then on every price access the buyer name and id fields are loaded into the cache but never used. This can be avoided if these two fields are moved to a separate structure and referenced through a pointer in the order struct.

This advice goes against the principles of good design, which call for grouping related fields together in a single structure; however, the performance benefits can outweigh the design benefits.
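
A sketch of the suggested split (the order_details name and the summing loop are illustrative; the field names follow Drepper's example above):

  #include <stdbool.h>
  #include <stddef.h>

  /* Rarely used fields live in a separate allocation. */
  struct order_details {
    char buyerName[10];
    long buyerId;
  };

  struct order {
    double price;                    /* hot field, scanned by the summing job */
    bool paid;
    struct order_details *details;   /* cold fields behind a pointer */
  };

  double outstanding(const struct order *orders, size_t n)
  {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
      if (!orders[i].paid)
        sum += orders[i].price;      /* buyer data is never pulled into the cache */
    return sum;
  }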

Optimizing L1 Instruction Cache Access

Code execution is linear until a jump is encountered. Jump instructions disrupt optimizations the processor relies on, e.g. branch prediction and the instruction pipeline. The disruption occurs either because the jump target cannot be determined statically, or because, even when it can, the instruction fetch misses all caches and takes a long time to load from main memory. These are referred to as execution stalls.

Processors attempt to predict branches using their Branch Prediction (BP) units, which use static and dynamic rules to predict the next set of instructions and initiate loading them into the cache. Getting instructions into the cache early matters because they must be decoded first: the L1 instruction cache stores instructions in decoded form, not in the raw byte form in which they are read from memory.

Todo - read more about BP units.

L1i cache performance can be improved by keeping the code footprint as small as possible, generating code that executes linearly with as few jumps as possible (i.e. without bubbles), and aligning the code when it makes sense.

Compilers can perform various optimizations to obtain some of these benefits. GCC offers the -O2 and -O3 flags to optimize for speed at different levels, and -Os to optimize for code size.

Function inlining removes jump instructions. However, inlining increases the code size, which can hurt performance if the code no longer fits into the L1i cache. Inlining should therefore be weighed per scenario: for example, if a function is likely to be called many times within a short time frame, it may be best not to inline it, since after the first call its code will already reside in the L1i cache for subsequent calls, whereas inlining might have grown the code enough that it is evicted before those calls are made.
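
As a sketch, GCC's function attributes allow this trade-off to be controlled per function rather than globally (the function names are placeholders):

  /* Force inlining of a tiny, hot helper. */
  static inline __attribute__((always_inline)) int add_one(int x)
  {
    return x + 1;
  }

  /* Keep a rarely used function out of line so it does not bloat the hot path. */
  static __attribute__((noinline)) void handle_rare_error(void)
  {
    /* error handling */
  }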

When one of the two or more code paths of an if-else block is much more likely to be executed, the compiler can place that code path linearly so that in the majority of cases the fetched cache lines are actually used. GCC provides __builtin_expect for this; Drepper's code sample -
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)

if (likely(a > 1)) {
  /* common case */
} else {
  /* rare case */
}
The -freorder-blocks flag enables this optimization. It is enabled when the -O2 option is used, but disabled with -Os.

Intel Core 2 has a Loop Stream Detector (LSD) feature that detects loops which match these criteria -
- have no more than 18 instructions, none of which is a call to a subroutine
- require up to 4 decoder fetches of 16 bytes
- have at most 4 branch instructions
- are executed more than 64 times
Such a loop is locked in the instruction queue and is quickly available when it is used again. This is useful for small inner loops which are entered many times through an outer loop.

Todo - read more on LSD.

Alignment of instructions means placing them at the beginning of a cache line, so that prefetching of the cache line is most effective. If an instruction sits at the end of a cache line, the next cache line must be read as well, which could itself be a cache miss.

Instructions are easy to align when the code is written in assembly. For compiled code, gcc provides compiler flags for this -
- -falign-functions=N instructs the compiler to align all functions to the next power-of-two boundary greater than N, i.e. a gap of up to N bytes may be created.
- -fno-align-functions turns off function alignment; this is useful for code that is executed rarely, e.g. library code.
- -falign-jumps=N aligns the beginning of basic blocks that are reached only through jumps.
- -falign-loops=N aligns the beginning of loops.
- -falign-labels aligns every single label in the code, i.e. the beginning of each basic block.

Note - instruction decoding happens in 16-byte blocks, which is smaller than the cache line size, and a decode block cannot span a cache line boundary. Nevertheless, alignment at the beginning of a cache line provides the best prefetching benefit.

Optimizing Level 2 and Higher Cache Access

L2 and higher caches are usually shared by multiple cores or hyper-threads, so the effective cache size available to each execution unit is less than the total cache size. Reducing the working set of data that is reused often helps performance by minimizing cache misses. Data that is used only once does not require this kind of optimization.

Larger working sets lead to more cache misses. If the total working set exceeds the sum of all cache sizes, the developer should organize the accesses so that once a cache line is evicted it is not referenced again. When the working set fits within the cache, it is still important to access data in a similar manner so that evictions from L1 do not hamper performance. Overall the intent is to minimize the impact of the cache miss penalty.

Programs can also be dynamic, i.e. they can adjust their data usage to the cache size by querying /sys/devices/system/cpu/cpu*/cache for the available cache levels and their sizes. Keep in mind that multiple processes may be competing for space in the shared caches, and that cache sizes are likely to grow across processor generations.
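
A sketch of reading one of these sysfs entries from C (the mapping of index numbers to cache levels varies between CPUs, so a real program would also read the level and type files; error handling is minimal):

  #include <stdio.h>

  /* Return the size in bytes of the cache described by
     /sys/devices/system/cpu/cpu<cpu>/cache/index<index>/size, or -1 on error. */
  long cache_size_bytes(int cpu, int index)
  {
    char path[128];
    long kib = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index%d/size", cpu, index);
    FILE *f = fopen(path, "r");
    if (f == NULL)
      return -1;
    if (fscanf(f, "%ldK", &kib) != 1)   /* entries look like "32K" */
      kib = -1;
    fclose(f);
    return kib < 0 ? -1 : kib * 1024;
  }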

Optimizing TLB Usage


TLB misses occur when the TLB is flushed or when the pages in use are spread over a wide range of the virtual address space. TLB flushes can be reduced by ensuring fewer context switches, e.g. by pinning processes to cores or by reducing the number of processes. The spread of the virtual address space can be narrowed by turning off Address Space Layout Randomization (ASLR); keep in mind that ASLR is a security measure that makes certain exploits harder, so disabling it is a trade-off.

Prefetching

Prefetching can be triggered by certain events (hardware prefetching) or through explicit requests (software prefetching). Prefetching helps hide the latency of memory accesses; this is different from the instruction pipelining and out-of-order execution capabilities of processors, which can only hide the latency of a cache hit.

Hardware Prefetching

Hardware prefetching is triggered by two or more cache misses in a recognizable pattern. Older implementations recognized only misses to adjacent cache lines; more recent processors also recognize strided patterns. The threshold is set at two misses so that random memory access patterns (as when accessing scattered global variables) do not trigger prefetching.

Processors expect more than one stream of memory accesses and assign each cache miss to a stream; when a stream's threshold is reached, prefetching starts. Current processors can track 8 to 16 streams for the higher-level caches. Each cache has its own unit for recognizing patterns in cache misses: there can be one unit for L1d and L1i, and a separate unit for L2 and higher caches. Since the prefetch unit belongs to the cache, it is shared by all threads/cores or processors sharing that cache, which reduces the effective number of streams available to each.

Prefetching cannot cross page boundaries; pages are the minimum unit of memory the Memory Management Unit (MMU) deals with.




(Figure: MMU virtual-to-physical address translation. Source: https://gabrieletolomei.files.wordpress.com/2013/10/mmu.jpg)

If the prefetcher were to cross a page boundary it could trigger a page fault, and the operating system's page supervisor would have to be called in; it could even decide that the requested page does not belong to the process's memory and signal a segmentation fault!

Note - OS page supervisor is a software component and the prefetcher is a hardware component.

Software Prefetching

Intel provides intrinsics that map to special instructions for software-based prefetching.

  #include <xmmintrin.h>
  enum _mm_hint
  {
    _MM_HINT_T0 = 3,
    _MM_HINT_T1 = 2,
    _MM_HINT_T2 = 1,
    _MM_HINT_NTA = 0
  };
  void _mm_prefetch(void *p, enum _mm_hint h);

The data pointed to by p (the pointer passed to _mm_prefetch) is prefetched and, depending on the hint h, placed in L1d, L2 or a higher-level cache. The T0 hint places the data in L1d and all higher levels (if the caches are inclusive). The T1 hint pulls the data into L2 and not into L1d. The T2 hint does the same one level further out, i.e. L3 if present. The NTA (non-temporal) hint tells the processor that the data will only be used briefly and should not pollute the caches; the processor can then avoid placing it in the lower cache levels (for inclusive caches).
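
A sketch of using the intrinsic while walking a large array (the prefetch distance of 8 doubles, one 64-byte cache line ahead, is an assumption that would need tuning per machine):

  #include <xmmintrin.h>
  #include <stddef.h>

  double sum_with_prefetch(const double *a, size_t n)
  {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
      if (i + 8 < n)
        _mm_prefetch((const char *)&a[i + 8], _MM_HINT_T0);  /* fetch one line ahead */
      sum += a[i];
    }
    return sum;
  }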

AMD provides the prefetchw instruction on its Opteron processors. This instruction prefetches the cache line into L1 and puts it in the 'M' (modified) state, which allows the processor to write to the cache line immediately. Otherwise, if the line is also cached by other processors, it would be loaded in the 'S' (shared) state and a later write would require an additional, slower state transition.
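
On the GCC side, a related and more portable way to express a prefetch-for-write intent is the __builtin_prefetch builtin (this is GCC's interface, not something from Drepper's text; the second argument of 1 means the access will be a write):

  /* Bring the counter's cache line in with write intent before updating it. */
  void bump_counter(long *counter)
  {
    __builtin_prefetch(counter, 1 /* write */, 3 /* high temporal locality */);
    /* ... other work that hides the prefetch latency ... */
    ++*counter;
  }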

GCC provides the -fprefetch-loop-arrays switch, which emits prefetch instructions for arrays iterated over in loops. The switch is useful when the arrays are large; for small arrays it can be a disadvantage. Refer to the gcc documentation for the latest on this switch.

Speculative Loading

Some processors reorganize the sequence of instructions, i.e. they may execute instructions out of order when the instructions are independent and the reordering does not affect the result.

Todo - read more on which processors support this and benefits of this technique.

Hyper Threads to Prefetch

Hyper-threads are often not useful because they cause cache pollution and can degrade overall performance. However, there is a specific scenario where they help. Say a thread iterates over a data structure to do its work. This main thread can be paired with a helper thread that prefetches the next few elements of the iteration into the L2 or L1d cache. The helper thread is a sibling hyper-thread of the main thread, so that the prefetching loads data into the shared cache.

This requires that both threads are scheduled on the same core, i.e. their affinity is set appropriately. Further, if the data structure is small enough to fit in the L2 cache, the helper thread causes a performance degradation, whereas if the data structure is larger than L2, the helper thread actually improves performance. Drepper shows data for this experiment.
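
A sketch of the affinity part using the Linux/glibc API (which logical CPUs are hyper-thread siblings of one core has to be discovered separately, e.g. from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list):

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin the calling thread to one logical CPU; call this from both the main
     thread and the prefetch helper with two sibling hyper-thread ids. */
  static int pin_to_cpu(int cpu)
  {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }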


Todo - see libNUMA to understand how to retrieve information on cores, processors, etc.

Direct Cache Access

Usually a Network Interface Card reads data from the network and copies it directly to memory using DMA (direct memory access); the processor is then interrupted and the device driver reads that memory, which initially causes cache misses until the data is loaded into the cache from main memory. Intel CPUs have a workaround: the network card places the packet payload in memory but writes the packet header directly into the cache, i.e. it uses Direct Cache Access for the header. This allows the device driver to read the header from the cache and then decide what to do with the payload.

Todo - see Intel's paper on this http://web.stanford.edu/group/comparch/papers/huggahalli05.pdf