Concurrency Optimizations
Drepper shows data from experiments where multiple threads (of the same process) access read-only and writable data. If the threads are scheduled (or pinned) to the same processor, the data is most likely cached once for all cores in the shared L2 cache and individually per core in the L1d cache. Read-only data is shared i.e. cache lines containing it are marked 'S'. Writable data, however, causes contention: each thread sends out RFOs (Requests For Ownership) to gain exclusive access to the cache line.
If read-only data is marked 'const', the compiler places the variables in the .rodata (read-only data) or .data.rel.ro (read-only after relocation) section. With const variables segregated into their own section, the chance of one sharing a cache line with a writable variable is greatly reduced.
Variables can be qualified with the __thread keyword, which places them in Thread Local Storage (TLS). Each thread then gets its own copy, accessible only by that thread, so these variables will not end up on a cache line shared with variables accessed by other threads. The owning thread can, however, pass pointers to its copies to another thread to grant access. TLS variables require special handling i.e. when a thread is created, they are initialized and managed with certain costs involved. If a thread never accesses any TLS variable, none of them are initialized; if it accesses even one TLS variable, all of its TLS variables are initialized.
Atomic Operations
Most architectures provide 4 types of atomic operations -
- Bit Test - These operations set or clear a bit atomically and return a status indicating whether the bit was set before.
- Load Lock/Store Conditional (LL/SC) - These operations work as a pair: the load instruction starts a transaction and the final store succeeds only if the location has not been modified in the meantime.
- Compare and Swap (CAS) - This is a ternary operation: it compares the value at an address (first parameter) with the second parameter and, if they match, writes the third parameter to the address.
- Atomic Arithmetic - These operations perform arithmetic and logic operations directly on memory locations. They are available only on x86 and x86_64 processors.
LL/SC and CAS are equivalent in power, so architectures support only one of the two. CAS is preferred.
CAS logic -
int curval;
int newval;
do {
  curval = var;
  newval = curval + addend;
} while (!CAS(&var, curval, newval));
CAS returns true when it succeeds in updating the value at &var, so the loop retries until the update goes through.
LL/SC logic -
int curval;
int newval;
do {
  curval = LL(var);
  newval = curval + addend;
} while (!SC(var, newval));
LL(var) is a special load instruction. SC returns true on success and, unlike CAS, does not need the old value for comparison, since the processor itself tracks whether the memory location has been modified in the meantime.
Atomic Optimizations
Since CAS is more popular and universal across architectures, it is most commonly used for all atomic operations. However, a CAS loop carries a higher penalty than the alternative atomic operations. Thus where programs can use atomic operations other than CAS (or LL/SC) - e.g. an atomic add or bit test - the performance gains are large.
Todo - see gcc's __sync_* builtins to understand this better.
Cache Affinity
If two threads of the same process share the same data set but are scheduled on different cores, there is a high chance of contention: when writing to shared variables they fight each other to mark their cache line 'Exclusive' and to re-fetch the line after the other thread has written to it.
On a multi-processor (NUMA) architecture, if the threads are scheduled on different processors the problem extends beyond the L1d and L2 caches and even the shared L3 cache.
Similarly, if two unrelated threads (of different processes) are scheduled on the same core, they fight over the L1d and L1i caches to bring in their own data and instructions.
The obvious solution is to ensure that threads are scheduled so that their negative effects on cache/memory usage are reduced. The Linux scheduler does not handle these situations automagically. The appropriate technique is to set thread affinity: the sched.h and pthread.h headers provide various calls and macros to assign core/processor affinity to threads.
Todo - read more about sched.h, pthread.h and /proc/irq/default_smp_affinity for thread/core affinity setup.