Non-Uniform Memory Access

Collection of Terms

Cells - a NUMA system consists of multiple sockets, where each socket contains one or more cores, local memory and optionally I/O buses. At the hardware level a socket is also referred to as a cell.

Interconnect - cells in a NUMA system are connected to each other through an interconnect system. Crossbar and point-to-point are common interconnect types.

ccNUMA - a cache-coherent NUMA system, in which memory attached to a cell is visible to all other cells and cache coherency is maintained by the interconnect system.

Memory access time - access to memory local to a cell is faster than access to memory attached to a different cell. A remote access has to go through the interconnect and the memory controller on the remote cell, and both can become sources of contention.

Signal path length - the distance between a cell and the memory it accesses; the memory may be local, on the closest adjacent cell, or on a cell further away. The interconnect may link every cell to every other cell, or a remote access may require hopping through intermediate cells.

numactl --hardware - shows the number of nodes and the relative cost of access between nodes, i.e. the SLIT (System Locality Information Table).

Nodes - Linux maps nodes onto physical cells, i.e. nodes are the software abstraction of cells. Access to local memory is fastest, followed by memory on nodes that are closer. Solaris refers to nodes as locality groups.

Hidden nodes - Linux can hide nodes that represent cells with no attached memory and reassign the cores of the hidden node to another, visible node. These cores will see higher memory access times than cores with local memory.

NUMA Emulation - Linux can emulate nodes by carving up existing nodes; an emulated node manages a portion of the cell's memory. Emulation is useful for testing NUMA-aware applications on non-NUMA hardware, and cpusets use it for memory resource management.

Todo - understand linux cpusets and numa emulation.

Overflow/Fallback - for each node, Linux constructs data structures to manage its memory: free pages, in-use pages, memory usage, etc. Each node's memory has its own set of zones, i.e. DMA, DMA32, Normal, HighMem, Movable. When a zone cannot satisfy a memory allocation request, serving the request from another zone or node is known as overflow or fallback.

Local Allocation - the kernel first attempts to satisfy an allocation request from the local memory of the node on which the request is made, and from the zone for which the memory is requested.

Zonelist - the kernel maintains an ordered list of zones and nodes so that, when a local allocation fails, it can move on to the next zone/node from which the allocation may be attempted. The order of the zonelist reflects a choice between allocating from a different zone on the same node or from the same zone type on a different node. The default zonelist order can be overridden with the kernel boot parameter numa_zonelist_order or via sysctl. If an underlying cell has no attached memory and the system is configured not to hide such nodes, local allocation requests on these nodes traverse a zonelist that points them to the nearest available memory.

NUMA Placement - the process of assigning memory from the nodes available in the system.

Scheduling Domains - data structures used by the Linux scheduler that reflect the underlying NUMA topology. The scheduler tries not to migrate tasks from one domain to another, or to a distant domain.

Managing Local Allocation - a thread can identify its node id using the APIs numa_node_id() or cpu_to_node() and then request memory allocation with that node id, so that the allocation is forced from the local node. If the allocation fails, the thread can follow its own fallback plan. If the kernel is configured not to hide memory-less nodes, threads running on them can use the APIs numa_mem_id() or cpu_to_mem() to identify the memory node closest to them, request the allocation from that node and, on failure, again follow their own fallback plan.
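
The APIs named above are in-kernel interfaces; the closest user space equivalent is libnuma. Below is a minimal sketch, assuming libnuma is installed and the program is linked with -lnuma, of a thread that identifies its current node, requests memory from it and follows its own fallback plan when that fails (the buffer size and the fallback choice are illustrative):

    /* Local allocation with an explicit fallback, sketched with libnuma
       (numa_node_id()/cpu_to_node() are kernel-internal; user space gets
       similar behaviour through libnuma). Build with -lnuma. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* Which node is the calling thread running on right now? */
        int cpu  = sched_getcpu();
        int node = numa_node_of_cpu(cpu);
        printf("running on cpu %d, node %d\n", cpu, node);

        /* Ask for memory on that node only. */
        size_t len = 4 * 1024 * 1024;
        void *buf = numa_alloc_onnode(len, node);
        if (!buf) {
            /* Our own fallback plan: let the kernel pick a node. */
            fprintf(stderr, "allocation on node %d failed, falling back\n", node);
            buf = numa_alloc_local(len);
        }
        if (!buf)
            return 1;

        /* ... use buf ... */
        numa_free(buf, len);
        return 0;
    }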

Memory Policy - NUMA-aware applications can use the memory policy programming interface to request specific allocations or page migration and thereby improve their performance.
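
A minimal sketch of this interface, assuming the mbind(2)/set_mempolicy(2) wrappers from <numaif.h> (provided by libnuma, link with -lnuma); the node numbers are illustrative:

    /* Memory policy sketch: bind one mapping to node 0 and interleave
       another across nodes 0 and 1 (node numbers are only illustrative). */
    #define _GNU_SOURCE
    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 8 * 1024 * 1024;
        void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        unsigned long node0   = 1UL << 0;                /* nodemask {0}   */
        unsigned long nodes01 = (1UL << 0) | (1UL << 1); /* nodemask {0,1} */

        /* Pages of 'a' must come from node 0; MPOL_MF_MOVE migrates any
           pages that were already faulted in elsewhere. */
        if (mbind(a, len, MPOL_BIND, &node0, sizeof(node0) * 8, MPOL_MF_MOVE))
            perror("mbind a");

        /* Pages of 'b' are interleaved round robin across nodes 0 and 1. */
        if (mbind(b, len, MPOL_INTERLEAVE, &nodes01, sizeof(nodes01) * 8, 0))
            perror("mbind b");

        /* set_mempolicy() changes the task-wide default instead of a range:
           from here on, new anonymous pages prefer node 0. */
        if (set_mempolicy(MPOL_PREFERRED, &node0, sizeof(node0) * 8))
            perror("set_mempolicy");

        return 0;
    }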

Cpusets - an administrative mechanism to restrict the nodes from which memory allocation occurs for a set of processes. Cpusets take priority over memory policy when both apply to a thread, i.e. a task cannot request memory from a node that is not configured in its cpuset.
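
A sketch of the administrative side, assuming a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset (file names and paths differ on a cgroup v2 system); the group name, cpu list and pid are made up for illustration, and the same steps are normally done from a shell or a cgroup manager rather than in C:

    /* Confine a pid to the cpus and memory of node 0 via a cpuset.
       Assumes a cgroup v1 cpuset mount at /sys/fs/cgroup/cpuset; the
       group name "numagroup", the cpu list and the pid are illustrative. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s", val);
        return fclose(f);
    }

    int main(void)
    {
        const char *g = "/sys/fs/cgroup/cpuset/numagroup";
        char path[256];

        mkdir(g, 0755);                                   /* create the cpuset  */
        snprintf(path, sizeof(path), "%s/cpuset.cpus", g);
        write_str(path, "0-3");                           /* cpus of node 0     */
        snprintf(path, sizeof(path), "%s/cpuset.mems", g);
        write_str(path, "0");                             /* memory node 0 only */
        snprintf(path, sizeof(path), "%s/tasks", g);
        write_str(path, "12345");                         /* pid to confine     */
        return 0;
    }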

Numalink - a proprietary interconnect that connects sockets across multiple chassis.

Todo - read more about performance differences between access to memory for cores within a socket i.e. what is the memory latency for different cores w.r.t. their position on die and distance from the memory controller.

Different OS Approaches to Handling NUMA

Ignore - Treat all memory as equal, with no NUMA placement. This approach is useful when the performance difference across memory nodes is small, and the OS needs no modifications for NUMA versus non-NUMA hardware. Application performance becomes non-deterministic and varies from run to run.

Memory Striping - The hardware is configured so that consecutive cache lines in an address space map to different memory controllers, i.e. cache lines are interleaved. The OS needs no NUMA-specific handling. Load is spread evenly across all nodes, but the interconnect is constantly in use.

Heuristic Memory Placement - The OS is NUMA aware and attempts to allocate memory for a thread from the node on which the thread is running. The OS assumes that the thread will continue to run on the same node, will not attempt to allocate more memory than is available on that node, and will not migrate to another node.

Application Control - The OS provides an interface through which applications control their own memory allocation. An application can choose to allocate memory on a single node, or across nodes when the data is shared between threads.

Linux & NUMA

Linux divides memory into zones, i.e. Normal, DMA, DMA32, HighMem. On a NUMA system Linux divides the memory of each node into these zones, i.e. Node 0 - DMA, Node 1 - DMA, etc.; this can be viewed in /proc/zoneinfo. Memory is allocated from a zone according to the memory policy. Memory policies can be specified at the system level, at the process level or for a range of addresses in a process address space, with the more specific policies overriding the more general ones. The two main memory policies are Local, where allocation occurs from the memory node local to the thread, and Interleave, where allocation occurs round robin across all nodes to distribute memory accesses and place an even load on the interconnect.

On boot up the system policy is set to Interleave so that system data structures are distributed evenly across all nodes. The policy changes to Local when the first user space process (init) is started. The allocation details for each process can be viewed in /proc/<pid>/numa_maps.
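
A small sketch, assuming the get_mempolicy(2) wrapper from <numaif.h> (libnuma, -lnuma), that queries the policy in effect for the calling thread and the node a given page actually landed on; it is a programmatic view of the same information /proc/<pid>/numa_maps exposes:

    /* Query the calling thread's memory policy and find out which node a
       given page actually lives on. */
    #define _GNU_SOURCE
    #include <numaif.h>
    #include <sys/mman.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        /* Task policy: MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND or
           MPOL_INTERLEAVE. */
        int mode;
        if (get_mempolicy(&mode, NULL, 0, NULL, 0) == 0)
            printf("task policy mode: %d\n", mode);

        /* Fault a page in, then ask which node it was placed on. */
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(p, 0, len);                       /* first touch allocates it */

        int node;
        if (get_mempolicy(&node, NULL, 0, p, MPOL_F_NODE | MPOL_F_ADDR) == 0)
            printf("page at %p is on node %d\n", (void *)p, node);
        return 0;
    }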

When a process starts, it inherits the policy of its parent, i.e. Local. During load balancing the Linux scheduler tries to keep the process cache hot by first attempting to schedule it on processors that share its L1 cache, then on processors that share the L2 cache, and finally on processors that share the L3 cache. Beyond this, if the situation warrants, the scheduler will balance the process onto any other processor on the same node and, as a last resort, move it to another NUMA node. Once a process is scheduled on a different NUMA node it accesses its memory remotely, and its performance degrades.

Note - a 2012 patch set aimed to reduce this performance degradation; details at https://lwn.net/Articles/486858/.

As memory allocations are made across zones and nodes, there comes a time when the system cannot satisfy an allocation because the node, or the zone-node combination, does not have sufficient free memory. If /proc/sys/vm/zone_reclaim_mode is set to 1, the system runs a reclaim pass and frees up local memory to serve the allocation request. If it is set to 0, the request is served from another node that has sufficient free memory, and reclaim happens later when all nodes are low on free memory. When the value is 1 the reclaim pass is lightweight, i.e. only unmapped page-cache pages are reclaimed.

Todo - understand reclaim of page-cache pages better.

numactl is a command line tool for setting up processes on a NUMA system. It can restrict the execution of a process to a set of cores and its memory allocations to a set of nodes, and it also displays NUMA-related information. taskset is another tool with a subset of numactl's capabilities, carried forward from non-NUMA environments.

numastat is a tool to display per-node allocation statistics. The numa_miss counter indicates the number of allocations that were served from a different node in order to avoid a reclaim pass.

/proc/<pid>/numa_maps contains information on the memory of a process. Anon stands for pages that have no associated file on disk. The N<x> fields show the number of pages on node x.

/proc/meminfo contains information about how memory is used in the whole system; /sys/devices/system/node/node<x>/meminfo contains the same information for a single node.

Linux First Touch Policy

When a page is first accessed it is not yet in memory, which causes a page fault; memory for the page is then allocated, the page table and TLB are updated, and the instruction that accessed the page is restarted. Up to the point of first access there is no memory allocation. The thread accessing the page may have had one memory policy when it started and a different one by the time it accessed the page; what matters is the policy in effect when the memory was allocated, i.e. at first access. The first touch policy refers to this: the policy that applies to a page is the one in effect at its first use/allocation.

If a page is used by only one thread, applying the memory policy is straightforward. However, when multiple threads access a (shared) page and have different memory policies in effect when they first touch it, the policy of whichever thread accesses the page first is applied, which can lead to unwanted results.
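
A sketch of first touch with two threads, assuming a system with at least two nodes and libnuma available (build with -lnuma -lpthread); each thread is steered onto a different node before it touches its half of a shared mapping, so under the default Local policy each half is allocated on that thread's node:

    /* First-touch sketch: two threads touch different halves of the same
       anonymous mapping while running on different nodes, so each half is
       allocated on "its" thread's node. Node numbers are illustrative. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <numaif.h>
    #include <pthread.h>
    #include <string.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static char *buf;
    static size_t half = 16 * 1024 * 1024;

    static void *toucher(void *arg)
    {
        int node = (int)(long)arg;
        numa_run_on_node(node);             /* run on cpus of this node     */
        memset(buf + node * half, 1, half); /* first touch -> local to node */
        return NULL;
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1) {
            fprintf(stderr, "need a NUMA system with at least 2 nodes\n");
            return 1;
        }
        buf = mmap(NULL, 2 * half, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        pthread_t t0, t1;
        pthread_create(&t0, NULL, toucher, (void *)0L);
        pthread_create(&t1, NULL, toucher, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* Ask the kernel where one page from each half ended up. */
        for (int node = 0; node <= 1; node++) {
            int where;
            get_mempolicy(&where, NULL, 0, buf + node * half,
                          MPOL_F_NODE | MPOL_F_ADDR);
            printf("half touched on node %d lives on node %d\n", node, where);
        }
        return 0;
    }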

Because the kernel initially boots with an Interleave policy, the text segments of the shared libraries it loads end up spread across the nodes. When the init daemon starts, the policy changes to Local. User space applications loaded from this point onwards that refer to the same shared libraries will have to access them on the nodes where they were already loaded.

Todo - this is interesting, read more on policies and shared libraries.

A process can move its memory from one node to another. The virtual address space does not change; only the physical location of the data changes. /proc/<pid>/numa_maps can be viewed before and after the move to see the difference. Moving memory across nodes can be useful for reducing interconnect traffic. Text pages of shared libraries may not be movable, or moving them may have a negative impact. migratepages is a Linux command line tool to manually move the pages of a given process id.
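
A sketch of the same operation done programmatically, assuming libnuma (-lnuma): it moves a process's pages between node sets, roughly what migratepages <pid> <from-nodes> <to-nodes> does, with the pid and node strings taken from the command line:

    /* Migration sketch: move the pages of a process from one set of nodes
       to another, e.g. "./migrate 12345 0 1". Build with -lnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <pid> <from-nodes> <to-nodes>\n", argv[0]);
            return 1;
        }
        int pid = atoi(argv[1]);
        struct bitmask *from = numa_parse_nodestring(argv[2]); /* e.g. "0" */
        struct bitmask *to   = numa_parse_nodestring(argv[3]); /* e.g. "1" */
        if (!from || !to) {
            fprintf(stderr, "bad node string\n");
            return 1;
        }

        /* Virtual addresses stay the same; only the physical placement of
           the pages changes. Compare /proc/<pid>/numa_maps before/after. */
        if (numa_migrate_pages(pid, from, to) < 0) {
            perror("numa_migrate_pages");
            return 1;
        }
        return 0;
    }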