Memory
Memory Management Unit (MMU) is hardware that manages virtual to physical address translation, memory protection and cache management. The smallest unit of memory the MMU deals with is known as a Page. Usually a page is 4KB on a 32-bit machine.
The kernel represents every Page with a 'struct page' structure. This allows it to track if a page is free and, if not, who owns the page. Owners can be user space processes, dynamically allocated kernel data, static kernel code, page cache, etc.
If a system has 4GB physical memory and the page size is 8KB then that is 524,288 pages -
4GB = 4 * 2^10 * 2^10 KB = 2^22 KB
4GB / 8KB = 2^22 / 2^3 = 2^19 = 524,288 pages
If the size of struct page is 40 bytes then the total memory used by these page structures is -
524,288 * 40 = 20MB
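A quick, runnable back-of-the-envelope check of the arithmetic above (plain user-space C; the 40-byte struct page size is just the figure assumed in these notes):

```c
#include <stdio.h>

/* Back-of-the-envelope check of the page count and struct page overhead. */
int main(void)
{
    unsigned long long mem_bytes  = 4ULL << 30;   /* 4GB physical memory      */
    unsigned long long page_bytes = 8ULL << 10;   /* 8KB page size            */
    unsigned long long pages      = mem_bytes / page_bytes;
    unsigned long long overhead   = pages * 40;   /* 40 bytes per struct page */

    printf("pages:    %llu\n", pages);                      /* 524288 */
    printf("overhead: %llu bytes (~%llu MB)\n", overhead,
           overhead >> 20);                                 /* ~20MB  */
    return 0;
}
```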
Process -> Memory Access using Virtual Memory Address -> MMU -> Maps to Physical Memory -> Memory
Not all pages are identical i.e. some pages cannot be used for certain tasks because of where their physical address falls in memory. Pages are therefore divided into different zones, which are groups of pages with similar properties.
There are two hardware limitations with respect to memory addressing -
- some hardware devices can perform DMA to only certain memory addresses.
- some architectures can physically address larger amounts of memory than they can virtually address, thus some memory is not mapped into kernel address space.
There are 4 primary memory zones (defined in linux/mmzone.h) -
- ZONE_DMA - pages that can undergo DMA.
- ZONE_DMA32 - same as above, but only accessible by 32-bit devices.
- ZONE_NORMAL - contains regularly mapped pages.
- ZONE_HIGHMEM - pages which are not permanently mapped into kernel address space.
On x86 32-bit systems, ZONE_DMA is 0 to 16MB, ZONE_NORMAL is 16 to 896MB and ZONE_HIGHMEM is memory above 896MB. On 64-bit systems there is no ZONE_HIGHMEM, all memory from 16MB onward is ZONE_NORMAL.
When memory allocation requests come in, the kernel allocates memory from ZONE_DMA if the request is for DMA-able memory and from ZONE_NORMAL if the request is for normal memory. If there is a memory crunch the kernel can satisfy normal allocations from ZONE_DMA.
The kernel allows allocating pages in powers of two, i.e. 2^x pages where x = 0 gives 2^0 = 1 page, x = 1 gives 2^1 = 2 pages, x = 2 gives 2^2 = 4 pages and so on. The kernel allocates contiguous physical pages. It returns NULL if allocation fails.
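A minimal sketch of the low-level page allocation interface; __get_free_pages(), free_pages() and GFP_KERNEL are the standard kernel names, while the wrapper function here is purely illustrative:

```c
#include <linux/gfp.h>
#include <linux/errno.h>

/* Allocate and release 2^3 = 8 physically contiguous pages. */
static int order3_alloc_example(void)
{
        /* Returns the kernel virtual address of the first page, or 0
         * if no contiguous run of that order is available. */
        unsigned long addr = __get_free_pages(GFP_KERNEL, 3);

        if (!addr)
                return -ENOMEM;

        /* ... use the 8 pages starting at addr ... */

        free_pages(addr, 3);    /* the order must match the allocation */
        return 0;
}
```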
The kmalloc() function allows allocating memory in chunks of bytes instead of pages. It may allocate more than what's requested to satisfy other allocation optimizations, but never less. The caller has no way to determine if more memory was allocated. It returns NULL if allocation fails.
Get Free Pages (GFP) flags are passed in allocation requests with kmalloc(); they specify Action, Zone and Type modifiers.
Action Modifiers - specify how the kernel should allocate memory e.g. interrupt handlers must instruct the kernel not to sleep when allocating memory.
Zone Modifiers - specify which zone to allocate the memory in.
Type Modifiers - are a combination of Action and Zone modifiers. They are simpler and less error prone.
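A hedged sketch of how these modifiers are passed to kmalloc(); the flags are the real kernel ones, while struct packet_buf and the wrapper function are illustrative:

```c
#include <linux/slab.h>

/* Illustrative payload type, not a real kernel structure. */
struct packet_buf {
        char data[256];
};

static void gfp_examples(void)
{
        /* Process context: GFP_KERNEL may sleep while memory is found. */
        struct packet_buf *buf = kmalloc(sizeof(*buf), GFP_KERNEL);

        /* Interrupt context: GFP_ATOMIC must not sleep, so it is more
         * likely to fail under memory pressure. */
        struct packet_buf *irq_buf = kmalloc(sizeof(*irq_buf), GFP_ATOMIC);

        /* Zone modifier: ask for memory from ZONE_DMA. */
        struct packet_buf *dma_buf = kmalloc(sizeof(*dma_buf),
                                             GFP_KERNEL | GFP_DMA);

        kfree(dma_buf);         /* kfree(NULL) is a no-op, so failed  */
        kfree(irq_buf);         /* allocations are safe to pass back  */
        kfree(buf);
}
```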
vmalloc() - similar to kmalloc() except it allocates contiguous virtual addresses although the physical pages may not be contiguous. kmalloc() ensures contiguous physical pages. vmalloc() is slower than kmalloc() since it has to map the non-contiguous physical pages into the page table entries. vmalloc() may also sleep, because of which a lot of kernel code calls kmalloc() even if it doesn't need contiguous physical space.
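A short sketch of the vmalloc() interface for contrast; the function and header are the standard ones, the wrapper and buffer size are illustrative:

```c
#include <linux/vmalloc.h>

static void vmalloc_example(void)
{
        /* 4MB buffer: virtually contiguous, but the backing physical
         * pages may be scattered.  vmalloc() may sleep. */
        char *buf = vmalloc(4 * 1024 * 1024);

        if (!buf)
                return;

        /* ... use buf like any other buffer ... */

        vfree(buf);
}
```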
The kernel maintains a cache of frequently allocated and released objects, one cache per object type. Each cache is divided into slabs. Each slab is one or more physically contiguous pages. The objects are allocated on these pages. A slab is either Full, Partially Full or Empty.
Slabs can have specific attributes applied when they are created (see the sketch after this list) -
- align with hardware cache i.e. each object is aligned to cache line boundaries to prevent false sharing.
- poisoning - prefilled values to detect uninitialized access.
- red zone - buffers around each object to detect buffer overruns.
- panic - when allocation fails and isn't expected to, the slab layer should panic
- dma - allocate slab in DMA-able memory
Todo - read more on slab attributes
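A minimal sketch of the slab cache interface using the attributes above; kmem_cache_create()/alloc()/free()/destroy() and the SLAB_* flags are the standard kernel names, while struct my_object and the wrapper are illustrative:

```c
#include <linux/slab.h>
#include <linux/errno.h>

/* Illustrative object type; any frequently allocated structure works. */
struct my_object {
        int id;
        char name[32];
};

static struct kmem_cache *my_cache;

static int my_cache_example(void)
{
        struct my_object *obj;

        /* SLAB_HWCACHE_ALIGN keeps objects on cache line boundaries;
         * SLAB_PANIC makes the slab layer panic if creation fails. */
        my_cache = kmem_cache_create("my_object", sizeof(struct my_object),
                                     0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);

        /* Objects are handed out from the cache's slabs and returned to them. */
        obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
        if (!obj)
                return -ENOMEM;

        kmem_cache_free(my_cache, obj);
        kmem_cache_destroy(my_cache);
        return 0;
}
```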
Per-CPU Data - a programming technique where data is allocated for each CPU and, by convention, a thread only accesses the copy belonging to the CPU it is running on (though it is possible to cheat). Per-CPU allocation removes the need for locking and avoids cache invalidation.
Todo - read more about Per-CPU data
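A minimal sketch of the per-CPU interfaces; DEFINE_PER_CPU, get_cpu_var/put_cpu_var and per_cpu are the standard kernel macros, while the counter and helpers are illustrative:

```c
#include <linux/percpu.h>

/* One counter per CPU; the local copy needs no lock. */
static DEFINE_PER_CPU(unsigned long, my_counter);

static void count_event(void)
{
        /* get_cpu_var() disables preemption so the task cannot migrate
         * to another CPU while it touches its copy of the variable. */
        get_cpu_var(my_counter)++;
        put_cpu_var(my_counter);
}

/* "Cheating": another CPU's copy can be read, but doing so safely
 * needs extra care (this is where locking would creep back in). */
static unsigned long read_counter(int cpu)
{
        return per_cpu(my_counter, cpu);
}
```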
Virtual File System
VFS is an abstraction layer over different file systems. It provides a consistent interface for the kernel to issue filesystem operations.
write(fd, 100, ...) -> sys.write() -> fs.write() -> ((media))
User Space App -> VFS -> file system specific -> ((media))
Unix has 4 file system abstractions - files, directory entries, inodes and mount points.
File systems are mounted at a specific mount point in a global namespace - thus all file systems appear as entries in a single tree e.g. /mnt/raspberry/local & /mnt/raspberry/FAT.
DOS breaks file systems by drive i.e. C:\, D:\, etc.
In Unix directories are files which list files contained in them. Thus the same read, write, create and delete operations that apply on files apply to directories.
Information associated with files such as permissions, owner, creation time, etc. is stored separately in inodes (index nodes). Information associated with the file system is stored in the Super Block.
The Linux VFS is designed to work with file systems that support or resemble these concepts - files, inodes, super block, mount point.
There is also a concept/structure, the dentry, used to denote each component of a path. /bin/vi has three dentry components - /, bin and vi. The first two are directories and the third is a file. dentry objects are created on the fly and are used for path related operations such as lookup. As these path related operations can be string heavy, linux optimizes them by creating dentry objects.
dentry objects are cached so that future path lookups can be completed quickly. An object can be in one of these states -
- 'used' implies that it points to a valid inode and is in use.
- 'unused' implies that it points to a valid inode but is not in use.
- 'negative' implies that it does not point to a valid inode. Negative dentry objects are cached to quickly identify invalid paths, non-existent files, etc. so that repeated attempts to access them can be failed quickly.
dentry objects are cached in the dcache. dentry lookups usually show temporal and spatial locality i.e. the same file may be looked up more than once (temporal) and files in the same sub-directory are likely to be accessed if one of them has been accessed (spatial).
The file object is used to represent a process's view of an open file. Multiple file objects can be created by one or more processes which all point to the same file. A file object points to a dentry object, which in turn points to the inode object for the file. dentry and inode objects are unique (there is only one of each per file).
Operations such as read, write, open, close are applied on the file object.
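The file object dispatches these operations through a table of function pointers. A minimal sketch of such a table with hypothetical handlers; the struct file_operations fields are real, while my_open/my_read and their trivial bodies are illustrative:

```c
#include <linux/fs.h>
#include <linux/module.h>

/* Hypothetical handlers a driver or filesystem would implement. */
static ssize_t my_read(struct file *filp, char __user *buf,
                       size_t len, loff_t *ppos)
{
        return 0;               /* report end-of-file */
}

static int my_open(struct inode *inode, struct file *filp)
{
        return 0;               /* nothing to set up in this sketch */
}

/* The VFS calls these through the file object's f_op pointer, so a
 * user-space read()/open()/close() on this file ends up here. */
static const struct file_operations my_fops = {
        .owner = THIS_MODULE,
        .open  = my_open,
        .read  = my_read,
};
```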
Every process has its list of open files, root filesystem, current working directory, mount points, etc. This info is stored in three data structures: files_struct, fs_struct and namespace.
files_struct contains the list of open files for a process. The macro NR_OPEN_DEFAULT is usually set to 64 when building the linux kernel, so files_struct contains an array of size 64 which points to up to 64 open files per process. If a process opens more than 64 files another array is allocated to hold pointers to them. A static array of 64 allows fast lookup of open files. If it's known that a lot of processes will open more than 64 files then redefine the macro and rebuild the kernel.
fs_struct holds current working directory and root directory of the process.
namespace holds the view of the filesystem. This is usually the same for all processes; however the kernel allows defining per-process namespaces.
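A simplified, hedged excerpt of how these three structures hang off the process descriptor; field names follow the 2.6-era kernel described in these notes (newer kernels reach the namespace through a struct nsproxy):

```c
/* Simplified excerpt of the process descriptor; not the full definition. */
struct task_struct {
        /* ... scheduling state, signals, memory descriptor, ... */
        struct files_struct *files;     /* open files (fd -> struct file)             */
        struct fs_struct    *fs;        /* root and current working directory         */
        struct namespace    *namespace; /* this process's view of mounted filesystems */
        /* ... */
};
```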
Block I/O
Block Devices - HDD, DVD, Flash Memory, etc. are hardware devices which allow random (not necessarily sequential) access to fixed-size chunks of data known as blocks.
Character Devices - accessed as a stream of sequential data, one byte at a time e.g. keyboards. There is no random access.
Block I/O Layer - an entire subsystem to manage block devices, whereas char devices don't need this special attention.
Sector - smallest addressable unit on a block device. Usually a power of 2, with 512 bytes as the common size. A device can work with multiple sectors at a time but not less than one sector. The sector size is a hardware property.
Block - filesystems use blocks as the unit with which they access devices. A block can be one or more sectors but not less than one sector. Further, the kernel requires the block size to be less than or equal to the Page size.
Sector <= Block <= Page
Buffer and Buffer Head - when a block is stored in memory it is stored as a buffer. Each buffer has an associated buffer head which is a structure that acts as a header describing the state of the buffer e.g. is it up to date, is it dirty and needs to be written back to disk, etc.
Large block I/O operations are broken down into multiple buffers and their buffer heads - which is an unnecessary overhead.
Block I/O (BIO) Structure - is an improvement over buffers. A bio struct contains a list of bio vectors. Each bio vector points to a page in memory. Each page holds one or more blocks. Thus bio structures can represent multiple pages, versus buffers which each represent a single block in a single page. This is useful for setups such as RAID (Redundant Array of Inexpensive Disks), which spread a single I/O operation across multiple disks.
Buffer Head structures are still used to map a block to a page and to track the state of a buffer. Bio structures are used to track an in-flight I/O operation. The two uses are kept in separate structures.
Todo - need to understand Buffer and BIO in more detail.
Bio structures represent contiguous chunks of blocks. However an I/O operation can involve multiple discontiguous chunks of blocks. In this case there is one bio structure for each contiguous set of blocks.
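A simplified sketch of the 2.6-era structures behind this, loosely based on <linux/bio.h> (later kernels move the sector and index fields into a struct bvec_iter):

```c
/* Each segment: one page (or part of one) holding data for the transfer. */
struct bio_vec {
        struct page   *bv_page;     /* the page holding the data            */
        unsigned int   bv_len;      /* length of this segment in bytes      */
        unsigned int   bv_offset;   /* offset of the data within the page   */
};

/* One in-flight I/O operation over a contiguous run of blocks. */
struct bio {
        sector_t        bi_sector;  /* first sector of the transfer
                                       (sector_t comes from kernel headers) */
        unsigned short  bi_vcnt;    /* number of bio_vecs in the array      */
        struct bio_vec *bi_io_vec;  /* the segment array, one entry per page */
        /* ... flags, completion callback, link to the request, ... */
};
```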
Each block device maintains a doubly linked request queue for all pending I/O operations. The device driver picks up a request from the queue and submits it to the device. Each request consists of one or more bio structures.
I/O Schedulers
I/O schedulers manage a block device's request queue by merging and sorting the requests on the queue to minimize disk seeks and thus improve global throughput. The scheduler may be partial/unfair to certain requests so that disk seeks are minimized and overall system performance is improved. I/O schedulers merge requests for adjacent or identical sectors and sort requests so that seeks are sequential. They are also termed I/O elevators.
The first I/O scheduler in linux was called the Linus Elevator! It was replaced in v2.6.
Read vs Write Requests - An application submitting a write request may do so asynchronously i.e. after submitting the write request it may move on to other tasks. Whereas an application submitting a read request is likely to block till it receives the data. Thus read requests need to be served with urgency over write requests, without starving the write requests.
Deadline I/O Scheduler - a scheduler that gives read requests higher priority than write requests while assigning every request an expiration deadline so that neither reads nor writes are starved.
Anticipatory I/O Scheduler - builds on top of the Deadline I/O scheduler. After a read operation it waits up to 6ms, in anticipation of another nearby read, before returning to its previous work, thus minimizing disk seeks. Like the deadline scheduler it gives reads higher priority over writes.
Complete Fair Queuing I/O Scheduler - The CFQ I/O scheduler maintains a request queue for each process, round robins over all the queues and submits a predefined number of requests from each queue (usually 4 requests). This is the default I/O scheduler on linux for desktop workloads. However it performs well in most other scenarios as well.
NOOP Scheduler - performs no sorting, it merges adjacent requests and submits the requests without any preference for reads or writes. It is useful for devices where seek time is not an issue such as flash memories.
Different schedulers can be enabled at boot time with the elevator=xyz option where xyz can be deadline, as, cfq or noop (cfq is the default).