...: Linux Kernel Development by Robert Love

System Calls

System calls are an interface provided by the kernel to interact with the hardware. API provided by C-Library is an interface to these system calls in the kernel.

Application -> C-Library -> Kernel
printf() -> printf() in C-Library -> write() in C-Library -> write() system call

User process invokes an API, which signals to the kernel that it wants to execute a system call by incurring an exception (of type 128 or 0x80 on x86). This causes a switch to kernel mode and the execution of exception vector 128 which is the SysCall Handler. This handler is provided the system call number/identifier in the EAX register. This value is populated by the C-library API. The correct system call is then invoked in kernel mode.

printf() in User process -> printf() in C-library -> setup write identifier in EAX, trigger exception 128 -> switch to kernel mode -> invoke exception handler for id 128 i.e. SysCall Handler -> SysCall Handler reads EAX and invokes appropriate system call in kernel mode

System calls have to be reentrant. If a system call blocks, the user process can be preempted and another user process can be scheduled which may invoke the same system call.

Interrupts

Interrupts allow hardware to send signals to the processor. Multiple hardware devices are connected to the Interrupt Controller chip over separate pins i.e. a device can have exclusive access to the pin or multiple devices can share access to a pin. The interrupt controller multiplexes these signals over to a single line access to the processor.

Note - exceptions occur synchronously with respect to the clock, however interrupts occur asynchronously.

Interrupt handlers are response functions run by the kernel when an interrupt is received. They are part of the device's drivers. Every interrupt line has a unique numeric value assigned which identifies the device (or devices) and its drivers to handle interrupts on the line. Multiple interrupt handlers can be registered for the same interrupt line i.e. the line is shared. On a shared interrupt line, the kernel sequentially executes all registered handlers, each handler should quickly determine if the interrupt was generated by its device and if not than quickly exit.

Interrupt handlers are divided into Top Halves and Bottom Halves. Top half contains time critical response to an interrupt e.g. reading buffers on network cards. Bottom half contains anything that can be delayed e.g. parsing the network packets.

An interrupt handler can disable all other interrupts when it is running. This shouldn't be done unless absolutely necessary.

When executing an interrupt handler, the kernel is in Interrupt Context. This is not associated with any process and thus the handler cannot call any system call that can sleep, otherwise there would be no way to schedule it.

Note - /proc/interrupts file contains statistics on system interrupts.

Note - /proc is virtual filesystem that exists in kernel memory. Referred to as Procfs. Reading/writing to Procfs is done via kernel functions which simulate read/write to a file.

Bottom Halves are implemented as Softirq, Tasklet or Work Queues.

Softirq - are statically define and registered at compile time. They cannot be dynamically registered or destroyed. Kernel imposes a limit of 32 softirqs, though there are only 9. The top halve of the interrupt handler marks it softirq to run before it exits. The softirq is then run at -
- when returning from hardware interrupt code path
- in the ksoftirqd thread
- in code that checks for pending softirqs and executes them
Todo - the last one isn't clear, read more.

Softirqs are used by the most timing critical and important bottom half processing subsystem e.g. networking and block devices.

Softirqs run with interrupts enabled. A softirq cannot preempt another softirq. A running softirq prevents other softirq from running on the same processor. Two or more of the same or different types can run simultaneously on different processors. Global data sharing across instances of softirqs needs to be locked. Most softirqs have per-processor data.

Tasklets - are bottom half mechanisms built on top of softirqs. Multiple instances of the same handler cannot run concurrently on multiple processors. Two or more tasklets can run concurrently on different processors. Tasklets also cannot sleep.

They are represented by two softirqs - HI_SOFTIRQ and TASKLET_SOFTIRQ, the former runs before the latter.

Interrupts are enabled when tasklets are run. As an optimization, a tasklet runs on the same processor that scheduled it i.e. to optimize processor cache usage.

Todo - Read more on tasklets and softirqs can reactivate themselves.

ksoftirq/n - are kernel threads, one per processor, with the lowest possible priority i.e. nice value = 19. Kernel does not process reactivated softirqs and tasklets immediately. Rather when threads are scheduled they run in a tight loop and schedule any pending softirq/tasklet. On a heavily load system this ensures interrupt reactivation does not consume all the processor and on idle systems interrupt reactivations are eventually handled.

Work Queues - are used when bottom halves need to run in a process context and can thus sleep. This is useful when memory allocation needs to be done (as that can lead to a sleep) or for block I/O. Work queues run in a dedicated kernel thread instead of creating threads for every bottom half. These thread are named 'events/n' are are one for each processor. events/n are known as Worker threads.

An interrupt needs to handle its work in its own thread it can create a separate kernel thread.

Kernel Synchronization

Recursive locks - when a thread can acquire the same lock more than once. Linux does not support this feature.

Deadlock - can occur when a thread attempts to acquire the same lock more than once or when two threads hold a lock each and wait for the other thread to release the other lock (deadly embrace). Document lock sequence to avoid this.

Scalability & Contention - Locks that are acquired often or for long time periods are likely to cause contention. When a large number of threads are in contention for a resource it becomes a bottleneck for scalability. Too coarse locking can cause bottlenecks as processor count grows and too fine locking can become an overhead for smaller number of processors. Balance the two.

Spin Locks - on thread acquires the lock and other threads wait for the lock to be freed, constantly checking. The waiting threads don't sleep. They waste CPU cycles waiting for the lock. The lock should b e acquired for short time periods i.e.e less than two context switches otherwise its better to sleep (using semaphores).

Kernel preemption is disabled when a process holds a spin lock.

Note - Interrupt handlers can use spin locks but not semaphores, as they can't sleep. If interrupt handlers use spin locks they should disable interrupts on the current processor and then acquire spin locks otherwise if they get preempted because of an interrupt the new interrupt handler can spin endlessly to acquire a spin lock already held by the previous handler.

Note - always lock data and not code.

Read/Write locks - are concurrent reader locks or one mutually exclusive write lock. A thread can acquire read lock multiple times. If multiple readers are waiting for a lock along with one or more writers and the lock is held by a reader than the kernel implementation will give priority to the waiting readers over the writers. This can cause starvation to the writer.

Semaphores - are spin locks where waiting threads sleep. Counting semaphores allow more than one thread to acquire the lock. As these cause threads to sleep i.e. there is a cost of context switch to sleep state and awake i.e. 2 * context switch. A counting semaphore with count = 1 is a generic mutex.

Reader-Writer Semaphores - are mutexes with count=1, though they only enforce mutual exclusion for writers but multiple readers can acquire the lock.

Mutex - kernel has a specific mutex implementation which is different from a semaphore with count = 1. It has additional constraints i.e. -
- it cannot be acquired in one context and released in another.
- it cannot be recursive acquired.
- a process holding a mutex cannot exit.
- it cannot be acquired by an interrupt or bottom half.

Big Kernel Lock (BKL) - is a global spin lock that was crated during the transition from uni-processor to symmetrical multi-processor. It can be acquired only in process context and the process can sleep while holding the lock, but when the process sleeps it releases the lock and acquires it when its scheduled again. Its a recursive lock, all calls to acquire need to be matched with release calls.

Sequential Locks - are used when there are many readers and few writers. It favors writers over readers. When a writer acquires the lock it increments the sequence number and when it releases the lock it increments again. The sequence starts at zero. Its odd when data is being written and even when data is not being written. Only one writer can acquire the lock. When a reader acquires the lock, the sequence number is checked before and after the data is read, if the sequence number is the same then the data was not written to while being read otherwise the reader loops.

Timers

System timer - is a programmable hardware component that generates an interrupt at fixed time intervals. On x86 this is 100 times per second or every 10ms or frequency of 100Hz. The interrupt handler for this is the Timer Interrupt which updates the system time and performs routine work.

Tick Rate - a high tick rate increases precision as the interrupt is fired more frequently and routine rescheduling is performed more often. However this also brings in an overhead of handling the interrupt. On an idle system a lower tick rate helps reduce power consumption and on a busy system a higher tick rate ensures the processor is utilized effectively.

Jiffies - is a global variable which holds the number of ticks since the system started. A 32-bit jiffy variable overflows in 497 days at 100 ticks per second. A 64-bit jiffy variable will outlast a persons lifetime.

Real Time Clock (RTC) is a hardware component that stores wall clock time. It tracks time even when the system is off. On x86, RTC and CMOS are integrated and powered by a single battery. On boot, the kernel reads the time from RTC. There is no subsequent use of RTC.

The system timer interrupt handler performs the following -
- reset the interrupt
- periodically save updated wall clock time to RTC
- increment jiffies count global variable
- update resource usage count, time values for current process
- run dynamic timers that have expired
- signal that scheduler should run

Timers/Dynamic Timers/Kernel Timers - is a mechanism to schedule a function to execute after a certain time period has elapsed i.e. this mechanism ensures execution after a period of time but not before a certain period. Timers are destroyed after they expire i.e. their time period elapses.

Timers are executed in the System Timer interrupt handler bottom half.

Busy Waiting/Looping is a term used to refer to the idea of wasting an integer multiple of CPU cycles in order to delay the execution of following instructions by a specific amount of time.
while (time_before(jiffies, delay))
;

The jiffies variable is defined as 'volatile' to ensure that the C compiler loads it on each iteration of the loop instead of optimizing the read once for the loop.

Bogo MIPS - Bogus Million of Instructions Per Second - is a measurement of how fast a processor can do nothing! This value is computed at boot and stored in /proc/cpuinfo

For delays less than one tick (i.e. for 100Hz less than 10ms) the kernel provides three functions for delays in milliseconds, microseconds and nanoseconds. These functions make use of BogoMIPS to determine the number of busy loop iterations the processor can perform in a period.

Linux Kernel Development by Robert Love - 2