Linux Kernel Overview

This section gives some background for the rest of the document: a very brief overview of the Linux kernel and the key concepts needed to understand its fundamental operation.

Introduction

The kernel can be seen as the heart of an operating system. Loaded into RAM at boot time and remaining resident until power down, it has two main responsibilities:

  • To service low level hardware programming requirements (e.g. responding to hardware interrupts).

  • To provide an environment for processes; instances of programs or threads in execution.

The Linux kernel is said to be monolithic: a single large executable consisting of several logically divided components.

Kernel Modes

The CPU can operate in one of two modes: user or kernel. Program execution mainly takes place in user mode, with no direct access to kernel data structures or hardware devices. A switch to kernel mode can be triggered by:

  • A program issuing a system call (a request to the kernel, usually made through a library function); see the short example after this list.

  • A CPU signaling an exception (an anomalous condition that requires special attention e.g. divide by zero).

  • An interrupt issued to the CPU by a hardware device to indicate that it requires attention.
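
For example, the first of these triggers can be seen from user space: each call below causes a switch to kernel mode and back. This is a minimal sketch using the standard getpid() wrapper and the generic syscall() interface provided by the C library.

    #include <stdio.h>
    #include <unistd.h>      /* getpid() wrapper and syscall()           */
    #include <sys/syscall.h> /* SYS_getpid for the raw syscall interface */

    int main(void)
    {
        /* The usual route: a C library wrapper issues the system call. */
        pid_t pid = getpid();

        /* The same request made directly through the generic syscall()
         * interface; both calls switch the CPU into kernel mode.        */
        long raw = syscall(SYS_getpid);

        printf("getpid() = %d, syscall(SYS_getpid) = %ld\n", (int)pid, raw);
        return 0;
    }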

The kernel spends much of its time in kernel mode, running on behalf of a user process. However, several threads are executed in kernel mode on behalf of the kernel itself, carrying out "housekeeping" activities. Once the pending operation in kernel mode is complete, the kernel switches back to user mode.

Modules

The kernel is capable of dynamically loading additional portions of code (modules) on the fly, to enhance its functionality. Amongst other things, modules can add support for file systems or specific hardware devices. When the functionality provided by a module is no longer required, the module can be unloaded, freeing memory.

Processes

A process is an instance of a program in execution. Every process has:

  • A state, either runnable, interruptible, uninterruptible, stopped or zombie.

  • A context, a snapshot of all CPU registers (PC, SP, PSW, general purpose, floating point & memory management).

  • A process descriptor, a data structure (of type task_struct) that holds all the information associated with a process; a simplified sketch follows this list.
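
As a rough illustration of what a process descriptor holds, here is a heavily simplified sketch; the real task_struct in the kernel sources is far larger, and the field names below are illustrative only.

    /* Illustrative only -- not the kernel's actual task_struct definition. */

    /* A saved snapshot of the CPU registers (the process context). */
    struct cpu_context {
        unsigned long pc;       /* program counter                  */
        unsigned long sp;       /* stack pointer                    */
        unsigned long psw;      /* processor status word            */
        unsigned long regs[8];  /* general purpose registers, etc.  */
    };

    /* One of these per process. */
    struct task_struct_sketch {
        long state;                       /* runnable, interruptible, ...  */
        struct cpu_context context;       /* saved register snapshot       */
        int pid;                          /* process identifier            */
        struct task_struct_sketch *next;  /* linkage into the task list    */
    };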

The kernel provides a multiprogramming environment; many processes can be active simultaneously. Each process contends for the various hardware resources available, and the kernel must ensure that the resources are shared appropriately. Multiprogramming is supported by giving each process in the runnable state queue an opportunity to run on the CPU in turn. The process that "owns" the CPU at a particular instant is referred to as current. The procedure for swapping between runnable processes is termed a context switch: the context (a snapshot of the CPU state) of the current process is saved and the context of the next runnable process is loaded. A context switch can only occur while a process is running in user mode, or once a process in kernel mode has voluntarily relinquished the CPU; the kernel cannot perform an immediate context switch at an arbitrary point and is therefore termed non-preemptive.

Each user process runs in its own address space, an assigned portion of the total memory available. Address spaces (or parts thereof) may be shared between processes upon request, or automatically if the kernel deems it appropriate. The separation of process address spaces prevents one process from interfering with the operation of another, or with the operating system as a whole.

In addition to the normal user processes running on a system, several kernel threads are created at system startup and run permanently in kernel mode, carrying out various housekeeping functions for the kernel.

Synchronisation

The kernel is reentrant; several processes may be executing in kernel mode at one time. Of course, on a uniprocessor system only one process can make progress, with all others blocked and waiting in a queue. Example: A process requests a file read. The Virtual File System translates the request into a low level disk operation and passes it to the disk controller, on behalf of the process. Instead of waiting until the disk operation is complete (many thousands of CPU cycles later), the process voluntarily gives up the CPU after making the request and the kernel allows another waiting process to run on the CPU and make progress in kernel mode. When the disk operation is complete (signaled by a hardware interrupt), the current process gives up the CPU to the associated interrupt handler and the original process is woken up, resuming where it left off.

In order to implement a reliable reentrant kernel, care must be taken to ensure the consistency of kernel data structures. If one process modifies a resource counter "behind the back" of another waiting process, the result could be potentially disastrous. The following steps are taken to prevent this type of occurrence:

  • One process may only replace another in kernel mode if it has voluntarily relinquished the CPU, leaving data structures in a consistent state, hence the kernel is termed "non-preemptive".

  • Interrupts may be disabled during critical regions; areas of code that must be completed without interruption.

  • Use of spin locks and semaphores to control access to data structures.

Semaphores consist of the following:

  • A counter variable (an integer), initialised to 1.

  • A linked list of processes waiting to access the data structure.

  • Two atomic methods up() and down() which increment and decrement the counter respectively.

When a kernel control path accesses a data structure protected by a semaphore, it calls down(). If the new value of the counter is not negative, access is granted; if it is negative, the process is blocked and added to the semaphore's linked list of waiting processes. Similarly, when a process has finished with the data structure, it calls up() and the next process in the waiting list gains access.
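
The following is a minimal user-space sketch of this mechanism, built on POSIX threads. It illustrates the counter semantics described above; it is not the kernel's struct semaphore, and the condition variable stands in for the kernel's explicit list of waiting processes.

    /* A user-space sketch of the down()/up() semantics, using POSIX threads.
     * Not the kernel implementation: the condition variable plays the role
     * of the semaphore's list of waiting processes.                         */
    #include <pthread.h>

    struct sketch_semaphore {
        int count;                 /* may go negative; -count = number of waiters      */
        int wakeups;               /* wake-ups handed out by up() but not yet consumed */
        pthread_mutex_t lock;      /* protects count and wakeups                       */
        pthread_cond_t  wait_list; /* where blocked callers sleep                      */
    };

    void sketch_sem_init(struct sketch_semaphore *s)
    {
        s->count = 1;              /* initialised to 1, as described above */
        s->wakeups = 0;
        pthread_mutex_init(&s->lock, NULL);
        pthread_cond_init(&s->wait_list, NULL);
    }

    void sketch_down(struct sketch_semaphore *s)
    {
        pthread_mutex_lock(&s->lock);
        s->count--;                            /* decrement the counter        */
        if (s->count < 0) {                    /* negative: block until woken  */
            while (s->wakeups == 0)
                pthread_cond_wait(&s->wait_list, &s->lock);
            s->wakeups--;
        }
        pthread_mutex_unlock(&s->lock);
    }

    void sketch_up(struct sketch_semaphore *s)
    {
        pthread_mutex_lock(&s->lock);
        s->count++;                            /* increment the counter        */
        if (s->count <= 0) {                   /* someone is waiting: wake one */
            s->wakeups++;
            pthread_cond_signal(&s->wait_list);
        }
        pthread_mutex_unlock(&s->lock);
    }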

Precautions must be taken to avoid deadlock, where several processes each hold a resource while waiting on a resource held by another waiting process. If this chain of waiting processes forms a closed circle, deadlock is reached. For an in-depth explanation of deadlock see "The Dining Philosophers Problem" on the Internet or in any computer architecture textbook.

Signals and Inter Process Communication

A signal is a short message, sent between two processes or between a process and the kernel. Two types of signal are used to notify processes of system events:

  • Asynchronous events (e.g. SIGINT, issued when the user types the Ctrl-C key sequence).

  • Synchronous errors or exceptions (e.g. SIGSEGV when a process attempts to access an illegal address).

There are about 20 different signals defined in the POSIX standard, some of which may be ignored. Some signals, such as SIGKILL, cannot be ignored and are not even handled by the process itself.
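
As a small user-space example, the program below registers a handler for SIGINT (the signal delivered when Ctrl-C is pressed) using the standard sigaction() interface, then waits for it to arrive.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigint = 0;

    /* Invoked asynchronously when the kernel delivers SIGINT. */
    static void on_sigint(int signo)
    {
        (void)signo;
        got_sigint = 1;
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigint;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);   /* register the handler */

        printf("Press Ctrl-C to send SIGINT...\n");
        while (!got_sigint)
            pause();                    /* sleep until a signal arrives */

        printf("Caught SIGINT, exiting.\n");
        return 0;
    }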

Linux uses System V Inter-Process Communication (IPC), which is made up of:

  • Semaphores (requested via semget() system call).

  • Message queues (received via msgget(), sent via msgsnd() system calls).

  • Shared memory (requested via shmget(), accessed via shmat() and relinquished via shmdt() system calls); a short example follows this list.
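
A brief sketch using the last of these, System V shared memory (error handling omitted for clarity):

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Request a 4kB shared memory segment from the kernel. */
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);

        /* Attach the segment into this process's address space. */
        char *data = shmat(shmid, NULL, 0);

        strcpy(data, "hello via System V shared memory");
        printf("%s\n", data);

        /* Detach the segment and mark it for removal. */
        shmdt(data);
        shmctl(shmid, IPC_RMID, NULL);
        return 0;
    }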

Memory Management

Linux uses virtual memory, a level of abstraction between process memory requests (linear addresses) and the physical addresses used to fulfill them. It makes the following possible:

  • Processes whose total memory requirements exceed the physical RAM available can still run.

  • A contiguous address space, independent of physical memory organisation.

  • Demand paging; portions of data or code are only loaded into RAM when required.

  • Shared images of programs and libraries, making memory use more efficient.

  • Transparent relocation of running programs in memory.

The address space is divided into 4kB portions called "pages", which form the basic unit of all virtual memory transactions. Physical RAM is also divided into 4kB portions called "page frames", each of which can store any arbitrary page. Because the total address space exceeds that of the RAM available, only a subset of all the available pages can be held in RAM at one time. However, a page must be present in RAM for it to be accessed as data or executed as a program.

Because any page can be relocated in any page frame, the kernel must keep track of where the used pages are kept. This is implemented using page tables, which are used to convert linear addresses into physical ones.

On Intel x86 hardware, Linux actually uses a two level page table scheme (but uses a three level scheme internally to improve portability) to reduce the amount of memory taken up by page tables. To convert a linear address into a physical one, the tables are consulted in this order: Page Global Directory then Page Table to yield a page number and an offset within the page. Therefore, a linear address can be broken down into three parts: Directory, Table and Offset. Because Linux 2.2 can address 4GB of address space (using 32 bit addresses) and uses a 4kB page size, the 10 most significant bits of an address make up the Directory, the next 10 most significant bits make up the Table (hence identify the page required) and the 12 least significant bits make up the offset from the start of the page.
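
As a worked example, the arithmetic below splits a 32-bit linear address into those three parts. This is purely illustrative user-space code, not kernel code.

    #include <stdio.h>

    int main(void)
    {
        unsigned long linear = 0xC0101234UL;  /* an arbitrary 32-bit linear address */

        unsigned long dir    = linear >> 22;           /* top 10 bits: Page Global Directory index   */
        unsigned long table  = (linear >> 12) & 0x3FF; /* next 10 bits: Page Table index             */
        unsigned long offset = linear & 0xFFF;         /* bottom 12 bits: offset within the 4kB page */

        printf("directory %lu, table %lu, offset 0x%lx\n", dir, table, offset);
        return 0;
    }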

Virtual File System

The VFS is responsible for providing a common interface to all the underlying file systems present on a system: the "common file model", which is capable of representing files on any type of file system. The file systems supported by the VFS fall into three categories:

  • Disk based, including hard disk, floppy disk and CD-ROM.

  • Network based, including NFS, AFS and SMB.

  • Special file systems, including /proc and /dev/pts.

The common file model can be viewed as object-oriented, with objects being software constructs (data structures and associated methods/functions) of the following types (a sketch of this style in C follows the list):

  • Super block object; stores information relating to a mounted file system, corresponding to a file system control block stored on disk (for disk based file systems).

  • Inode object; stores information relating to a single file, corresponding to a file control block stored on disk (for disk based file systems).

  • File object; stores information relating to the interaction of an open file and a process. This object exists while a process is interacting with a file.

  • Dentry object; links a directory entry (a pathname) with its corresponding inode on disk.
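
In C, the "methods" of these objects are implemented as tables of function pointers that each concrete file system fills in. Below is a simplified sketch of the idea; the kernel's real operation tables (such as struct file_operations) have many more members and different signatures.

    /* A simplified illustration of the VFS object style: a data structure
     * whose "methods" are function pointers supplied by the underlying
     * file system.  Not the kernel's actual definitions.                  */
    #include <stddef.h>

    struct file;   /* stands in for the VFS file object */

    struct file_operations_sketch {
        long (*read)(struct file *f, char *buf, size_t count);
        long (*write)(struct file *f, const char *buf, size_t count);
        int  (*open)(struct file *f);
        int  (*release)(struct file *f);
    };

    /* Each file system provides its own table; the VFS calls through the
     * pointers without needing to know which file system is underneath.   */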

Recently used dentry objects are held in a dentry cache to speed up the translation from a pathname to the inode of the corresponding file. The dentry cache consists of two types of data structure:

  • Dentry objects in the following states: in use, unused or negative.

  • A hash table to speed up pathname to inode translation.

Disk Caches

**FIXME** This section needs updating for 2.4.x kernels.

Linux dynamically sets aside a certain proportion of the available memory for two disk caches: the buffer cache and the page cache. Use of caches increases system performance by minimising the number of time-consuming disk accesses required.

The Buffer Cache

The buffer cache is made up of many buffers, each of which refers to a single arbitrary block on a block device. Each buffer consists of a buffer_head data structure and an area of memory equal to the block size of the associated device, used to hold data.

To minimise the CPU overhead involved in maintaining the buffer cache, all the buffers are held in one of several linked lists. Each of the linked lists contains buffers in the same state: unused, free, clean, dirty, locked, etc.

In order to gain a significant performance benefit using a cache, the process of checking the buffer cache for a particular buffer must be as efficient as possible. Every time a call to read() is made, the buffer cache must be checked for the required block(s) first. To enable buffers to be found quickly, a hash table is maintained, containing all the buffers present in the cache. The getblk() function is the main service routine for the buffer cache; it performs the functions described above.
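
A rough sketch of the kind of lookup involved is shown below: hash the (device, block) pair and walk the collision chain. The names and structures are simplified for illustration and are not those used by getblk() itself.

    /* Illustrative sketch of a buffer cache lookup keyed on (device, block).
     * Simplified; not the kernel's getblk() implementation.                 */
    #include <stddef.h>

    #define HASH_SIZE 1024

    struct buffer_head_sketch {
        unsigned long dev;                     /* device the block belongs to */
        unsigned long block;                   /* block number on that device */
        char *data;                            /* the cached block contents   */
        struct buffer_head_sketch *hash_next;  /* collision chain             */
    };

    static struct buffer_head_sketch *hash_table[HASH_SIZE];

    static unsigned long hash(unsigned long dev, unsigned long block)
    {
        return (dev ^ block) % HASH_SIZE;
    }

    /* Return the cached buffer for (dev, block), or NULL on a cache miss. */
    struct buffer_head_sketch *find_buffer(unsigned long dev, unsigned long block)
    {
        struct buffer_head_sketch *bh;

        for (bh = hash_table[hash(dev, block)]; bh != NULL; bh = bh->hash_next)
            if (bh->dev == dev && bh->block == block)
                return bh;
        return NULL;   /* miss: the block would now be read in from disk */
    }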

The buffer cache can also be used to improve the disk writing performance of the system. Instead of carrying out all writes immediately, the kernel stores data to be written in the buffer cache, waiting to see if the writes can be grouped together to minimise disk head movement. A buffer that contains data waiting to be written to disk is termed "dirty"; a field in the buffer_head data structure indicates whether a particular buffer is dirty or not.

The Page Cache

The page cache is made up of pages, each of which refers to a 4kB portion of data associated with an open file. The data contained in a page may come from several disk blocks, which may not be next to each other on the disk. The page cache is largely used to interface the requirements of the memory management subsystem (which uses fixed, 4kB pages) to the VFS subsystem (which uses variable size blocks).

The page cache has two important data structures, a page hash table and an inode queue. The page hash table is used to quickly find the page descriptor of the page holding data associated with an inode and offset within a file. The inode queue contains lists of page descriptors relating to open files.

The three main service routines for the page cache are find_page(), add_to_page_cache() and remove_inode_page().

Special care must be taken to synchronise the two caches, to prevent processes from receiving stale data. Should the kernel become short on memory, memory can be reclaimed by emptying the disk caches of old, unused data. This task is performed by a dedicated kernel thread.

Your Turn

Head over to **FIXME** lxr and look up some of the functions mentioned in this chapter. Get used to using lxr and moving around the source code. If possible, try and understand the code, but don't panic if it doesn't make a lot of sense at this stage!