Glossary

This part of the document will remain a random to-do list until I figure out how to do a glossary properly in SGML!

Please send suggestions.

Glossary

2Q algorithm

MM algorithm based on two areas, one managed as a FIFO queue, and one as an LRU list.

8259 PIC

Outdated Programmable Interrupt Controller present on Intel hardware.

ABI

Application Binary Interface; the interface of passed structures between the user processes (and libraries) and the kernel. For compatibility, it is important that these remain as static as possible (i.e. making sure that variables and structure members have the same bytesize as before, and in the same ordering). Occasionally breakage is necessary, requiring re-compilation of the user-space sources (note that this does not affect source-compatibility; that is a separate issue).

ACPI

Advanced Configuration and Power Interface - replacement for APM that has the advantage of allowing O/S control of power management facilities.

AGI

Address Generation Interlocking, on x86. When execution of an instruction requires an address resulting from a non-completed instruction, the CPU must wait - this is known as an AGI stall.

AGP

Accelerated Graphics Port, on x86 boxes.

anonymous

Generally, used for something which doesn't have the usual associated object. For example an anonymous address space is not interested in user address space (that is, no process context). Some common ones are:

Anonymous page

A page of memory that is not associated with a file on a file system. This can come from expanding the process's data segment with brk(), shared memory segments, or mmap() with a MAP_ANON or MAP_PRIVATE flag. MAP_PRIVATE, although it maps in data from a file, is considered anonymous because any changes do not get written back to the file (any dirty pages have to be moved to swap if the page is freed from main memory).

Anonymous buffer

The buffer cache contains buffers of data on their way to/from the disk. An anonymous buffer is not associated with a file. One example of this is data from a deleted file - it will not be written to any file, but is kept around until it is flushed.

APIC

See local APIC and IO-APIC.

APM

Advanced Power Management, power management standard superseded by ACPI. APM and SMP just don't mix.

ARP

This is an acronym for the Address Resolution Protocol and this is how a network machine associates an IP Address with a hardware address.

ASN.1

Abstract Syntax Notation, a protocol for structured data, used, for example, in the Q.3 management protocol.

ast

Professor Andrew S. Tanenbaum, writer of MINIX and several essential O/S books.

ATAPI

ATA Packet Interface, used by most CD-ROMs, and other devices.

balancing

Technique used in the VM code, referring to balancing various parameters such as the number of pages currently free, to avoid thrashing and other bad memory capacity artefacts. See zones, kswapd bug.

BAR

Base Address Registers, for PCI devices.

BCD

Binary-Coded Decimal - see a textbook.

bigmem

See highmem.

big lock

kernel_lock, which locks the entire kernel from entry (no other task may run in the kernel code). It is recursive per process and dropped automatically when a process gives up the CPU, then regained on wake-up, in contrast to other spinlocks.

bit error

Used colloquially to mean a single bit error in some memory address. Often due to faulty memory (ECC memory can correct single bit errors). Often results in fake Oopsen, with addresses like 0x0008000. Also seen are values some small offset from zero, plus a bit error, which is where the value passed a NULL check due to the bit error, and then the kernel tried to access a structure member by means of the pointer, leading to the offset.

block bitmap

In UNIX-like filesystems, the usage of disks blocks is recorded in the block bitmap, where each set bit indicates a specific allocated block.

bottom-half handler

A set of standard kernel threads that execute tasks on a queue that have been registered with that type of bottom-half handler for execution. The code is run on return to user space or at the end of a hardware interrupt. In 2.3.43 a more general solution with softirqs and tasklets was implemented. Sometimes abbreviated to "bh", which should not be confused with buffer head, which is also abbreviated to "bh".

bounce buffer

An intermediate buffer. Used for example, in "faking" alignment to a client from non-aligned resources.

brlocks

Big-reader locks, used when there are many contending for read access to a resource, and very few contending for writes (thus the balance is towards very fast read locking, and very slow write locking).

BSP

BootStrap Processor, or the CPU which enables the other CPUs in an SMP system.

bss

Block storage segment. This is the memory mapping section containing the data allocated for a binary image at execution time. Also known as "Block Started by Symbol" and "Bull-Shit Storage".

BTB

Branch Target Buffer - on x86 processors, the cache of recent conditional jump results.

buddy allocator

The memory allocation scheme used in the kernel. A vector of lists of free pages is kept, ordered by the size of the chunk (in powers of two). When a chunk is allocated, it is removed from the relevant list. When a chunk is freed back to the free pages pool, it is placed in the relevant list, starting from the top. If it is physically contiguous with a present chunk, they are merged and placed in the list above (i.e. where the chunks are twice the size), and this operation percolates up the vector. As regions are merged whenever possible, this design helps to reduce memory fragmentation. FIXME

buffer cache

The buffer cache is a hash table of buffers, indexed by device and block number. LRU lists are maintained for the buffers in the various states, with separate lists for buffers of different sizes. With 2.3's unification of the buffer and page caches, each buffer head points to part or all of a page structure, through which the buffer's actual contents are available. FIXME

buffer head

A structure containing information on I/O for some page in real memory. A buffer can be locked during I/O, or in several other states depending on its usage or whether it is free. Each buffer is associated with one page, but every page may have several buffers (consider the floppy on x86, where the I/O block size is 512 bytes, but each page is commonly 4096 bytes).

BUG()

Used in kernel code in tests for "impossible" conditions. Signify a kernel bug or faulty hardware.

bus mastering

Giving a card on a bus (e.g. ISA,PCI) the ability to read/write directly to main memory. This is how DMA is performed on PCI busses.

byte sex

Endianness.

cache affinity

Where the cache of a CPU represents the current memory set used by a task, there is said to be cache affinity with that task. A good thing if the task is regularly scheduled on that CPU. See processor affinity.

cache coherency

On an SMP system, ensuring that the local memory cache of each CPU is consistent with respect to the values which may be stored in other CPUs' caches, avoiding coherency problems such as the "lost update". This is achieved by the hardware in concert with the operating system.

cache line

A section of the hardware cache, around 32 bytes large. Kernel structures are often designed such that the commonly-accessed members all fit into one cache-line, which reduces cache pollution. Structures such as this are cache line aligned.

cache ping-pong

A hardware phenomenon in an SMP system, where two tasks on different CPUs are both accessing the same physical memory in a cache line. This means as each task runs, when it changes the memory, it must invalidate the other CPU's relevant cache line (to ensure cache coherency). Then, when the task on the other CPU runs, it must reload the cache line (as it's set invalid), before changing it. Repeat ad jocularum. A bad thing (TM). A common reason for putting a lock on a different cache line than the data mutexed by the lock : then the "other" task can grab and drop the lock without having to necessarily invalidate the cache line on the first CPU. FIXME

cache pollution

Where during execution of a task, another task is scheduled onto that CPU which disrupts useful lines of the current cache contents, which will be used soon. That is, cache pollution is a non-optimal situation where the other process would have been bettered scheduled on a different CPU or at a different time. The aim is to minimise the need to replace cache lines, obviously increasing efficiency.

call gate

x86 hardware support for mode switch to kernel (i.e. system call). In Linux, int 0x80 will trigger the call gate.

CAP_

These are defined names of capabilities for specific tasks provided by the kernel, e.g. CAP_SYS_NICE.*

chroot jail

A process under the aegis of a chroot() syscall is in a chroot jail, and cannot access the file system above its notion of root directory /.

cli/sti

x86 assembler instructions for disabling and enabling interrupts, respectively. There are CPU-local and global variants of these. Code running with interrupts disabled must be fast, for obvious reasons (this is called interrupt latency).

CML2

Eric Raymond's proposal for a replacement to the current kernel build system. See http://www.tuxedo.org/~esr/kbuild.

cold cache

A cache whose content is invalid or irrelevant with respect to some task to be run.

completion ports

I/O interface used in O/S's such as Windows NT. Userspace notifies the kernel of each file descriptor the program is interested. The O/S uses a callback for each fd to indicate that I/O is ready.

contention

Where two tasks each want an exclusive resource. You may hear talk of, for example, spinlock contention, which is where one or more tasks is commonly busy-waiting for a spinlock to become unlocked, as it is being taken by other tasks.

context switch

Refers to the changes necessary in the CPU when the scheduler schedules a different process to run on the CPU. This involves invalidating the TLB, loading the registers with the saved values, etc. There is an associated cost with such a switch, so it is best to avoid unnecessary context switch when possible. Note that the division of kernel-mode and user-mode means a similar, but simpler, operation is necessary when a syscall moves into kernel mode. However this is not called a context switch, as the mode switch doesn't change the current process. See lazy TLB. One good of feature of Linux is its extremely low context and mode switch cost, compared to an operating system like Solaris.

COW

Copy-On-Write, efficiency method where a page or other resource is shared until an attempt to write is made. In that case a copy is made, and the write is done to the copy.

CPL

Current Privilege Level. FIXME

critical path

A vital code path which should be optimised for the common case. Critical paths are executed frequently and form the important trunk routes of various kernel operations. An example would be buffer head manipulation during file I/O.

css

Code storage segment, aka text section. This is the memory mapping containing the executable code (text) for a binary image.

Directed Acyclic Graph

FIXME

dancing makefiles

An experimental new Makefile set up for configuring and compiling the kernel, written by Michael Elizabeth Chastain.

dcache

The cache of dentry structures. Under UNIX an entry in a particular directory must be searched for linearly, so even if the disk block containing the directory entry list is in-core, there is an associated cost. The dcache stores recent results of these searches which in general speeds up these disk searches by a large factor. Recent 2.3 work uses the dentries to allow multiple mounting, union mount, and more. The hardware data cache is usually referred to as the D-cache.

deadlock

Any of a number of situations where two or more processes cannot proceed because they are both waiting for the other to release some resource. FIXME(give good references).

delayed write

See write behind.

demand zero

In demand paging, where the page is to be zeroed when actually created (common case: bss segment of an executable image, which is uninitialised heap data for the executable). Also called ZFOD.

dentry

Directory entry, in-core structure defining a file's details: inode, parent dentry etc. Cached in a hash table indexed by hashed filename (see dcache).

DF

IP packet bit indicating it should not be fragmented. The remote host will return ICMP notifications if the packet had to be split anyway, and these are used in MTU discovery.

directory notification

Provides hooks for notifying tasks when the contents of a directory has changed. Note "contents" can refer to dentries, the file inodes, or even the file contents themselves (file notification).

DOD

Dial-On-Demand for PPP connections over a standard telephone line.

drop behind

In stream I/O conditions, data that has already been read and processed is not needed again. The VM ideally should recognise this and mark the used pages as unneeded, so they can be discarded first. This technique is called "drop behind".

dss

Data storage segment, aka data section. This is the memory mapping containing the initialised data for a binary image.

dual-issue

Processors such as the Pentium Pro, that can decode and execute two instructions simultaneously.

dupe

Abbrev. fr. duplication.

dword

Double word, i.e. 4 bytes on x86.

EA

See extended attributes.

eager coalescing

What the buddy allocator currently does, i.e. merge adjacent blocks as soon as possible.

edge-triggered interrupt

The interrupt is triggered by the rising or falling edge of the interrupt line. This makes IRQ line sharing difficult, as an edge may occur while an ISR is running, and it could be easily missed; to allow sharing level-triggered interrupts are usually used.

EIP

Extended Instruction Pointer. This register contains the PC value of a task, that is, it points to the next instruction to be fetched, decoded etc.

elevator algorithm

This algorithm, often used in disk accesses, keeps an ordered list of requests. When the current request on the disk (e.g. the disk block) has been satisfied, the next strictly greater request on the list is dealt with. When a new request arrives, it is inserted into the ordered list in position (e.g. if the new requested block number is less than the current handled request, it goes before it in the list). When reaching the end of the list, the elevator changes direction, and the situation is reversed.

EPIC

Explicitly-Parallel Instruction set Computing, an instruction set architecture where every dependency for an instruction is encoded into the instruction itself. This has the potential to be faster as the compiler can encode the data dependencies in the instructions.

exponential back-off

A general algorithm for dealing with contention cases; for example, collisions on a network bus, or contention for a spinlock.

extended attributes

Also known as multi-part or multi-stream files, files with extended attributes deviate from the principle of files being a simple single data stream. An example of extended attributes is the Macintosh's "resource fork", which is associated with a specific file (known as the "data fork").

fair scheduler

A scheduler which ensures fairness between users, such that a user's process count and associated cost only impacts that user, rather than the whole system as currently. Rik van Riel and Borislav Deianov have both produced different patches to implement this.

false sharing

On SMP caches, when two parts of single block are accessed, neither of which collide with the other, the cache coherency protocol may not be able to detect this, and mark the block as "shared" even when it isn't. This is known as false sharing.

fastpath

The code path most commonly taken, often optimised heavily at the expense of less frequently-taken blocks of code. This is the reason you see so many gotos in core functions - it produces common-path code far more efficient than an optimising compiler can manage.

fd

file descriptor.

filemap

The mapping of a file's contents into memory.

fixed mmap

A user-space request for a mmap starting at a fixed virtual address. Generally not useful or guaranteed to work; a notable exception is overlayed mmaps, where a mmaped area has further mmaps of different types at fixed positions in the map.

FQDN

Fully-Qualified Domain Name, e.g. martyr.darrenemerson.co.uk.

GART

For AGP setups, Graphics Aperture Relocation Table.

gdoc

GNOME's source documentation system (similar to javadoc). Available by CVS from gnome. Kernel driver interface descriptions, built from source using gdoc, are currently being written in 2.3.

gdt

Global Descriptor Table. Something to do with x86 memory segmentation I think (FIXME). See ldt.

get

In the kernel, often means "get a reference to". This may be as simple as incrementing a usage count, or it may imply attempting to retrieve an object from a cache of some sort, or allocating kernel memory. See put.

GKHI

Generalised Kernel Hook Infrastructure, an IBM patch to implement hooks into the kernel code.

group descriptor

On-disk filesystem structure, containing information for a block group, such as the inode bitmap and block bitmap.

HID

Human Interface Device (for USB).

highmem

On the x86 platform, memory addresses accessed using Intel's Physical Address Extension, which allows real memory of up to 64Gb on IA32. Only certain things can be stored in this space (FIXME: only userspace ? or what ? no bh or irqhandler can use it ...)

hton

Possibly byte-swapping conversion from host-endian to network-endian format. The network is big-endian.

IBCS

A standard for executable image interfacing (syscalls etc.). This is orthogonal to binary image file format standards such as ELF.

icache

On lkml, this almost always refers to the inode cache, rather than the hardware instruction cache (I-cache).

IETF

Internet Engineering Task Force, a standards organisation for protocols used on the Internet.

IKD

Integrated Kernel Debugger. A patched version of the kernel containing additional facilities for debugging the kernel.

ILP

Instruction-Level Parallelism, i.e. executing more than one instruction at once. See dual-issue.

incestuous

What you might expect: code with knowledge of used code's internals beyond the API. This breaks modularity, increases complexity and causes bugs, and a design goal is to minimise this behaviour.

inode

As per UNIX, the structure defining a file (except for the filename which is stored in the directory entry). The inode cache is a closed hash table indexed by superblock and inode number, and is only hit when the dentry lookup failed.

inode bitmap

In UNIX-like filesystems, the usage of on-disk inodes is recorded in the inode bitmap, where each set bit indicates a specific allocated inode.

insn

Abbrev. fr. "instruction".

interrupt context

This is code executing in the kernel in response to a generated interrupt, without a process context. For more time-consuming work, a bottom-half handler task can be added to be executed at some later time.

interrupt gate

x86 hardware support for hardware interrupts. Interrupt gate facilities are provided to allow a known entry path from untrusted privilege levels to the interrupt routines.

IO-APIC

Receives interrupts external to a CPU (e.g. from a device) and routes them through the local APIC to the CPU.

iopl

I/O privilege level system call and bit in x86 EFLAGS register.

IPI

Inter-processor Interrupt. On SMP machines this refers to an interrupt sent between CPUs, indicating some event that the other CPU needs to be aware of, for example smp_invalidate_interrupt, which invalidates a CPU's TLB.

ISO9660

On-disk format often used for CDs. See also Joliet.

ISR

Interrupt Service Routine, or interrupt handler. Also, on x86 APICs, In-Service Register, confusingly enough.

jiffy

Basic packet of kernel time, around 10ms on x86. Related to HZ, the basic resolution of the operating system. The timer interrupt is raised each 10ms, which then performs some h/w timer related stuff, and marks a couple of bh's ready to run if applicable.

Joliet

Microsoft extension to ISO9660, commonly used.

journalling

A filesystem that writes all changes to some log file, to preserve filesystem consistency (for example, an inode's link count should equal the number of directory entries pointing to that inode). There are two types; FIXME log only metadata such as link count, whereas FIXME logs both meta-data and data, effectively writing all data twice. ext3, ReiserFS and XFS are all meta-data loggers, the second type are rare as the same effect can be achieved using RAID.

kiobuf

FIXME

kiovec

This allows user-space memory access from other contexts, such as in a bh. This also allows kernel drivers to set up DMA from a device directly to user-space pages.

kmalloc

Kernel memory allocation routine. See vmalloc. kmalloc() ensures physical address contiguity.

kswapd bug

Problems with VM balancing in late 2.3 kernels. People in a variety of situations found that the VM subsystem was far from optimal, leading to the kernel swapping daemon kswapd taking an inordinate amount of CPU resources; thrashing was also a common problem. Work is undergoing to fix these scenarios.

L1

See L2.

L2

The L2 (level 2) cache is a memory cache in the order of half a megabyte wide, located on the motherboard. The much faster L1 cache is in-CPU, but correspondingly smaller. x86 has a two-level cache, while some architectures such as Alpha AXP have an additional L3 cache. Often referred to is the memory hierarchy, which goes something like (with increasing latency and capacity, decreasing cost) registers, L1 cache, L2 cache, L3 cache, main memory, local mass storage device, network.

lazy TLB

Some tasks, such as kernel threads like kswapd, don't have a process context, which renders the costly invalidation of the TLB unnecessary (as the user-level page tables are not used). The task structure member mm indicates the user-level address space, which will be NULL for kernel threads. In this case active_mm indicates the used address space (kernel). These address spaces are known as anonymous. Note also that certain tasks may occasionally temporarily drop the user-level address space and have an anonymous address space for a short while.

ldt

Local Descriptor Table. The ldt is a per-process memory management table used on x86. See gdt.

lease

General term used in unreliably-connected components. A resource is given a lease which eventually expires. This means the connection can break, without leaving state hanging. NFS, being stateless, does not have the concept of leases.

level-triggered interrupt

Also level-sensitive. As used on PCI. PCI's ability to have multiple devices sharing a single interrupt line is made possible by its use of level-triggered interrupts. If interrupts were edge-triggered, it is possible that some of the incoming interrupts might be lost, for example if a second interrupt were to come in while the processor was still in the midst of processing a previous one, the processor would return from servicing the first interrupt but not be aware that a second interrupt had been asserted. With level-triggered interrupts, by contrast, the interrupt line remains active until specifically deactivated by an ISR. This does mean, however, that if a buggy routine forgets to reset the interrupt line after agreeing to deal with the interrupt, then the machine may hang in a loop of constant interrupts.

LFS

Large File Support, specifically files larger than 2Gb on 32-bit systems.

LFU

Least Frequently Used, another MM technique. The tension between recency of access and frequency of access of a an area of memory is one of the crucial parts of any memory management system.

libc

The standard C libraries. Remember that the kernel build makes no use of libc !

LIP

Loop Initialization Primitive, related to target IDs on SCSI busses.

livelock

When tasks are fighting for an exclusive resource, but are also ready to run. For example both tasks may be in a loop checking for a resource's availability.

lkml

The Linux-Kernel Mailing List. See the FAQ for details.

local APIC

On-chip interrupt controller provided on P6 and above Intel CPUs. Linux uses the timer interrupt register if a local APIC is available to provide its timer interrupt (is this true ?). The local APIC is part of a replacement for the old-style 8259 PIC, and receives external interrupts through an IO-APIC if there is one.

loopback device

The fake net device representing the local host net interface, lo. This can also refer to file loopback mounts.

loopback mount

There are two types of mount called loopback mounting. First is the file loopback mount, where a single file on a filesystem itself containing a filesystem is mounted (useful in constructing ISO9660 images for one). The other type, which is possible in 2.4, is "true" loopback mount, where you can do something like mount -t bind /safebin /chrootjail/bin.

LRU

As per the CS literature, Least Recently Used. A selection algorithm where a list is kept, ordered by last usage of each element. Often used in page replacement algorithms. Also Most Recently Used and LFU.

LUN blacklist

This is a blacklist of SCSI devices that do not properly handle probes with a Logical UNit number other than zero. Some mischievous devices respond to all LUNs, while others can hang with a LUN more than zero. LUNs are used in CD changers amongst other devices.

LVM

Logical Volume Manager. This allows several physical partitions to be represented as a single block device, amongst other things.

major fault

A major page fault occurs when an attempt to access a page not currently present in physical memory was made. The page must be swapped in to physical memory by the fault fixup code.

mb

Memory barrier; ensures on SMP systems that each CPU's view of memory is the same, i.e. enforces order. rmb() ensures reads are done in order, and wmb() ensures writes are done in order. Placing a barrier after some code basically makes sure that all previous memory reads or writes are done (as CPUs can re-order usually to increase performance). (FIXME: this needs to be more clear).

medium-often race

A race condition scenario. FIXME

memory rusting

Name for problems with mm code in v2.1, where pages got rusted and stuck in memory when they should really have been swapped out. Was a problem on small-memory machines. Now fixed (right ?)

MOESI

The name for a superset of cache coherency protocols. There are five possible states for a cacheline to be in:

  • Modified: The data in the cacheline has been changed.

  • Owned: The cache owner has exclusive rights to the contents of this data.

  • Exclusive: The cacheline is not present on any other caches.

  • Shared: The cacheline is shared with another cache, but does not need writing back to memory.

  • Invalid: The cacheline is invalid and does not contain useful data.

This protocol is described in more detail in the Schimmel book.

metadata

Information about a file, such as permissions, in contrast to the actual data in the file.

minor fault

A minor page fault occurs when an attempt to access a page present in physical memory, but without the correct permissions. An example is the first write to a second reference to a shared page, when the kernel must perform the copy-on-write and allow the task to update the copied page.

MMIO

Memory-mapped I/O, that is I/O memory (memory on hardware devices) accessible through a memory mapping. Also known as an I/O region.

monotonic

Technical term, meaning very roughly "new data doesn't affect existing data". Used to mean a wide range of different things. For example, a monotonic clock is one which never gives a time prior to the time given previously (that is, sequential reads of the clock always increase).

MSR

Model-Specific Register. For the x86 platform, registers that are not guaranteed as part of the Intel Architecture (i.e. could be not present in future CPU models).

MSS

The Maximum Segment Size is the largest quantity of data that can be transmitted at one time over a net interface.

mtrr

Memory Type Range Registers.FIXME: explain what they do.

MTU

The Maximum Transmission Unit is a parameter that determines the largest datagram than can be transmitted by an IP interface without it needing to be broken down into smaller units. Typical values are 1500 bytes for an ethernet interface, or 576 bytes for a SLIP interface.

MTU discovery

The process of discovering the MTU of a remote site that can be used without causing expensive fragmentation. See DF.

n/w

Used occasionally for "network".

Nagle's Algorithm

TCP algorithm for reducing network traffic in a connection. The local machine will accumulate outgoing data into packets until receiving a packet from the machine, and then the packets are sent. This way small amounts of outgoing data are amalgamated into the packet's larger payloads. Setting the TCP option NODELAY disables this behaviour. FIXME

nbd

Network Block Device - presents a local block device to a remote device.

negative entry

Meaning an error return rather than the expected entry from some vector or cache. A negative dentry is used when an NFS client tries to operate on a stale file, and with other situations. Without the negative dentry, a lookup is done on each request. This would allow a DoS attack by stealing CPU time doing lookups which will fail. So the dentry is kept around to provide a near-immediate "stale file" response to the client. See fs/dcache.c:d_delete(). Apparently positive dentries are not allowed to become negative. A negative dentry is one with a NULL d_inode field.

NMI

Non-maskable interrupt - highest-priority interrupt that can even interrupt standard h/w interrupt ISRs. The NMI watchdog is a kernel facility, running off NMIs, which checks CPUs for recent activity (in the last five minutes), to detect permanently stalled CPUs. NMIs are also delivered by motherboards in the case of serious hardware problems (like bad RAM).

nopage

A routine executed when a virtual address has been accessed when the page is for some reason not available in real memory. An example would be a swapped out page, hence the nopage operation would need to bring the page into real memory from the swap file.

nr_

"Number of", e.g. nr_free_pages(). Preferred to the "no_" form as it's not ambiguous with the negative case (compare no_free_pages()).*

ntoh

Possibly byte-swapping conversion from network-endian to host-endian format. The network is big-endian.

NUMA

Non-Uniform Memory Architecture. Usually on an SMP system, all memory beyond the caches costs an equal amount to reach for each CPU. In NUMA systems, some memory can be accessed more quickly than other parts. Later kernels have some NUMA support for preferring to use memory "nearer" to the CPU, rather than higher-latency distant memory. See Documentation/vm/numa for details.

NVidia

A graphics card manufacturer that produce proprietary, non-open-source/binary kernel module drivers. Complaints about this module go to NVidia; such complaints are not welcome on linux-kernel. Also check #nvidia on irc.openprojects.net.

O()

Notation from Computer Science literature for complexity of algorithms. An O(1) algorithm takes constant time; an O(n) algorithm takes time proportional to N, the number of data, and so on. Also used for space complexity, the number of memory "units" used in the algorithm.

OHCI

Compaq standard for USB controllers.

one-hand algorithm

VM algorithm. A one-handed clock algorithm has one "clock hand" sweeping through pages, performing ageing and other activities.

oob

In networking Out-Of-Band data. FIXME

oops

An oops is generated on some kernel errors. This may be due to kernel bugs or faulty hardware (especially bad RAM). Often the kernel can continue to operate mostly correctly after an oops. More serious or potentially dangerous errors can cause a kernel panic (the O/S freezes). This is sometimes done to protect a filesystem from incurring further harm. An oops report will appear in the kernel logs, but is only useful to developers after decoding it with ksymoops (see Documentation/oops-tracing.txt). Note also that klogd can corrupt oops reports by incorrectly decoding it automatically. It is recommended to pass the -x option to klogd on startup to prevent this from happening. The plural form is oopsen.

out-of-order execution

Some processors (such as Intel Pentium Pro and above) can decode and execute more than one instruction at a time in parallel. This means a sequence of instructions is not necessarily the order in which those instructions are completed. See mb().

PAE36

Physical Address Extension - Intel's 36-bit physical address extensions for highmem.

page ageing

Marking a page as "old" in some way, that is, it hasn't been used for some time. Old pages are preferred for swapping out (although this is a simplification of the page swap selection algorithm), and are aged by kernel threads via a clock algorithm. Confusingly, page->age==0 indicates an infinitely old page (FIXME: true ?). The pte's are also aged for selection for flushing from the page table to the page or swap cache.

page cache

The page cache caches pages mapped from files (e.g. mapping in an executable image).

page colouring

A system to ensure that virtually-contiguous pages are not mapped to the same cache lines, improving performance. FreeBSD does this, but Linux, after some discussion, doesn't (yet).

page stealer

A task which steals from a process's vm, taking pages for itself. An example is kswapd, which runs through the pages of the system swapping out pages to keep freepages.min number of free pages.

pessimal

By analogy with optimal, the worst situation possible in some context.

PFF

FIXME

pgd

Page Global Directory, or the abstracted top level of the multi-level page tables. See pmd, pte. On x86, there are only two real levels of page tables; this fact is abstracted away by macros in the source. Each level of page table deals with different sizes of memory. For example, this global directory may deal with areas 4Mb in size. Each entry will be a pointer to a lower table of a smaller-sized directory. On x86, the next level (middle) is non present in hardware, and is folded in to the pgd in the kernel code. The bottom level deals in pages directly (PAGE_SIZE). So the pgd is a directory of page tables. Code which traverses this structure (such as the bttv driver) is said to walk the page tables.

PGE

FIXME

PIC

Programmable Interrupt Controller. See 8259 PIC, local APIC, IO-APIC.

PIO

The old-fashioned transfer protocol for devices on an IDE bus. Now mostly used is DMA, and standards like UDMA.

pinning

A pinned page cannot be swapped out of main memory.

pmd

Page Middle Directory. The middle level of page tables.

poisoning

Used in the slab cache for debugging purposes. Slabs are filled with a special value after they have been freed in order to catch double frees.

POSIX

A standard specifying semantics and interfaces for a UNIX-like kernel interface, along with standards for user-space facilities. There is a core POSIX definition which must be supported by all POSIX-conforming OSes, and several optional standards for specific facilities; so you may see references to "POSIX shm" or "POSIX real-time signals".

PPPoE

PPP (point-to-point protocol, often used in modem up-links) running over Ethernet.

pre-emption

Involuntary switching of a CPU from one task to another. User-space is pre-empted by interrupts, which can then either return to the process, or schedule another process (a process switch will also occur when the process "voluntarily" gives up the CPU by e.g. waiting for a disk block in kernel mode).

Kernel mode tasks are never pre-empted (except by interrupts) - they are guaranteed use of the CPU until they sleep or yield the CPU. Some kernel code runs with interrupts disabled, meaning nothing except an NMI can interrupt the execution of the code.

pre-zeroing

Many pages (such as those for a process's BSS segment) need to be zeroed when accessed. As the kernel only takes the page on the page fault for these areas, it is possible to pre-zero some amount of pages before the page faults occur. FreeBSD does this, and there has been some inconclusive talk of including this feature in Linux. See demand zero.

printk

Obviously, the equivalent of printf for kernel code (remember, you can't use libc functions in the kernel). Can be called from interrupt context (although it does take the console spinlock) but floating-point cannot be used, as throughout the kernel.

priority inheritance

See priority inversion.

priority inversion

A situation where a low-priority task holds a resource which a higher-priority task is waiting for. This can occur with tasks scheduled with SCHED_OTHER or SCHED_IDLE, where a task with a higher scheduling priority will always be scheduled before in preference. This leads to deadlock as the low-priority task cannot be scheduled in order to release the lock, and the high-priority task can't continue without the lock on the resource. Priority inheritance is a putative solution, where the low-priority task temporarily takes on the priority of the other task, enabling its scheduling and release of the lock. However this approach can itself lead to deadlock (FIXME: is this really true ?).

processor affinity

Some OS's have system calls for binding a process to a particular CPU. The kernel ensures that when the process is re-scheduled it will be run on that processor. Linux currently doesn't support this and its relative merits are debatable.

processor ordering

The ordering of memory read and writes as seen from the CPU's point of view.

process context

Code executing in the kernel on behalf of a user process (from a syscall), is said to have a process context. Code not in a process context is not allowed to sleep or wait (see interrupt context).

process group

A set of user processes sharing the same group pid. One of these processes will be the group leader (same pid as pgid).

process suspension

When a VM is capable of completely suspending processes for a while. Also known as process swapping.

promiscuous

Of a network device, sucks all packets regardless of destination.

pte

Page Table Entry, or a value containing the physical address of a page along with associated bits indicating, for example, that the entry is valid and the related page is present in real memory.

put

In the kernel, means "drop a reference to". This indicates that whatever task was using the object previously is no longer interested. This may just decrement a usage count, or perform a more complicated reaping routine.

QOS

Quality of Service.

race condition

Generally, where the ordering of events can affect the outcome of a situation. These bugs are caused by incomplete mutual exclusion, so a task or interrupt can come in and change something vital under the feet of another task. They are notoriously hard to debug as they are often marginal (often only occur in pathological=="very unlikely" situations).

RAID

Redundant Array of Independent/Inexpensive Disks. See the standard literature.

RAS

Reliability, Availability & Serviceability.

rate limit

To avoid DoS attacks, some facilities have upper limits on how often they can trigger. Most trite example is logging.

raw I/O

I/O that does not go through the caches that Linux provides as standard to all block devices. This feature can be useful to database systems, where the database back-end has a better idea of the data access pattern than the kernel can, hence increasing performance.

read-ahead

Where I/O is performed before strictly required in an attempt to improve performance. This is performed by the Linux kernel, and often by hardware as well.

Readiness Event Queues

See completion ports.

red zone

Adding protection (e.g. read or write) to kernel pages, for use in detecting stack overflow and other situations. This often works by adding a guard area at the end of each buffer in the slab allocator, which is checked to make sure it hasn't been overwritten.

register spilling

FIXME

retired

On out-of-order processors, a retired instruction is one that has been executed and completed.

ring 0

ring 0 is the protection level provided by x86 hardware for kernel mode. Ring 3 is used in Linux for user mode, and rings 1 and 2 are unused. Ring transition to a higher protection level is only possible through gates. Interrupt gates are set up for transitions due to interrupts, and call gates are used for system calls.

ring buffer

A buffer of data which is of fixed size; when it fills, further data is placed back at the start of the buffer, overwriting the old data, in a "ring". Commonly used in device drivers.

root squashing

An NFS option whereby any actions performed by a remote client's root account are actually attempted by the user "nobody" on the server.

RX

Common abbreviation used in net code for "receive".

scatter-gather

An I/O mechanism, whereby devices can write to memory at several physically non-contiguous addresses (scatter), and compose reads from physically non-contiguous addresses in memory (gather).

semaphore

As per the literature, a mutual exclusion or resource management method where tasks sleep waiting for a semaphore to be granted to them. Later versions of the Linux kernel also provide a more complicated type, the read-write semaphore, which allows a task to specify its intentions with regard to the resource. Hence several readers are allowed for the resource, but a writer will have mutual exclusion.

session

Sessions are related to terminal handling. Sessions can have a controlling terminal; by becoming a session group leader, a daemon process can detach itself from any controlling terminal. A session group is a set of processes created by the session leader. FIXME

set_fs

The set_fs() facility enables the copy_*_user routines to access kernel (KERNEL_DS) or user (KERNEL_FS) memory. This is required in kernel code that uses functions that expect to copy data from user-space, so that the correct segment is used (i.e. copying from kernel memory rather than user space memory).

shm

Shared memory. This is memory accessible through more than one processes' virtual address map. There are standardised shm interfaces derived from System V, as well as POSIX.

silly delete

Silly-deleting in NFS is used when somebody tries to delete a file that is held open by some other process. The deletion is delayed, and the file is instead renamed to something along the lines of .nfsxxxxxxxxx. The last process to close the file will then attempt to delete it.

slab allocator

Cache-based kernel memory allocation scheme used by kmalloc(). Bonwick's paper can be found at ftp://ftp.bitmover.com/pub/slab.ps. When actually requiring new pages, __get_free_pages() is called which uses alloc_pages(), with the standard buddy allocator.

softirq

Interrupts run on return to user space or after a hardware interrupt, and there is a fixed number available. See tasklet, bottom-half handler. These methods are usually used to run kernel code resulting from a hardware interrupt of some kind, where the necessary operations take too long to execute in hardirq context. A notable case is routines queuing on TIMER_BH which are executed depending on the hardware timer tick. These are different from DOS soft interrupts as they don't make use of hardware soft interrupt support (int instruction), although these types of soft interrupt are used in kernel code, for example kernel_thread().

SO_LINGER

FIXME

spinlock

A busy-wait method of ensuring mutual exclusion for a resource. Tasks waiting on a spin-lock sit in a busy loop until the spinlock becomes available. On a UP (single processor) system, spinlocks are not used and are optimised out of the kernel. There are also read-write spinlocks, where multiple readers are allowed, but only one writer. See Documentation/spinlocks.txt in the kernel source for details.

sti

See cli/sti.

superblock

An fs-specific data structure, stored on disk, that holds various bits of information about the fs on the disk, such as a pointer to the free blocks structure, the number of free inodes, etc. In the code, the inode manipulation routines are entered in super_operations vector of function pointers. Often abbreviated to sb.

SuS

The Single Unix Specification, a standard defined with the aim of standardisation of Unix-like OS semantics and APIs. There is also a second version of this standard, referred to as SuS v2.

SVR4

System V Release 4, as in the UNIX source that POSIX is mostly based on.

SYN cookies

Protection against TCP/IP SYN flooding DoS attack by storing state information about SYN packets from hosts. See ftp://koobera.math.uic.edu/syncookies.html.

system calls

The method by which a user process requests facilities provided by the kernel. Examples are wait4(2), open(2), and sched_setscheduler(2). By an architecture-specific method, the kernel is executed on behalf of the process in kernel mode at the entry point of the system call. Generally there is a convention in the source that syscall entry points are prefixed with sys_, for example, the entry point to open(2) is the function sys_open in the file fs/open.c. Code executing in kernel mode (that is, with access to the kernel's memory) from a syscall entry point is said to have a process context. On x86, all system calls executed from user mode use int 0x80 and came through the syscall call gate in arch/i386/kernel/entry.S.

tail merging

Tail merging is a technique which collects tails of multiple files together in one block to save space, reducing internal fragmentation, especially at larger block sizes.

task

The kernel abstraction for every user process, user thread, and kernel thread. All of these are handled as tasks in the kernel. Every task is described by its task_struct. User processes/threads have an associated user_struct. When in process context, the process's task_struct is accessible through the routine get_current, which does assembly magic to access the struct, which is stored at the bottom of the kernel stack. When running in kernel mode without process context, the struct at the bottom of the kernel stack refers to the idle task.

tasklet

"Software interrupt" routines running when the software interrupt is received at a return to user space or after a hardware interrupt. They do not run concurrently on multiple CPUs, but are dynamically allocatable. See softirq and bottom-half handler.

task queue

A linked list of processes or threads. Tasks, for example, can queue on bottom-half handlers waiting to be executed. See wait queue.

TC

FIXME

TCQ

Tagged Command Queuing. This facility allows the kernel to send several asynchronous requests at once to a block device. The device can service the requests as it sees fit; the advantage being that the device can arrange the transfers as best suits the device hardware. TCQ works for both reads and writes.

thundering herd

Consider a set of tasks on a wait queue, waiting for some condition. When the event occurs, it is often the case that only some of the tasks has real work to do. Older kernels always woke up all tasks on the wait queue - this meant that each task immediately vied for scheduling - the "thundering herd". More recent kernels have added the TASK_EXCLUSIVE flag to tasks in the wait queue. This ensures that one of the tasks with this flag set will be the only one to be woken up.

TLB flushing

Flagging part or all of a CPU's translation lookaside buffer (which is a virtual-address-indexed cache of physical addresses) as invalid. Depending on hardware, the whole TLB may be flushed at any one time, or just a part.

tmpfs

One of those cyclical lkml threads. O/S's such as Solaris have a "tmpfs" filesystem, providing a fs interface using main memory as storage. This differs from the usual ram disk in that it is swappable. The fs is often used for /tmp, which provides performance gains. Under Linux, ext2fs is already faster than Solaris' tmpfs, so no-one has ever implemented this.

trampoline

The process by which a compiler has special code to deal with code presented as data within a binary's segment, which is to be executed, amongst other scenarios. A short routine is called before the real procedure which sets up the stack environment correctly. This facility is used in languages such as Ada, and complicates provision of non-executable stack support at kernel level (useful for preventing simplistic buffer over-runs). FIXME

TSC

Time Stamp Counter, a very high resolution counter on the Intel platform.

two-hand algorithm

VM algorithm. A two-handed clock algorithm has two "clock hands" sweeping through pages, performing ageing and other activities. Imagine one hand ageing pages, whereas the later one decides to mark pages as unused.

TX

Common abbreviation used in net code for "transmit".

UART

Universal Asynchronous Receiver/Transmitter. UARTs are hardware devices that convert byte-wide data streams in to serial (bit-wide) data streams.

UBC

Unified Buffer Cache - the combination of the buffer and page caches to use the same memory pool (disk I/O and VM use the same pool), and data I/O utilises the page cache.

UHCI

Intel standard for USB controllers.

union mount

Each mount structure is represented in the kernel by a dentry tree. Sharing of a tree allows union mount, where two separate mounts are accessible under one mount point.

uop

For micro-operation, or the basic set of CPU operations which are generated from machine code in microcode processors. Intel P6 processors and above have the facility for upgrading the microcode (stepping of the processor).

URB

USB Request Block, or the message struct for Universal Serial Bus.

user-mode port

A patched version of the kernel modified to run on top of user-space. This is intended to be used for testing, experimentation and debugging. See http://user-mode-linux.sf.net.

UTSL

Use The Source, Luke!

V4L

Video 4 Linux

versioned filesystem

A filesystem that tracks changes to files, for example by storing the entire previous file on a change in data.

VFS

VFS is the filesystem-independent in-kernel interface used to access each particular filesystem supported by Linux.

VLIW

Very Long Instruction Word, instruction sets with large-sized complex instructions encoded into one instruction.

vmalloc

Allocates a virtual address space. Can be slow for some purposes as it requires a TLB flush on all processors. Does not ensure physical address contiguity, but can allocate arbitrary sized areas. Also requires (along with related vfree()) the big kernel lock.

VMWare

Commercial virtual O/S software. If the modules won't compile for you, DO NOT complain to linux-kernel ! And see FreeMWare/Plex86.

volatile

From the C keyword. Used throughout the source and in discussion, signifying a variable or inline asm that cannot be optimised as usual by the compiler. This usually means the compiler shouldn't store the variable in a register, as another task can alter the value of the real memory contents at the time. For example, this is necessary with I/O address spaces, where the hardware device can fill up a buffer asynchronously to a reading task. Variables which are changed from both interrupts and other contexts are also volatile.

wait queue

A linked list of tasks (processes or threads) waiting on some specific condition. For example, a locked page may have a wait queue of tasks waiting for I/O on that page to complete (that is, !PageLocked(page)).

wall time

Real time, that is, the sort you would expect to find on a clock on your wall...

wandering logs

On journalling filesystems, the metadata must be logged. A wandering log system spreads log blocks throughout the filesystem for performance, rather than using a static log area on the disk.

warm cache

A cache whose contents are not invalid and are useful to the task to be run next. It is important to benchmark code in both cold-cache and warm-cache circumstances.

write bomb logic

Code to protect interactivity from being harmed by large writes. FIXME

write behind

A.K.A. delayed write, asynchronously writing after the write request was made.

WS

FIXME

XDR

External Data Representation, the marshalled form of RPC communications. XDR deals with sizing and endianness issues.

XIP

"Execute in-place". For normal storage devices, executables are loaded into main memory before execution. For devices like a ram disk or flash system, it is possible to execute the program straight from the device, without duplication.

zero-copy

Networking where no data is copied to/from userspace during network transmit or receive. This is often touted as a universally good thing, but in fact it is a trade-off between the cost of the copy, and the cost of setting up page tables etc. such that user-space can read the data.

ZFOD

Zero-fill-on-demand, a page filled with zeros for which no space has yet been allocated.

zombie

A completed process whose parent has not waited for it. When a child exits, the return status must remain available to the parent, as it may later do a wait4(). A zombie holds its place in the process table until the parent waits for it.

zone

A particular area of free memory. Zoned allocators differentiate between intended uses, and are generally used to model different characteristics of the memory. For example, on the x86 there is only 16Mb of DMA-able memory; the zoned allocator will try to save DMA pages for processes specifically requesting ZONE_DMA memory. There are currently three zone types: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM (see highmem).