Linux Memory Management Overview

The HyperNews Linux KHG Discussion Pages

[Note: This overview of Linux's Memory Management is several years old. Linux's MM has gone through a nearly complete rewrite since this was written. However, if you can't understand the Linux MM code, reading this and understanding that this documents the predecessor to the current MM code may help you out.]

The Linux memory manager implements demand paging with a copy-on-write strategy relying on the 386's paging support. A process acquires its page tables from its parent (during a fork()) with the entries marked as read-only or swapped. Then, if the process tries to write to that memory space, and the page is a copy-on-write page, it is copied, and the page is marked read-write. An exec() results in the reading in of a page or so from the executable. The process then faults in any other pages it needs.

Each process has a page directory, which means it can access 1024 page tables pointing to a total of 1M (2^20) 4 KB pages, which is 4 GB of memory. A process' page directory is initialized during a fork by copy_page_tables(). The idle process has its page directory initialized during the initialization sequence.
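
The arithmetic above follows from how the 386 splits a 32-bit linear address into a directory index, a table index, and a page offset. Here is a minimal sketch of that split; the function and macro names are illustrative and are not taken from the kernel source.

```c
#include <stdint.h>

#define PAGE_SHIFT     12      /* 4 KB pages */
#define PTRS_PER_TABLE 1024u   /* entries in a page directory or a page table */

/* Index into the page directory: top 10 bits of the linear address. */
unsigned pgdir_index(uint32_t linear)
{
    return linear >> 22;
}

/* Index into the selected page table: middle 10 bits. */
unsigned pgtable_index(uint32_t linear)
{
    return (linear >> PAGE_SHIFT) & (PTRS_PER_TABLE - 1);
}

/* Byte offset inside the 4 KB page: low 12 bits. */
unsigned page_offset(uint32_t linear)
{
    return linear & ((1u << PAGE_SHIFT) - 1);
}

/* Total bytes reachable through one page directory. */
uint64_t addressable_bytes(void)
{
    return (uint64_t)PTRS_PER_TABLE * PTRS_PER_TABLE * (1u << PAGE_SHIFT);
}
```

With these definitions, addressable_bytes() comes out to 4 GB, and the address 0xc0000000 lands in directory slot 768.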

Each user process has a local descriptor table that contains a code segment and data-stack segment. These user segments extend from 0 to 3 GB (0xc0000000). In user space, linear addresses and logical addresses are identical.

On the 80386, linear addresses run from 0 GB to 4 GB. A linear address points to a particular memory location within this space. A linear address is not a physical address--it is a virtual address. A logical address consists of a selector and an offset. The selector points to a segment and the offset tells how far into that segment the address is located.

The kernel code and data segments are privileged segments defined in the global descriptor table and extend from 3 GB to 4 GB. The swapper page directory (swapper_pg_dir) is set up so that logical addresses and physical addresses are identical in kernel space.

The space above 3 GB appears in a process' page directory as pointers to kernel page tables. This space is invisible to the process in user mode but the mapping becomes relevant when privileged mode is entered, for example, to handle a system call. Supervisor mode is entered within the context of the current process so address translation occurs with respect to the process' page directory but using kernel segments. This is identical to the mapping produced by using swapper_pg_dir and kernel segments, because both page directories use the same page tables in this space. Only task[0] (the idle task, sometimes called the swapper task for historical reasons, even though it has nothing to do with swapping in the Linux implementation) uses the swapper_pg_dir directly.

- User process in user mode: segment_base = 0x00, page_dir private to the process.
- User process making a system call: segment_base = 0xc0000000, page_dir = the same user page_dir.
- swapper_pg_dir contains a mapping for all physical pages from 0xc0000000 to 0xc0000000 + end_mem, so the first 768 entries in swapper_pg_dir are 0's, and then there are 4 or more that point to kernel page tables.
- The user page directories have the same entries as swapper_pg_dir from entry 768 upward. The first 768 entries map the user space.

The upshot is that whenever the linear address is above 0xc0000000 everything uses the same kernel page tables.
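
The figure of 768 and the count of kernel entries can be checked with a little arithmetic, since each page directory entry covers 4 MB. The sketch below assumes that layout; the helper names are made up for illustration.

```c
#include <stdint.h>

#define KERNEL_BASE 0xc0000000u
#define PGDIR_SHIFT 22           /* one directory entry maps 4 MB */

/* First page directory slot that belongs to kernel space. */
unsigned first_kernel_slot(void)
{
    return KERNEL_BASE >> PGDIR_SHIFT;
}

/* Kernel page tables needed to map end_mem bytes of physical memory,
 * rounded up to whole 4 MB units. */
unsigned kernel_tables_needed(uint32_t end_mem)
{
    uint32_t span = 1u << PGDIR_SHIFT;
    return (end_mem + span - 1) / span;
}
```

For a machine with 16 MB of memory, kernel_tables_needed() gives 4 entries, starting at slot 768.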

The user stack sits at the top of the user data segment and grows down. The kernel stack is not a pretty data structure or segment that I can point to with a ``yon lies the kernel stack.'' A kernel_stack_frame (a page) is associated with each newly created process and is used whenever the kernel operates within the context of that process. Bad things would happen if the kernel stack were to grow below its current stack frame. [Where is the kernel stack put? I know that there is one for every process, but where is it stored when it's not being used?]

User pages can be stolen or swapped. A user page is one that is mapped below 3 GB in a user page table. This region does not contain page directories or page tables. Only dirty pages are swapped.

Minor alterations are needed in some places (tests for process memory limits come to mind) to provide support for programmer-defined segments.

[There is now a modify_ldt() system call used by dosemu, Wine, TWIN, and Wabi to create arbitrary segments.]

Physical memory

Here is a map of physical memory before any user processes are executed. The column on the left gives the starting address of the item, numbers in italics are approximate. The column in the middle names the item(s). The column on the far right gives the relevant routine or variable name or explains the entry.

0x110000	FREE	`memory_end` or `high_memory`
	`mem_map`	`mem_init()`
	`inode_table`	`inode_init()`
	device data	`device_init()`*
0x100000	more `pg_table`s	`paging_init()`
0x0A0000	RESERVED
0x060000	FREE
	`low_memory_start`
0x006000	kernel code + data
	`floppy_track_buffer`
	`bad_pg_table` `bad_page`	used by `page_fault_handler`s to kill processes gracefully when out of memory.
0x002000	`pg0`	the first kernel page table.
0x001000	`swapper_pg_dir`	the kernel page directory.
0x000000	null page

*Device inits that acquire memory are (main.c): profil_buffer, con_init, psaux_init, rd_init, scsi_dev_init.

Note that all memory not marked as FREE is RESERVED (mem_init). RESERVED pages belong to the kernel and are never freed or swapped.

A user process' view of memory

0xc0000000	The invisible kernel	reserved
	initial stack
	room for stack growth	4 pages
0x60000000	shared libraries
`brk`	unused
	malloc memory
`end_data`	uninitialized data
`end_code`	initialized data
0x00000000	text

Both the code segment and data segment extend all the way from 0x00 to 3 GB. Currently the page fault handler do_wp_page checks to ensure that a process does not write to its code space. However, by catching the SEGV signal, it is possible to write to code space, causing a copy-on-write to occur. The handler do_no_page ensures that any new pages the process acquires belong to either the executable, a shared library, the stack, or lie within the brk value.

A user process can reset its brk value by calling sbrk(). This is what malloc() does when it needs to. The text and data portions are allocated on separate pages unless one chooses the -N compiler option. Shared library load addresses are currently taken from the shared image itself. The address is between 1.5 GB and 3 GB, except in special cases.

User process Memory Allocation

	swappable	shareable
a few code pages	Y	Y
a few data pages	Y	N?
stack	Y	N
`pg_dir`	N	N
code/data `page_table`	N	N
stack `page_table`	N	N
`task_struct`	N	N
`kernel_stack_frame`	N	N
shlib `page_table`	N	N
a few shlib pages	Y	Y?

[What do the question marks mean? Do they mean that they might go either way, or that you are not sure?]

The stack, shlibs and data are too far removed from each other to be spanned by one page table. All kernel page_tables are shared by all processes so they are not in the list. Only dirty pages are swapped. Clean pages are stolen so the process can read them back in from the executable if it likes. Mostly only clean pages are shared. A dirty page ends up shared across a fork until the parent or child chooses to write to it again.

Memory Management data in the process table

Here is a summary of some of the data kept in the process table which is used for memory management:

Process memory limits: ulong start_code, end_code, end_data, brk, start_stack;
Page fault counting: ulong min_flt, maj_flt, cmin_flt, cmaj_flt
Local descriptor table: struct desc_struct ldt[32] is the local descriptor table for task.
rss: number of resident pages.
swappable: if 0, then process's pages will not be swapped.
kernel_stack_page: pointer to page allocated in fork.
saved_kernel_stack: V86 mode stuff
struct tss:
- Stack segments: esp0 = kernel stack pointer (kernel_stack_page), ss0 = kernel stack segment (0x10), esp1 = ss1 = esp2 = ss2 = 0 (unused privilege levels).
- Segment selectors: ds = es = fs = gs = ss = 0x17, cs = 0x0f. All point to segments in the current ldt[].
- cr3: points to the page directory for this process.
- ldt: _LDT(n) selector for the current task's LDT.
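
The selector values 0x17, 0x0f and 0x10 are easier to read once they are split into their fields. The following sketch decodes a selector using the standard 386 layout (index, table indicator, requested privilege level); the struct and function names are illustrative only.

```c
#include <stdint.h>

struct selector_fields {
    unsigned index;   /* slot in the descriptor table (bits 3-15) */
    unsigned in_ldt;  /* table indicator: 1 = LDT, 0 = GDT (bit 2) */
    unsigned rpl;     /* requested privilege level (bits 0-1) */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index  = sel >> 3;
    f.in_ldt = (sel >> 2) & 1;
    f.rpl    = sel & 3;
    return f;
}
```

Decoded this way, 0x0f is LDT entry 1 at privilege level 3 (user code), 0x17 is LDT entry 2 at privilege level 3 (user data and stack), and 0x10 is GDT entry 2 at privilege level 0 (the kernel stack segment).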

Memory initialization

In start_kernel() (main.c) there are 3 variables related to memory initialization:

memory_start starts out at 1 MB. Updated by device initialization.

memory_end end of physical memory: 8 MB, 16 MB, or whatever.

low_memory_start end of the kernel code and data that is loaded initially.

Each device init typically takes memory_start and returns an updated value if it allocates space at memory_start (by simply grabbing it). paging_init() initializes the page tables in the swapper_pg_dir (starting at 0xc0000000) to cover all of the physical memory from memory_start to memory_end. Actually the first 4 MB is done in startup_32 (head.S). memory_start is incremented if any new page_tables are added. The first page is zeroed to trap null pointer references in the kernel.
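
The "take memory_start, return the updated value" convention is a simple bump allocation. Here is a rough sketch of the pattern; example_dev_init() and page_align() are hypothetical helpers written for this illustration and do not correspond to any particular kernel routine.

```c
/* Hypothetical device init: claims `bytes` starting at memory_start,
 * reports where its buffer lives, and returns the new memory_start. */
unsigned long example_dev_init(unsigned long memory_start,
                               unsigned long bytes,
                               unsigned long *buffer_out)
{
    *buffer_out = memory_start;    /* device owns [memory_start, memory_start + bytes) */
    return memory_start + bytes;   /* the next init continues from here */
}

/* Round an address up to the next 4 KB page boundary. */
unsigned long page_align(unsigned long addr)
{
    return (addr + 0xfffUL) & ~0xfffUL;
}
```

A caller would chain such inits, passing the value returned by one as the memory_start of the next, so nothing has to track the allocations beyond the moving start address.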

In sched_init() the ldt and tss descriptors for task[0] are set in the GDT, and loaded into the TR and LDTR (the only time it's done explicitly). A trap gate (0x80) is set up for system_call(). The nested task flag is turned off in preparation for entering user mode. The timer is turned on. The task_struct for task[0] appears in its entirety in <linux/sched.h>.

mem_map is then constructed by mem_init() to reflect the current usage of physical pages. This is the state reflected in the physical memory map of the previous section.

Then Linux moves into user mode with an iret after pushing the current ss, esp, etc. Of course the user segments for task[0] are mapped right over the kernel segments so execution continues exactly where it left off.

task[0]:

pg_dir: = swapper_pg_dir which means the only addresses mapped are in the range 3 GB to 3 GB + high_memory.
LDT[1]: = user code, base=0xc0000000, size = 640K
LDT[2]: = user data, base=0xc0000000, size = 640K

The first exec() sets the LDT entries for task[1] to the user values of base = 0x0, limit = TASK_SIZE = 0xc0000000. Thereafter, no process sees the kernel segments while in user mode.

Processes and the Memory Manager

Memory-related work done by fork():

Memory allocation
- 1 page for the task_struct.
- 1 page for the kernel stack.
- 1 for the pg_dir and some for pg_tables (copy_page_tables)
Other changes
- ss0 set to kernel stack segment (0x10) to be sure?
- esp0 set to top of the newly allocated kernel_stack_page
- cr3 set by copy_page_tables() to point to newly allocated page directory.
- ldt = _LDT(task_nr) creates new ldt descriptor.
- descriptors set in gdt for new tss and ldt[].
- The remaining registers are inherited from parent.

The processes end up sharing their code and data segments (although they have separate local descriptor tables, the entries point to the same segments). The stack and data pages will be copied when the parent or child writes to them (copy-on-write).

Memory-related work done by exec():

Memory allocation
- 1 page for the exec header (the entire file for OMAGIC)
- 1 page or more for the stack (MAX_ARG_PAGES)
- clear_page_tables() is used to remove old pages.
- change_ldt() sets the descriptors in the new LDT[]:
  ldt[1] = code base=0x00, limit=TASK_SIZE
  ldt[2] = data base=0x00, limit=TASK_SIZE
  These segments are DPL=3, P=1, S=1, G=1, type=a (code) or 2 (data).
- Up to MAX_ARG_PAGES dirty pages of argv and envp are allocated and stashed at the top of the data segment for the newly created user stack.
- Set the instruction pointer of the caller: eip = ex.a_entry
- Set the stack pointer of the caller to the stack just created (esp = stack pointer). These will be popped off the stack when the caller resumes.
- Update memory limits:
  end_code = ex.a_text
  end_data = end_code + ex.a_data
  brk = end_data + ex.a_bss
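
The memory limit update can be written out as a small function. This is only a sketch: struct exec_hdr is a cut-down stand-in for the a.out header, holding just the fields named above, and struct mem_limits is invented for the example.

```c
/* Minimal stand-in for the a.out header fields used in the text. */
struct exec_hdr {
    unsigned long a_text;   /* size of the text segment */
    unsigned long a_data;   /* size of initialized data */
    unsigned long a_bss;    /* size of uninitialized data */
    unsigned long a_entry;  /* entry point */
};

struct mem_limits {
    unsigned long end_code, end_data, brk;
};

/* Derive the process memory limits from the header, as exec() does. */
struct mem_limits limits_from_header(const struct exec_hdr *ex)
{
    struct mem_limits m;
    m.end_code = ex->a_text;
    m.end_data = m.end_code + ex->a_data;
    m.brk      = m.end_data + ex->a_bss;
    return m;
}
```

Because the user segments start at 0, each limit is simply a running sum of the segment sizes.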

Interrupts and traps are handled within the context of the current task. In particular, the page directory of the current process is used in address translation. The segments, however, are kernel segments so that all linear addresses point into kernel memory. For example, assume a user process invokes a system call and the kernel wants to access a variable at address 0x01. The linear address is 0xc0000001 (using kernel segments) and the physical address is 0x01. The latter is because the process' page directory maps this range exactly as swapper_pg_dir does.
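
The two translation steps in this example can be sketched as plain arithmetic. The functions below only model the kernel-space case described here (segment base 0xc0000000, page tables mapping linear 0xc0000000 + x to physical x); the names are illustrative.

```c
#include <stdint.h>

#define KERNEL_SEG_BASE 0xc0000000u

/* Segmentation step: linear address = segment base + offset. */
uint32_t logical_to_linear(uint32_t seg_base, uint32_t offset)
{
    return seg_base + offset;
}

/* Paging step for kernel space as described in the text:
 * linear 0xc0000000 + x is mapped to physical x. */
uint32_t kernel_linear_to_physical(uint32_t linear)
{
    return linear - KERNEL_SEG_BASE;
}
```

Feeding the kernel's logical address 0x01 through both steps gives linear 0xc0000001 and then physical 0x01, which is why logical and physical addresses coincide in kernel space.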

The kernel space (0xc0000000 + high_memory) is mapped by the kernel page tables which are themselves part of the RESERVED memory. They are therefore shared by all processes. During a fork copy_page_tables() treats RESERVED page tables differently. It sets pointers in the process page directories to point to kernel page tables and does not actually allocate new page tables as it does normally. As an example the kernel_stack_page (which sits somewhere in the kernel space) does not need an associated page_table allocated in the process' pg_dir to map it.

The interrupt instruction sets the stack pointer and stack segment from the privilege 0 values saved in the tss of the current task. Note that the kernel stack is a really fragmented object--it's not a single object, but rather a bunch of stack frames each allocated when a process is created, and released when it exits. The kernel stack should never grow so rapidly within a process context that it extends below the current frame.

Acquiring and Freeing Memory: Paging Policy

[Note: swapping has also been massively changed in recent kernels, with the ``kswap'' changes.]

When any kernel routine wants memory it ends up calling get_free_page(). This is at a lower level than kmalloc() (in fact kmalloc() uses get_free_page() when it needs more memory).

get_free_page() takes one parameter, a priority. Possible values are GFP_BUFFER, GFP_KERNEL, GFP_NFS, and GFP_ATOMIC. It takes a page off of the free_page_list, updates mem_map, zeroes the page and returns the physical address of the page (note that kmalloc() returns a physical address. The logic of the mm depends on the identity map between logical and physical addresses).

That itself is simple enough. The problem, of course, is that the free_page_list may be empty. If you did not request an atomic operation, at this stage, you enter into the realm of page stealing which we'll go into in a moment. As a last resort (and for atomic requests) a page is torn off from the secondary_page_list (as you may have guessed, when pages are freed, the secondary_page_list gets filled up first).
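
The order of fallbacks described above can be summarized in a short sketch. It is a simplified model and not the kernel's code: the lists are plain arrays used as stacks, PRIO_NORMAL and PRIO_ATOMIC stand in for the GFP_* priorities, the steal callback stands in for try_to_free_page(), and the mem_map update and page zeroing are left out.

```c
#include <stddef.h>

enum { PRIO_NORMAL, PRIO_ATOMIC };   /* hypothetical stand-ins for GFP_* */

struct page_list {
    unsigned long *pages;   /* addresses of free pages, used as a stack */
    size_t count;
};

/* Take one page off a list; returns 0 if the list is empty. */
static int pop(struct page_list *l, unsigned long *out)
{
    if (l->count == 0)
        return 0;
    *out = l->pages[--l->count];
    return 1;
}

/* Returns a page address, or 0 if no page could be found. */
unsigned long sketch_get_free_page(int prio,
                                   struct page_list *primary,
                                   struct page_list *secondary,
                                   int (*steal)(struct page_list *))
{
    unsigned long page;

    if (pop(primary, &page))
        return page;                          /* the easy case */
    if (prio != PRIO_ATOMIC && steal && steal(primary) && pop(primary, &page))
        return page;                          /* page stealing freed one */
    if (pop(secondary, &page))
        return page;                          /* last resort, and for atomic requests */
    return 0;
}
```

The point of the reserve list in this model is that an atomic request, which may not wait for page stealing, still has somewhere to turn when the main list is empty.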

The actual manipulation of the page_lists and mem_map occurs in this mysterious macro called REMOVE_FROM_MEM_QUEUE() which you probably never want to look into. Suffice it to say that interrupts are disabled. [I think that this should be explained here. It is not that hard...]

Now back to the page stealing bit. get_free_page() calls try_to_free_page() which repeatedly calls shrink_buffers() and swap_out() in that order until it is successful in freeing a page. The priority is increased on each successive iteration so that these two routines run through their page stealing loops more often.

Here's one run through swap_out():

Run through the process table and get a swappable task, say, Q.
Find a user page table (not RESERVED) in Q's space.
For each page in the table try_to_swap_out(page).
Quit when a page is freed.

Note that swap_out() (called by try_to_free_page()) maintains static variables so it may resume the search where it left off on the previous call.

try_to_swap_out() scans the page tables of all user processes and enforces the stealing policy:

1. Do not fiddle with RESERVED pages.
2. Age the page if it is marked accessed (1 bit).
3. Don't tamper with recently acquired pages (last_free_pages[]).
4. Leave dirty pages with map_counts > 1 alone.
5. Decrement the map_count of clean pages.
6. Free clean pages if they are unmapped.
7. Swap dirty pages with a map_count of 1.

Of these actions, 6 and 7 will stop the process as they result in the actual freeing of a physical page. Action 5 results in one of the processes losing an unshared clean page that was not accessed recently (decrement Q->rss) which is not all that bad, but the cumulative effects of a few iterations can slow down a process considerably. At present, there are 6 iterations, so a page shared by 6 processes can get stolen if it is clean.
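
One way to read the stealing policy is as a decision function over the state of a single page, with the actions numbered in the order they are listed. The sketch below follows that reading; the struct, enum, and field names are invented for the example and the real try_to_swap_out() works directly on page table entries.

```c
enum steal_action { SKIP, AGE, UNMAP_CLEAN, FREE_CLEAN, SWAP_DIRTY };

struct page_state {
    int reserved;       /* page belongs to the kernel */
    int accessed;       /* accessed bit set in the page table entry */
    int recently_got;   /* listed in last_free_pages[] */
    int dirty;
    int map_count;      /* number of mappings of this physical page */
};

enum steal_action steal_policy(const struct page_state *p)
{
    if (p->reserved)
        return SKIP;                    /* action 1 */
    if (p->accessed)
        return AGE;                     /* action 2: clear the bit, try again later */
    if (p->recently_got)
        return SKIP;                    /* action 3 */
    if (p->dirty)
        return p->map_count > 1 ? SKIP          /* action 4 */
                                : SWAP_DIRTY;   /* action 7 */
    /* Clean page: dropping this mapping decrements map_count (action 5);
     * the physical page is freed only when that was the last mapping (action 6). */
    return p->map_count > 1 ? UNMAP_CLEAN : FREE_CLEAN;
}
```

Only FREE_CLEAN and SWAP_DIRTY release a physical page, which matches the remark that actions 6 and 7 end the search.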

Page table entries are updated and the TLB invalidated.

The actual work of freeing the page is done by free_page(), the complement of get_free_page(). It ignores RESERVED pages, updates mem_map, then frees the page and updates the page_lists if it is unmapped. For swapping (in 7 above), write_swap_page() gets called and does nothing remarkable from the memory management perspective.

The details of shrink_buffers() would take us too far afield. Essentially it looks for free buffers, then writes out dirty buffers, then goes at busy buffers and calls free_page() when it's able to free all the buffers on a page.

Note that page directories and page tables along with RESERVED pages do not get swapped, stolen or aged. They are mapped in the process page directory through reserved page tables. They are freed only on exit from the process.

The page fault handlers

When a process is created via fork, it starts out with a page directory and a page or so of the executable. So the page fault handler is the source of most of a process' memory.

The page fault handler do_page_fault() retrieves the faulting address from the register cr2. The error code (retrieved in sys_call.S) differentiates user/supervisor access and the reason for the fault--write protection or a missing page. The former is handled by do_wp_page() and the latter by do_no_page().
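
The dispatch on the error code can be sketched from the 386's page fault error code layout: bit 0 is set for a protection violation and clear for a missing page, bit 1 is set for a write, and bit 2 is set when the fault came from user mode. The macro and function names below are chosen for this illustration and are not the kernel's.

```c
enum fault_kind { FAULT_NO_PAGE, FAULT_WRITE_PROTECT };

#define PF_PROT  0x1u   /* 0 = page not present, 1 = protection violation */
#define PF_WRITE 0x2u   /* 0 = read access, 1 = write access */
#define PF_USER  0x4u   /* 0 = supervisor mode, 1 = user mode */

/* Decide which handler the fault belongs to:
 * FAULT_WRITE_PROTECT corresponds to do_wp_page(), FAULT_NO_PAGE to do_no_page(). */
enum fault_kind classify_fault(unsigned error_code)
{
    return (error_code & PF_PROT) ? FAULT_WRITE_PROTECT : FAULT_NO_PAGE;
}

/* Nonzero if the fault was raised while running in user mode. */
int fault_from_user_mode(unsigned error_code)
{
    return (error_code & PF_USER) != 0;
}
```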

If the faulting address is greater than TASK_SIZE the process receives a SIGKILL. [Why this check? This can only happen in kernel mode because of segment level protection.]

These routines have some subtleties as they can get called from an interrupt. You can't assume that it is the ``current'' task that is executing.

do_no_page() handles three possible situations:

1. The page is swapped.
2. The page belongs to the executable or a shared library.
3. The page is missing--a data page that has not been allocated.

In all cases get_empty_pgtable() is called first to ensure the existence of a page table that covers the faulting address. In case 3 get_empty_page() is called to provide a page at the required address and in case of the swapped page, swap_in() is called.

In case 2, the handler calls share_page() to see if the page is shareable with some other process. If that fails it reads in the page from the executable or library (It repeats the call to share_page() in case another process did the same meanwhile). Any portion of the page beyond the brk value is zeroed.

A page read in from the disk is counted as a major fault (maj_flt). This happens with a swap_in() or when it is read from the executable or a library. Other cases are deemed minor faults (min_flt).

When a shareable page is found, it is write-protected. A process that writes to a shared page will then have to go through do_wp_page() which does the copy-on-write.

do_wp_page() does the following:

Send SIGSEGV if any user process is writing to current code_space.
If the old page is not shared then just unprotect it.
Else get_free_page() and copy_page(). The page acquires the dirty flag from the old page. Decrement the map count of the old page.
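
The copy-on-write steps can be modeled in user space to make the bookkeeping concrete. This is a sketch under simplifying assumptions: struct page and struct pte are invented for the example, malloc() stands in for get_free_page(), and a return value of -1 stands in for both the SIGSEGV case and running out of memory.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page {
    unsigned char data[PAGE_SIZE];
    int map_count;   /* how many mappings share this page */
    int dirty;
};

struct pte {
    struct page *page;
    int writable;
};

/* Handle a write-protection fault on one mapping. Returns 0 on success. */
int sketch_wp_fault(struct pte *pte, int is_code_page)
{
    struct page *old = pte->page;
    struct page *copy;

    if (is_code_page)
        return -1;                        /* writing to code space: SIGSEGV */
    if (old->map_count == 1) {
        pte->writable = 1;                /* not shared: just unprotect it */
        return 0;
    }
    copy = malloc(sizeof *copy);          /* stands in for get_free_page() */
    if (copy == NULL)
        return -1;
    memcpy(copy->data, old->data, PAGE_SIZE);
    copy->dirty = old->dirty;             /* new page inherits the dirty flag */
    copy->map_count = 1;
    old->map_count--;                     /* one fewer mapping of the old page */
    pte->page = copy;
    pte->writable = 1;
    return 0;
}
```

Note that the last process still mapping the original page never pays for a copy: once the map count has dropped to 1, its own write fault only flips the protection.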

Paging

Paging is swapping on a page basis rather than by entire processes. We will use swapping here to refer to paging, since Linux only pages, and does not swap, and people are more used to the word ``swap'' than ``page.'' Kernel pages are never swapped. Clean pages are also not written to swap. They are freed and reloaded when required. The swapper maintains a single bit of aging info in the PAGE_ACCESSED bit of the page table entries. [What are the maintenance details? How is it used?]

Linux supports multiple swap files or devices which may be turned on or off by the swapon and swapoff system calls. Each swapfile or device is described by a struct swap_info_struct (swap.c).

static struct swap_info_struct {
      unsigned long flags;
      struct inode * swap_file;
      unsigned int swap_device;
      unsigned char * swap_map;
      char * swap_lockmap;
      int lowest_bit;
      int highest_bit;
} swap_info[MAX_SWAPFILES];

The flags field (SWP_USED or SWP_WRITEOK) is used to control access to the swap files. When SWP_WRITEOK is off space will not be allocated in that file. This is used by swapoff when it tries to unuse a file. When swapon adds a new swap file it sets SWP_USED. A static variable nr_swapfiles stores the number of currently active swap files. The fields lowest_bit and highest_bit bound the free region in the swap file and are used to speed up the search for free swap space.

The user program mkswap initializes a swap device or file. The first page contains a signature (`SWAP-SPACE') in the last 10 bytes, and holds a bitmap. Initially 0's in the bitmap signal bad pages. A `1' in the bitmap means the corresponding page is free. This page is never allocated so the initialization needs to be done just once.

The syscall swapon() is called by the user program swapon typically from /etc/rc. A couple of pages of memory are allocated for swap_map and swap_lockmap.

swap_map holds a byte for each page in the swapfile. It is initialized from the bitmap to contain a 0 for available pages and 128 for unusable pages. It is used to maintain a count of swap requests on each page in the swap file. swap_lockmap holds a bit for each page that is used to ensure mutual exclusion when reading or writing swap files.
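
To show how swap_map works together with lowest_bit and highest_bit, here is a rough sketch of a search for a free swap page. It is a simplified model and not the kernel's get_swap_page(): struct swap_area is a cut-down version of swap_info_struct, locking is ignored, and 0 is used as the failure value on the assumption that page 0 (the signature page) is never handed out.

```c
#define SWAP_UNUSABLE 128   /* swap_map value for pages that must not be used */

struct swap_area {
    unsigned char *swap_map;  /* one use count per swap page; 0 means free */
    int lowest_bit;           /* no free page lies below this index */
    int highest_bit;          /* no free page lies above this index */
};

/* Claim a free swap page and return its index, or 0 if none is free. */
int sketch_get_swap_page(struct swap_area *s)
{
    int i;

    for (i = s->lowest_bit; i <= s->highest_bit; i++) {
        if (s->swap_map[i] == 0) {
            s->swap_map[i] = 1;       /* one user of this swap page */
            s->lowest_bit = i + 1;    /* everything up to i is now in use */
            return i;
        }
    }
    return 0;
}
```

The two bounds keep the scan short: pages outside the window are known to be taken, so a nearly full swap file does not force a walk over the whole map.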

When a page of memory is to be swapped out an index to the swap location is obtained by a call to get_swap_page(). This index is then stored in bits 1-31 of the page table entry so the swapped page may be located by the page fault handler, do_no_page() when needed.

The upper 7 bits of the index give the swapfile (or device) and the lower 24 bits give the page number on that device. That makes as many as 128 swapfiles, each with room for about 64 GB, but the space overhead due to the swap_map would be large. Instead the swapfile size is limited to 16 MB, because the swap_map then takes 1 page.
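
The layout and the size figures can be worked through in code. The sketch below follows the description in the text (a 31-bit index with 7 bits of swapfile number above 24 bits of page number, stored in bits 1-31 so that the present bit stays clear); it is not a copy of the kernel's own macros, and all names are illustrative.

```c
#include <stdint.h>

#define SWAP_OFFSET_BITS 24
#define SWAP_PAGE_SIZE   4096u

/* Build the page table entry value for a swapped-out page. */
uint32_t swap_pte(unsigned type, uint32_t offset)
{
    uint32_t index = ((uint32_t)type << SWAP_OFFSET_BITS)
                   | (offset & ((1u << SWAP_OFFSET_BITS) - 1));
    return index << 1;   /* bit 0 (present) stays clear */
}

/* Swapfile (or device) number: the upper 7 bits of the index. */
unsigned swap_pte_type(uint32_t pte)
{
    return pte >> (SWAP_OFFSET_BITS + 1);
}

/* Page number within that swapfile: the lower 24 bits of the index. */
uint32_t swap_pte_offset(uint32_t pte)
{
    return (pte >> 1) & ((1u << SWAP_OFFSET_BITS) - 1);
}

/* Bytes of swap space covered by a swap_map of map_bytes bytes
 * (one byte of swap_map per swap page). */
uint64_t swap_capacity(uint32_t map_bytes)
{
    return (uint64_t)map_bytes * SWAP_PAGE_SIZE;
}
```

A 24-bit page number allows 2^24 pages of 4 KB, which is the 64 GB figure, while a swap_map of one 4096-byte page covers 4096 swap pages, which is the 16 MB limit.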

The function swap_duplicate() is used by copy_page_tables() to let a child process inherit swapped pages during a fork. It just increments the count maintained in swap_map for that page. Each process will swap in a separate copy of the page when it accesses it.

swap_free() decrements the count maintained in swap_map. When the count drops to 0 the page can be reallocated by get_swap_page(). It is called each time a swapped page is read into memory (swap_in()) or when a page is to be discarded (free_one_table(), etc.).
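
The reference counting done through swap_map amounts to very little code. This sketch operates on a bare array and leaves out the swapfile lookup and locking that the real swap_duplicate() and swap_free() need; the function names are prefixed to make clear they are illustrations.

```c
/* At fork: the child inherits a swapped page, so one more user. */
void sketch_swap_duplicate(unsigned char *swap_map, int page)
{
    swap_map[page]++;
}

/* One user gives up the swap page. Returns 1 if the page is now
 * free and can be handed out again, 0 otherwise. */
int sketch_swap_free(unsigned char *swap_map, int page)
{
    if (swap_map[page] == 0)
        return 0;                      /* already free; nothing to release */
    return --swap_map[page] == 0;
}
```

After a fork the count for an inherited swap page is 2, so the swap page survives until both parent and child have either swapped it in or discarded it.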
