10. Technical Details

10.1. How They Work

insmod makes an init_module system call to load the LKM into kernel memory. Loading it is the easy part, though. How does the kernel know to use it? The answer is that the init_module system call invokes the LKM's initialization routine right after it loads the LKM. insmod passes to init_module the address of the subroutine in the LKM named init_module as its initialization routine.

(This is confusing -- every LKM has a subroutine named init_module, and the base kernel has a system call by that same name, which is accessible via a subroutine in the standard C library also named init_module).

The LKM author set up init_module to call a kernel function that registers the subroutines that the LKM contains. For example, a character device driver's init_module subroutine might call the kernel's register_chrdev subroutine, passing the major and minor number of the device it intends to drive and the address of its own "open" routine among the arguments. register_chrdev records in base kernel tables that when the kernel wants to open that particular device, it should call the open routine in our LKM.

But the astute reader will now ask how the LKM's init_module subroutine knew the address of the base kernel's register_chrdev subroutine. This is not a system call, but an ordinary subroutine bound into the base kernel. Calling it means branching to its address. So how does our LKM, which was not compiled anywhere near the base kernel, know that address? The answer to this is the relocation that happens at insmod time.

How that relocation happens is fundamentally different between Linux 2.4 and Linux 2.6. In 2.6, insmod pretty much just passes the verbatim contents of the LKM file (.ko file) to the kernel and the kernel does the relocation. In 2.4, insmod does the relocation and passes a fully linked module image, ready to stuff into kernel memory, to the kernel. The description below covers the 2.4 case.

insmod functions as a relocating linker/loader. The LKM object file contains an external reference to the symbol register_chrdev. insmod does a query_module system call to find out the addresses of various symbols that the existing kernel exports. register_chrdev is among these. query_module returns the address for which register_chrdev stands and insmod patches that into the LKM where the LKM refers to register_chrdev.

If you want to see the kind of information insmod can get from a query_module system call, look at the contents of /proc/ksyms.

Note that some LKMs call subroutines in other LKMs. They can do this because of the __ksymtab and .kstrtab sections in the LKM object files. These sections together list the external symbols within the LKM object file that are supposed to be accessible by other LKMs inserted in the future. insmod looks at __ksymtab and .kstrtab and tells the kernel to add those symbols to its exported kernel symbols table.

To see this for yourself, insert the LKM msdos.o and then notice in /proc/ksyms the symbol fat_add_cluster (which is the name of a subroutine in the fat.o LKM). Any subsequently inserted LKM can branch to fat_add_cluster, and in fact msdos.o does just that.

10.2. The .modinfo Section

An ELF object file consists of various named sections. Some of them are basic parts of an object file, for example the .text section contains executable code that a loader loads. But you can make up any section you want and have it used by special programs. For the purposes of Linux LKMs, there is the .modinfo section. An LKM doesn't have to have a section named .modinfo to work, but the macros you're supposed to use to code an LKM cause one to be generated, so they generally do.

To see the sections of an object file, including the .modinfo section if it exists, use the objdump program. For example:

To see all the sections in the object file for the msdos LKM:
objdump msdos.o --section-headers
To see the contents of the .modinfo section:
objdump msdos.o --full-contents --section=.modinfo

You can use the modinfo program to interpret the contents of the .modinfo section.

So what is in the .modinfo section and who uses it? insmod uses the .modinfo section for the following:

10.3. The __ksymtab And .kstrtab Sections

Two other sections you often find in an LKM object file are named __ksymtab and .kstrtab. Together, they list symbols in the LKM that should be accessible (exported) to other parts of the kernel. A symbol is just a text name for an address in the LKM. LKM A's object file can refer to an address in LKM B by name (say, getBinfo"). When you insert LKM A, after having inserted LKM B, insmod can insert into LKM A the actual address within LKM B where the data/subroutine named getBinfo is loaded.

See Section 10.1 for more mind-numbing details of symbol binding.

10.4. Ksymoops Symbols

insmod adds a bunch of exported symbols to the LKM as it loads it. These symbols are all intended to help ksymoops do its job. ksymoops is a program that interprets and "oops" display. And "oops" display is stuff that the Linux kernel displays when it detects an internal kernel error (and consequently terminates a process). This information contains a bunch of addresses in the kernel, in hexadecimal.

ksymoops looks at the hexadecimal addresses, looks them up in the kernel symbol table (which you see in /proc/ksyms (Linux 2.4) or /proc/kallsyms (Linux 2.6), and translates the addresses in the oops message to symbolic addresses, which you might be able to look up in an assembler listing.

So lets say you have an LKM crash on you. The oops message contains the address of the instruction that choked, and what you want ksymoops to tell you is 1) in what LKM is that instruction, and 2) where is the instruction relative to an assembler listing of that LKM? Similar questions arise for the data addresses in the oops message.

To answer those questions, ksymoops must be able to get the loadpoints and lengths of the various sections of the LKM from the kernel symbol table.

Well, in Linux 2.4, insmod knows those addresses, so it just creates symbols for them and includes them in the symbols it loads with the LKM.

In particular, those symbols are named (and you can see this for yourself by looking at /proc/ksyms):

__insmod_name_Ssectionname_Llength

name is the LKM name (as you would see in /proc/modules.

sectionname is the section name, e.g. .text (don't forget the leading period).

length is the length of the section, in decimal.

The value of the symbol is, of course, the address of the section.

Insmod also adds a pretty useful symbol that tells from what file the LKM was loaded. That symbol's name is

__insmod_name_Ofilespec_Mmtime_Vversion

name is the LKM name, as above.

filespec is the file specification that was used to identify the file containing the LKM when it was loaded. Note that it isn't necessarily still under that name, and there are multiple file specifications that might have been used to refer to the same file. For example, ../dir1/mylkm.o and /lib/dir1/mylkm.o.

mtime is the modification time of that file, in the standard Unix representation (seconds since 1969), in hexadecimal.

version tells the kernel version level for which the LKM was built (same as in the .modinfo section). It is the value of the macro LINUX_VERSION_CODE in Linux's linux/version.h file. For example, 132101.

The value of this symbol is meaningless.

In Linux 2.6, it works differently. (I haven't figured out how yet).

10.5. Other Symbols

insmod adds another symbol, similar to the ksymoops symbols. This one tells where the persistent data lives in the LKM, which rmmod needs to know in order to save the persistent data.

__insmod_name_Plength

10.6. Debugging Symbols

There is another kind of symbol that relates to an LKM: kallsyms symbols. These are not exported symbols; they do not show up in proc/ksyms. They refer to addresses in the kernel that are nobody's business except the module they are in, and are not meant to be referenced by anything except a debugger. Kdb, the kernel debugger that comes with the Linux kernel, is one user of kallsyms symbols.

The kallsyms facility works for both the base kernel and LKMs. For the base kernel, you select it when you build the base kernel, with the CONFIG_KALLSYMS configuration option. When you do that, the kernel contains a kallsyms symbol for all the symbols in the base kernel object files. You know your base kernel is participating in the kallsyms facility if you see the symbol __start___kallsyms in /proc/ksyms.

For an LKM, you decide at load time whether it will contain kallsyms symbols. You include the kallsyms definitions in the data you pass to the init_module system call to load the LKM. insmod does this if either 1) you specify the --kallsyms option, or 2) insmod determines, by looking at /proc/ksyms, that the base kernel is participating in the kallsyms facility. The kallsyms that insmod defines are all the symbols in the LKM object file. To wit, those are the symbols you see when you run nm on the LKM object file.

Each loaded LKM that is participating in kallsyms has its own kallsyms symbol table. When the base kernel is participating in the kallsyms facility, the individual LKM kallysms symbol tables are linked into a master symbol table so that a debugger can look up a symbol anywhere in the kernel. When the base kernel is not participating in kallsyms, a debugger must look explicitly at a particular LKM to find symbols for that LKM. Kdb, for one, cannot do this. So the basic rule is: If you're going to do any kernel debugging, use CONFIG_KALLSYMS.

Note that the __kallsyms section has nothing to do with LKMs. That's a section in the base kernel object module. The base kernel doesn't have the luxury of something as high-level as sophisticated as insmod to load it, so it needs that extra object file section to facilitate its participation in kallsyms.

Similarly, the program kallsyms has nothing to do with LKMs. It is what creates the __kallsyms section.

There is another kind of debugging symbol -- the kind that gcc creates with its -g option. These are unrelated to the kallsyms facility. They do not get loaded into kernel memory. Kdb does not use them. But Kgdb (which gets information both from kernel memory and source and object files) does.

10.7. Memory Allocation For Loading

This section is about how Linux allocates memory in which to load an LKM. It is not about how an LKM dynamically allocates memory, which is the same as for any other part of the kernel.

The memory where an LKM resides is a little different from that where the base kernel resides. The base kernel is always loaded into one big contiguous area of real memory, whose real addresses are equal to its virtual addresses. That's possible because the base kernel is the first thing ever to get loaded (besides the loader) -- it has a wide open empty space in which to load. And since the Linux kernel is not pageable, it stays in its homestead forever.

By the time you load an LKM, real memory is all fragmented -- you can't simply add the LKM to the end of the base kernel. But the LKM needs to be in contiguous virtual memory, so Linux uses vmalloc to allocate a contiguous area of virtual memory (in the kernel address space), which is probably not contiguous in real memory. But the memory is still not pageable. The LKM gets loaded into real page frames from the start, and stays in those real page frames until it gets unloaded.

This design is based on the principle that it's much easier to get a large contiguous area of virtual memory than to get a large contiguous area of real memory because there is orders of magnitude more virtual address space than real memory. In the early days of virtual memory, that was certainly true -- a 32 bit machine had a 4GiB virtual address space and rarely more than 64 MiB of real memory. Now, however, it's quite often just the reverse, with virtual address space being the constrained resource. So this LKM loading strategy may not make sense on your system.

Some CPUs can take advantage of the properties of the base kernel to effect faster access to base kernel memory. For example, on one machine, the entire base kernel is covered by one page table entry and consequently one entry in the translation lookaside buffer (TLB). Naturally, that TLB entry is virtually always present. For LKMs on this machine, there is a page table entry for each memory page into which the LKM is loaded. Much more often, the entry for a page is not in the TLB when the CPU goes to access it, which means a slower access.

This effect is probably trivial.

It is also said that PowerPC Linux does something with its address translation so that transferring between accessing base kernel memory to accessing LKM memory is costly. I don't know anything solid about that.

The base kernel contains within its prized contiguous domain a large expanse of reusable memory -- the kmalloc pool. In some versions of Linux, the module loader tries first to get contiguous memory from that pool into which to load an LKM and only if a large enough space was not available, go to the vmalloc space. Andi Kleen submitted code to do that in Linux 2.5 in October 2002. He claims the difference is in the several per cent range.

10.8. Linux internals

If you're interested in the internal workings of the Linux kernel with respect to LKMs, this section can get you started. You should not need to know any of this in order to develop, build, and use LKMs.

The code to handle LKMs is in the source files kernel/module.c in the Linux source tree.

The kernel module loader (see Section 5.4) lives in kernel/kmod.c.

(Ok, that wasn't much of a start, but at least I have a framework here for adding this information in the future).