"Linux Gazette...making Linux just a little more fun!"
Dispelling the Kernel Compiling Myth
"Thou hast to recompile thee kernel".
This antique curse has been thrown on every Linux newcomer since the
birth of Linux. Unfortunately as long as kernel recompiling is deemed a
necessary part of a Linux installation it will be impossible to spread
Linux between non-nerds. In this article we will make a detailed analysis
of the performance increases one can expect of kernel compiling.
"Thanks to kernel recompiling you can free your installation kernel of
much unneeded bloat. You also should compile permanently used modules in
the kernel for additional savings. A leaner kernel will make your computer
faster thanks to reducing paging".
Let's quantify this.
To begin with we will see module compiling. Compiling a module in the
kernel will save a little more than 2K per module: 2K due to page alignment
and a small bit of code for the loading, unloading of the module. Now,
despite being a module fanatic I never managed to be in a situation with
more than ten modules loaded, but let's imagine you have 20 modules loaded
and all of them are needed permanently so you recompile them in the kernel.
You would save 40K of memory, that is 0.5% of the memory of an 8 Meg computer.
Now we will look at benefits of a lean kernel. When Matt Welsh wrote
his books kernel recompiling was undoubtedly necessary. It was not uncommon
to be able to save above 1.5 Megs of memory and your average computer had
8 Megs of RAM. Thus recompiling would increase memory available from 5.5
to 7 Megs that is a 27% increase.
But people failed to notice that Linux has gone modular and computers
got more memory. Today most distributions ship modular kernels so recompiling
will get benefits much smaller than in 1995. As an example I tested recompiling
the kernel shipped in RedHat 5.2 with everything unneeded thrown out and
modularizing everything else when it was possible. The boot messages (that
is before loading of any module) showed I had saved a mere 400K. In addition
today even low end computers have 32 Megs of RAM that means that recompiling
your kernel will increase your available memory of only 1.25%
It is possible to write a specially designed program who will not do
a single page fault with N Megs of memory and thrash horribly if you reduce
it by a single page. However in normal situations a 1.25% increase in memory
available will make little difference. There ARE still a couple distributions
who ship kernels good for little else outside installation: huge kernels
lacking essential features so recompiling is not a performance issue but
a requirement. Now consider what happens if a small company without a full-time
guru needs a firewall. Its expert is good for little else short of starting
Word. If he stumbles upon a distribution with one of those broken kernels
he will fail and will end recommending NT.
Most modern distribs (Caldera, Suse, RedHat and their clones) ship fully-featured
kernels and in addition kernel recompiling will produce no appreciable
speed increase due to memory savings: they are good enough out of the box.
Only a couple of "hackeristic" distribs will force you to recompile the
kernel. But for the good of Linux you should ask the maintainers to fix
them instead of supplying for their deficiencies. YOU can recompile but
your neighbour cannot and he will choose NT.
Evaluating CPU speedups due to recompiling
"Recompiling will allow you to build a faster kernel because you will be
able to compile for the right CPU".
Again let's quantify this. Linux performs a number of optimizations
for CPU type but most of them are performed at execution time and don't
depend on compiling options. For one part we will quantify the influence
due to alternative portions of code being compiled and we will also take
a look at the influence of compilation options in the code generated by
Effect of the ifdefs
If you take a look at the source code of the 2.0 kernel you will notice
only two portions of code whose inclusion depends on CPU type. The first
one is related to selective invalidation of TLB entries and the second
one is related to the way used for swapping bytes. In both cases the choice
is 386 versus everything else. There was a third portion of code who depended
on CPU time: the way blocks of memory were copied: the fastest way for
386 and PPros, Pentim IIs is slightly sub-optimal on 486s and much slower
on plain Pentiums. However this optimization has been disabled and now
whatever CPU you have blocks of memory are copied the 386-PPro-PII way.
Effect of byte swapping
Byte swapping takes place in two cases: header info when trading packets
through a network with a different endian machine and addressing info for
SCSI peripherals. In both cases the content (eg what you write to an SCSI
disk) is not changed. The only effect is on headers/control info and that
is only a minimal part of the CPU time spent for networking/SCSI activity
so it has no noticeable effect on performance.
Effects of selective invalidation of TLB
We will explain some basics about VM and address translation. When given
an address the CPU will first look into a page directory, and later into
a page table in order to translate the virtual address into a real address
before being able to access the data. That means a threefold slowdown because
there are three accesses to memory instead of one. In fact it could be
much more than that in case the page table entries are in slow regular
RAM while the real data is in the much faster cache. To avoid this the
CPU keeps a list of the last accessed pages and of their translations into
an internal ultra-fast memory called the TLB (translation lookaside buffer).
Now suppose the kernel wants to unmap a page belonging to a process, it
will modify the page tables but the problem is they are no longer in sync
with the TLB so if the CPU finds the adress in TLB it will not look at
the page tables and will use the wrong data. Therefore the kernel needs
to tell the CPU to avoid using the TLB entry, but 386s don't support selective
invalidation of TLB entries so the kernel invalidates the whole TLB. Now
the kernel you get with your distribution has to be able to work with 386s
as well as newer processors so they are compiled to use total TLB invalidation
and that means if you are using a newer processor you lose the benefits
of selective invalidation.
Let's look now at the circumsatnces where selective TLB invalidation
has a significant effect and let's quantify the slow down.
First of all if the kernel unmaps a page and then handles control to
another process it will reload CR3 and that will cause a total TLB invalidation
(different processes have entirely different mappings) so you get any benefit
only if control is handled back to the same process either immediately
or after some time in kernel mode. Also consider that time wasted due to
entire TLB invalidation is some microseconds while disk IO takes 10 milliseconds
in best case that is one thousand times more. That means in case there
is disk IO following this unmapping (due to swap out) benefits would be
In fact about the only case where selective TLB will be meaningful would
be in the following scenario: process frees memory so the kernel will invalidate
TLB, it handles control to the same process and then the process scans
a large array doing only a single access for every entry, then just when
the TLB is fully reloaded, it unmaps memory again, new TLB invalidation,
kernel gives back control again and then the process scans the same array
entries. Highly theorical and don't forget that during the second pass
page entries will be in cache so address translation will be much faster
and this will reduce benefits got due to selective TLB invalidation.
Let's evaluate what happens in a normal process. We will arbitrarily
assume this process runs for one tick (10 ms) after the unmapping.
For everything else we will take the worst case. The slower the memory
the more costly is translation so we will assume this computer uses 60
ms DRAM instead of SDRAM. The larger the TLB the bigger the benefits of
selective invalidation so we will choose a CPU with a big TLB in our case
it will be an AMD K6 model 7: it has a 64 entry TLB for code pages and
a 128 entry TLB for data pages. We will also assume that we never find
nor page table entries nor page directory entries in cache (the later is
very irrealistic because a single directory entry is used every 4 Megs
of address space) so every translation will need 2x60=120 ns so the complete
refilling of the TLB needs 120 ns * 192 TMB entries = 23 microseconds.
Because we assumed the process would be running for a whole tick that means
the slowdown due to address translation is only 0.2 per cent.
Effects of tuning GCC options
Precise measuring of kernel timing is quite difficult, in addition the
kernel is a mix of C and assembler. What will we do will be to recompile
the Byte benchmark using GCC 126.96.36.199 with the same flags used in 2.0 kernels
both for 386s (the one used for native kernels in distributions) and for
Pentiums and above (486 is an intermediary case). However those benchmarks
will give us a good idea, with perhaps a bias towards overestimation because
the Byte benchmarks are pure C so the compiler gains will be felt in full
while the kernel is a mix of C and assembler the later being unaffected
by compiler optimizations.
The benchmarks were run in two computers: a Pentium 75 and an AMD K6-300.
The Pentium tuned test was effectively faster than the 386 tuned test ...
by a mere 1.8% on the P75, about the same in the AMD. The conclusions to
be drawn is that GCC 2.7 for the x86 family has little model-dependent
optimizations nor are the alignment optimizations particularly effective.
Those paltry TWO percent (rounded UP) is all you get when you listen to
the words of wisdom dispensated in magazines.
If you are an expert and have a spare machine for experimenting then
you could try recompilings using more agressive optimizations than the
standard -O2 or using a better compiler than gcc 2.7 like egcs or pgcc.
However be warned that all 2.0 kernels until 2.0.35 and possibly 2.0.36
have some bugs who will break the kernel with any other compiler than gcc
2.7 (they work due to gcc 2.7 bugs). Also be wary about some optimizations
like loop-unrolling who according to egcs or pgcc doc were never thorougly
tested be in gcc, egcs or pgcc and that egcs and pgcc are not as well tested
as gcc (egcs 1.0 was notorious for its FP bugs). Given these warnings there
is a 7% speed difference between the Byte benchmarks compiled with -O6
and loop-unrolling against plain -O2. So playing with compiler and compiler
flags is an interesting possibility if you are an expert: it could help
the kernel developpers to determine what are the more agresive optimizations
who don't break the kernel. If you are not an expert then don't lose sleep
about this. The problem is that only a small part of the time spent
by your program will be spent executing those parts of kernel code affected
Now remember that if your process spends only 10% of its time in kernel
parts written in C then recompiling the kernel with a compiler generating
30% faster code will only provide a 3% speed increase in the overall performance.
If your program spends 90% of its CPU time in user mode then kernel optimizations
will be hardly felt.
Compiler optimizations will have no effect whenever the kernel runs parts
written in assembler.
Many kernel-intensive processes are in fact IO-bound: the CPU waits for
the peripheral. That means that if there is only one active process the
kernel will end its job earlier and will wait a bit longer until the disk
is ready. In that case you will get any benefit only if you have two active
processes: the speed increase in the kernel will allow running the other
process until it gets the answer of the peripheral.
Consider also that there are some peripherals (notoriously some broken
IDE disks) who force the kernel to enter active loops until it gets the
answer of the peripheral. That means that recompiling your
kernel will only affect the number of times the kernel executes the loop.
Two cases were the kernel spends time doing pure CPU are pipe data transfers
and disk reading when data is found in cache. This should benefit from
tuning the compiler flags were it not that data transfer is done in assembler
and will not be affected by compiler magic.
Kernel recompiling for your specific processor gives only a minimal
CPU boost when the kernel version is 2.0 and the processor is a 1998 or
earlier model of the i386 architecture. This could change in
future versions of Linux or when using newer processors.
Advice and conclusions
Kernel compiling is not presently an effective way to optimize a Linux
box. Don't do it if it frightens you. At most, because it is easy and relatively
safe, prepare a rescue floppy, ensure you can boot from it and then recompile
changing only two things: processor type and disable FPU emulation if you
have one (do a cat /proc/cpuinfo if you don't know). With most distributions
you will get exactly the same drivers your distribution kernel was compiled
(keep a backup of the original modules just in case).
Kernel compiling has been seen as the panacea for Linux optimization.
Unfortunately this doesn't resist serious analysis. It also has two serious
drawbacks. First it is poor public relations for spreading Linux between
normal people. Second this has sterilized investigation of more effective
The people advocating other solutions will use kernel compiling as
an argument against Linux. Let's kill this myth.
Some broken IDE disks absorb 90% of CPU time when data tranfer is taking
place, tuning them with hdparms can reduce this to 20%. But tuning
hdparms is very dangerous and everyone who has used has suffered massive
data corruption at least once. Never use it unless you can backup your
disks or perform your tests having a single partition mounted and that
one being expendable. But if half the energy who has been spent in
kernel compiling had been spent on hdparms we would have a data base specifying
what settings can be safely used according to disk and chipset model.
Little has been written about to the placement of swap partitions, however
smart placement of them can shorten the moves of the disk arm. In addition
if you have two or more disks you can play with swap partition priorities
in order to get your pages being spread evenly between two disks thus doubling
transfer rate. You can also try placing your partition in a different disk
than Linux itself.
Your kernel can be tuned by writing in files under /proc/sys. Problem is
we have had little experimentation for finding the right values. In fact
few people know about this. Again emphasis on kernel compiling has precluded
serious investigation about it.
Copyright © 1999, Jean Francois Martinez
Published in Issue 37 of Linux Gazette, February 1999