...making Linux just a little more fun!
Many months ago I had a few drinks with some fellow hackers. The discussion touched on the inevitable "my-Linux-is-better-than-your-*BSD-and-vice-versa" topic. I tried to argue in favour of Linux because it knows multiple TCP variants. They corrected me and said that there is only one TCP and that *BSD speaks it very well. Days after the side effects of the drinks were gone, I remembered the discussion and felt a sense of l'esprit d'escalier. Of course, there is a clear definition what the famous Tricky Communication Protocol (TCP) looks like, but there are many ways of dealing with situations in Wide Area Networks (WANs) where congestion, round-trip times, and packet loss play a major role. That's what I wanted to say. Enter Linux's pluggable congestion control algorithms.
Every time you wish to transport data in a reliable way, you probably use TCP (or your car, but let's stick to networks). TCP is able to transport large amounts of data in the correct order over unreliable network links. The protocol keeps track of sent data, detects lost packets, and retransmits them if necessary. This is done by acknowledging every packet to the sender. Basically, this is how TCP works. My article would end here were it not for some parameters of TCP connections that make things very interesting and complicated. (To some, those attributes are the same.) The first perturbation stems from the network itself, and materialises in the form of three parameters that pertain to every TCP connection.
How does TCP manage the parameter settings? Fortunately the protocol has some auto-tuning properties. One important parameter is the TCP window. The window is the amount of data the sender can send before receiving an acknowledgement. This means that, as long as the window is "not full", the sender can blindly keep sending packets. As soon as the window is filled, the sender stops sending, and waits for acknowledgement packets. This is TCP's main mechanism of flow control, that enables it to detect bottlenecks in the network during a transmission. That's why it's also called the congestion window. The default window size varies. Usually Linux starts at 32 kB. The window size is flexible, and can be changed during the transfer. This mechanism is called window scaling. Starting with Linux kernel 2.6.17, the window can be up to 4 MB. The window can grow only as long as there is no packet loss. That's one of the reasons why data throughput is reduced on lossy network links.
TCP has some more interesting features, but I won't go into more details. We now know enough, for the features of the Linux kernel I wish to describe.
TCP is able to transport large amounts of data in the correct order over unreliable network links. The protocol keeps track of sent data, detects lost packets, and retransmits them if necessary.
The right size of the TCP window is critical to efficient transmission. The tricky part is to find that right size. Either a too-small and a too-big window will degrade throughput. A good guess is the use the bandwidth-delay product. Furthermore, you can base estimates on periodically measured RTT or packet loss, and make it dynamic. This is why several researchers have explored algorithms to help TCP tune itself better, under certain circumstances. Beginning with 2.6.13, the Linux kernel supports plugins for the TCP stack, and enables switching between algorithms depending on what the system is connected to. The first strategy on the Internet was TCP Tahoe, proposed in 1988. A later variant was TCP Reno, now widely implemented in network stacks, and its successor TCP NewReno. Currently, the Linux kernel includes the following algorithms (as of kernel 184.108.40.206):
Switching between the different algorithms can be easily done, by writing text to a /proc/ entry.
nightfall:~# echo "westwood" > /proc/sys/net/ipv4/tcp_congestion_control nightfall:~# cat /proc/sys/net/ipv4/tcp_congestion_control westwood nightfall:~#A list of available modules can be found here:
nightfall:~# ls /lib/modules/`uname -r`/kernel/net/ipv4/ ip_gre.ko netfilter tcp_cubic.ko tcp_htcp.ko tcp_lp.ko tcp_vegas.ko ipip.ko tcp_bic.ko tcp_highspeed.ko tcp_hybla.ko tcp_scalable.ko tcp_veno.ko nightfall:~#When writing to /proc/, you can skip the tcp_ prefix. If you compile your own kernels, you will find the modules in the Networking -> Networking options -> TCP: advanced congestion control section. Since some of the algorithms affect only the sender's side, you may not notice a difference when enabling them. In order to see changed behaviour, you have to create a controlled setup, and measure the parameters of TCP transmissions.
Now that we know how to control the congestion algorithms, all we need is a bottleneck. As long as your Linux box(es) are connected to the local Ethernet with 100 Mbit/s or more, all packets will be sent immediately, and your local queues are always empty. The queues end up on your modem or gateway, which may treat them differently from the way Linux does. Even if you use a Linux gateway, there won't be a queue there, because very often this gateway connects via Ethernet to another device that handles the Internet connection. If you want to know what Linux does with queues, you need to set up a controlled bottleneck between two or more Linux boxes. Of course, you can try to saturate your LAN, but your core switch and your colleagues might not be impressed.
You can use two separate routers and connect them with serial crossover cables. You can set up two Linux boxes as routers, and create the bottleneck link with USB 1.1 crossover cabling, serial links, Bluetooth devices, 802.11b/g WLAN equipment, parallel links, or slower Ethernet cards (10 Mbit/s, for example). You could also do traffic shaping in-between, and reduce the bandwidth or increase the latency. The Linux kernel has a nice tool for doing this. It's called netem, which is short for Network Emulator. Recent distributions have it. In case you compile your own kernels, you'll find it here:
Networking --> Networking Options --> QoS and/or fair queuing --> Network emulatorAdditionally, you need to enable the Advanced Router config option. netem can be activated by the tc tool from the iproute package. It allows you to
tc qdisc add dev eth1 root netem corrupt 2.5% delay 50ms 10ms tc qdisc add dev eth0 root netem delay 100ms reorder 25% 50%This means that network device eth1 corrupts 2.5% of all packets passing through it. Additionally, all packets get delayed by 50ms 10ms. Device eth1 introduces a delay of at least 100 ms for reordered packets. 25% of the packets will be sent immediately. The 50% indicates a correlation between random reordering events. You can combine different netem options. Make sure that you enter only one tc command for every network device with all options and parameters attached. netem is a network queue, and can only be installed once for every network device. You can remove it any time you want, though.
After your new and shiny bottleneck works, you can start moving data from one end to the other. Every program that speaks TCP can be used for that. iperf, netpipe, any FTP client, netcat, or wget are good tools. Recording the network flows with wireshark and to postprocess them with tcptrace is also an option. wireshark has some features that allow you to analyse a TCP connection such as creating a graph of RTT values, sequence numbers, and throughput -- though a real evaluation for serious deployment must use a finer resolution and a better statistic than simple progress bars.
You can also do in-kernel measurement by using the module tcpprobe. It is available when the kernel is compiled with kernel probes that provide hooks to kernel functions (called Kprobe support). tcpprobe can be enabled by loading it with modprobe and giving a TCP port as module option. The documentation of tcpprobe features a simple example:
# modprobe tcpprobe port=5001 # cat /proc/net/tcpprobe >/tmp/data.out & # pid=$! # iperf -c otherhost # kill $pidtcpprobe gets loaded and is bound to port 5001/TCP. Reading /proc/net/tcpprobe directly gives access to the congestion window and other parameters of the transmission. It produces one line of date in text format for every packet seen. Using port 0 instead of 5001 allows measuring the window of all TCP connections to the machine. The Linux network people have more tips for testing TCP in their wiki. Also, be sure to familiarise yourself with the methods and tools, in case you plan to do some serious testing.
Keep in mind that TCP really is a can of worms that get very complex, and that the developers of all Free operating systems deserve a lot of recognition and thanks for dealing with these issues.
Most of the new TCP algorithms incorporated into the Linux kernel were designed for very specific purposes. The modules are no magic bullet. If you are connected to a 56k modem, then no congestion algorithm in the Universe will give you more bandwidth than your modem can handle. However, if multiple TCP flows have to share the same link, then some algorithms give you more throughput than others. The best way to find out is to create a test environment with defined conditions, and make comparisons on what you see. My intention was to give you an overview of what's going on in the kernel, and how flexible Linux is when dealing with variable network environments. Happy testing!
No packets were harmed while preparing this article. You might wish to take a look at the following tools and articles suitable to save your network link and deepen the understanding of what the Linux kernel does with your packets.
The only reason that I mentioned the religious Linux/*BSD discussion is the fact of being at a party and talking with friends. I don't wish to imply that one OS is better than the other. You can solve and create all your problems with both systems. Keep in mind that TCP really is a can of worms that get very complex, and that the developers of all Free operating systems deserve a lot of recognition and thanks for dealing with these issues.
René was born in the year of Atari's founding and the release of the game Pong. Since his early youth he started taking things apart to see how they work. He couldn't even pass construction sites without looking for electrical wires that might seem interesting. The interest in computing began when his grandfather bought him a 4-bit microcontroller with 256 byte RAM and a 4096 byte operating system, forcing him to learn assembler before any other language.
After finishing school he went to university in order to study physics. He then collected experiences with a C64, a C128, two Amigas, DEC's Ultrix, OpenVMS and finally GNU/Linux on a PC in 1997. He is using Linux since this day and still likes to take things apart und put them together again. Freedom of tinkering brought him close to the Free Software movement, where he puts some effort into the right to understand how things work. He is also involved with civil liberty groups focusing on digital rights.
Since 1999 he is offering his skills as a freelancer. His main activities include system/network administration, scripting and consulting. In 2001 he started to give lectures on computer security at the Technikum Wien. Apart from staring into computer monitors, inspecting hardware and talking to network equipment he is fond of scuba diving, writing, or photographing with his digital camera. He would like to have a go at storytelling and roleplaying again as soon as he finds some more spare time on his backup devices.