Linux-based SMB NAS System Performance Optimization
SMB NAS systems range from the low end, targeting small offices (SOHO) that might have 2 to 10 servers and require 1 TB to 10 TB of storage, to the high end, targeting the commercial SMB (Small to Medium Business) market. Customers demand a rich feature set at a lower cost, given how cheap disk storage has become compared to what was available 5 years ago. One of the key factors in the unique selling proposition of a NAS (Network Attached Storage) system in the SOHO market is its network and storage performance. Because pricing is extremely competitive, many NAS vendors focus on reducing manufacturing costs, so one of the challenges is coping with scarce system hardware resources without compromising performance. This article describes the optimizations (code optimizations, parameter tuning, etc.) required to meet the performance requirements of a resource-constrained Linux-based SMB NAS system.
A typical SMB NAS hardware platform contains the following blocks:
1. 500MHz RISC processor
2. 256 MB RAM
3. Integrated Gigabit Ethernet controller
4. PCI 2.0
5. PCI-to-Serial ATA Host Controller
We broadly target the network and storage paths in the kernel to improve performance. For the network path, we look at device-level, driver-level and stack-level optimization in sequence, in decreasing order of priority. Before starting the optimization, the first thing to measure is the current performance for CIFS, NFS and FTP accesses. Tools such as IOzone, NetBench or bonnie++ can be used to characterize the NAS box for network and storage performance. System-level bottlenecks can be identified using tools like Sysstat, LTTng or SystemTap, which use less intrusive instrumentation techniques. OProfile is another tool that helps in understanding the functional bottlenecks.
1. Networking Optimization
1.1. Device level optimization
Many hardware manufacturers typically add a lot of bells and whistles in their hardware for accelerating specific functions. The key ones that have an impact on the NAS system performance are described below.
1.1.1 Interrupt Coalescence
Most gigabit Ethernet devices support interrupt coalescence, which improves performance when combined with NAPI (see the driver level optimization section below). Choosing the coalescence value for the best performance is a purely iterative job and depends heavily on the target application. For systems that face a heavy network load of large packets (specifically jumbo frames), run CPU-intensive applications, or have a lower-frequency CPU, this value plays a significant role in reducing how often the CPU is interrupted. Even though there is no rule of thumb for choosing the coalescence value, the following points help in reducing the number of interrupts:
- For systems with 256-512 MB of memory, it helps to choose a smaller coalescence value, combined with a smaller budget value (the budget parameter specifies how many packets the driver is allowed to pass to the network stack in one poll() call). Note that the coalescence value also depends on the type of interface, viz. Gigabit or Fast Ethernet. For a Gigabit Ethernet device and 256 MB of memory, an interrupt coalescence timeout value of 64 is helpful.
- The driver budget parameter decides the buffering capacity. It needs to be set with the coalescence value in mind, and can also be kept at 64 for the above configuration.
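To get a feel for why the budget matters, the following rough sketch estimates how many frames per second arrive at gigabit line rate and how many poll events that implies when each event handles a full budget of frames. The 38 bytes of per-frame wire overhead and the function names are assumptions made for this illustration; real traffic rarely fills every poll, so the second figure is a best case.

```shell
#!/bin/sh
# Rough estimate of how often the CPU is disturbed at gigabit line rate.
# Assumption: each frame carries 38 bytes of on-wire overhead
# (preamble 8 + Ethernet header 14 + FCS 4 + inter-frame gap 12).

# Frames per second at 1 Gbit/s for a given payload size in bytes.
line_rate_pps() {
    payload=$1
    echo $(( 1000000000 / ((payload + 38) * 8) ))
}

# Poll (or interrupt) events per second if every event handles
# exactly "budget" frames (best case).
events_per_sec() {
    pps=$1
    budget=$2
    echo $(( pps / budget ))
}

pps=$(line_rate_pps 1500)
echo "frames/s at line rate:   $pps"
echo "events/s with budget 1:  $(events_per_sec "$pps" 1)"
echo "events/s with budget 64: $(events_per_sec "$pps" 64)"
```

With a budget of 64, the number of events drops by a factor of 64 in this idealized model, which is the effect that coalescence combined with NAPI aims for.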
1.1.2 MAC Filters
Enabling MAC filters in the device relieves the CPU from processing unwanted packets, because they are dropped at the device level itself.
1.1.3 Hardware Checksum
Most devices support hardware checksum calculation. Marking received packets as CHECKSUM_HW in the driver spares the CPU the checksum computation. Note that in the Linux kernel CHECKSUM_HW describes the protocol-level checksum that the device has already computed; the Ethernet frame CRC is checked by the MAC hardware in any case.
1.1.4 DMA support
The DMA controller tracks the buffer address and ownership bits for the transmit and receive descriptors. Based on the availability of buffers in memory, the DMA controller transfers packets to and from system memory. This is the standard behavior of any Ethernet device IP. Performance can be improved by moving the descriptors to on-chip SRAM, if the processor has any. As the transmit and receive descriptors are accessed most frequently, moving them to SRAM gives faster access to the buffers.
1.2 Driver level optimization
The scope of this optimization is vast and depends on the application requirement. The treatment differs if the goal is to optimize only packet forwarding performance rather than TCP-level performance. It also varies with the system configuration and resources, viz. other important applications that use system resources, DMA channels, system memory, and whether the Ethernet device is PCI based or integrated directly into the System-on-Chip (SoC) through the system bus. In the DUT (device under test), the Ethernet device was connected directly to the system bus through the bridge, so we will not focus on PCI-based driver optimization for now, although to some extent the same ideas apply to PCI devices as well.
Most NAS systems come under heavy stress when they have been up and running for several days with a lot of data transfer going on throughout. Specifically, on Linux-based NAS systems the free buffer count keeps shrinking over that period. The pdflush algorithm in the kernel is not optimized for this specific use case, and the cache flush mechanism is left to the individual filesystems, so the outcome depends on how efficiently the flushing algorithm is implemented in the filesystem. It also depends on how efficient the underlying hardware is, viz. the SATA controller, the disks used, etc.
The following factors need to be considered when tweaking the Ethernet device driver:
1.2.1 Transmit interrupts
We can mask off transmit interrupts and avoid interrupt jitter occurring because of transmission of the packets. If the transmit gets blocked, there is not much performance impact on the overall network process as such, and what matters is only the receive interrupts.
1.2.2 Error interrupts
Disable the error interrupts in the device unless you really want to take some action based on the error that occurred, or pass the status up to the application level to let it know that there is a problem in the device.
1.2.3 Memory alignment for descriptors
Even though padding adds some gaps to the descriptors, aligning them to the cache line size and to the bus width always makes access to them more efficient.
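The cost of that padding is easy to estimate. The sketch below rounds a descriptor size up to the next multiple of the cache line size and shows the resulting ring size. The 24-byte descriptor, the 32-byte cache line and the 256-entry ring are hypothetical figures chosen for illustration, not values from the DUT.

```shell
#!/bin/sh
# Round "size" up to the next multiple of "align" (align must be > 0).
align_up() {
    size=$1
    align=$2
    echo $(( (size + align - 1) / align * align ))
}

desc_size=24    # hypothetical descriptor size in bytes
cache_line=32   # hypothetical cache line size in bytes
ring_len=256    # hypothetical number of descriptors in the ring

padded=$(align_up "$desc_size" "$cache_line")
echo "descriptor of $desc_size bytes is padded to $padded bytes"
echo "padding per descriptor: $(( padded - desc_size )) bytes"
echo "ring of $ring_len descriptors occupies $(( padded * ring_len )) bytes"
```

In this example the padding costs 8 bytes per descriptor, in exchange for each descriptor sitting in exactly one cache line.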
1.2.4 Memory allocation for the packets
There is a trade-off in how memory is allocated in the receive path: either pre-allocate all the buffers (preferably a fixed number of fixed-size buffers) and reuse them, or allocate buffers dynamically as needed. Each approach has its own advantages and disadvantages. Pre-allocation consumes memory that other applications in the system can no longer use, and because the exact size of an incoming packet is not known in advance, it may lead to internal fragmentation of the buffers. In return, it significantly reduces the allocation and deallocation of buffers in the receive path. Dynamic allocation saves a significant amount of memory in a memory-constrained system, which lets other applications run freely even when the system is exposed to heavy IO stress. The penalty is the time required to allocate and free the buffers at runtime. Moreover, because we rely on the kernel memory allocation / deallocation routines, we may end up starving or looping in the receive context while waiting for free buffers, and if the required memory is not available, page allocation failures may result.
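The memory side of this trade-off can be put into numbers. The sketch below computes how much memory a pre-allocated pool holds permanently and how much of one fixed-size buffer is wasted by a small packet. The pool of 256 buffers of 2048 bytes and the 64-byte packet are hypothetical values used only to illustrate the arithmetic.

```shell
#!/bin/sh
# Memory held permanently by a pre-allocated receive pool.
pool_bytes() {      # args: buffer_count buffer_size
    echo $(( $1 * $2 ))
}

# Bytes left unused inside one fixed-size buffer by a smaller packet
# (internal fragmentation).
wasted_bytes() {    # args: buffer_size packet_size
    echo $(( $1 - $2 ))
}

echo "pool of 256 x 2048-byte buffers: $(pool_bytes 256 2048) bytes"
echo "64-byte packet in a 2048-byte buffer wastes $(wasted_bytes 2048 64) bytes"
```

On a 256 MB system a pool of this size is small, but the same calculation shows how quickly larger rings or jumbo-frame buffers add up.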
1.2.5 SKB Recycling
In standard Ethernet drivers, socket buffers (skbs) are allocated and freed by the driver as packets arrive. With SKB recycling, the socket buffers are pre-allocated, handed to the driver on request, and put back on the recycle queue when they are freed in the kfree implementation.
1.2.6 Cache coherency
Cache coherency is usually left out of most hardware to reduce the BOM (bill of materials). Where it is supported, it improves performance considerably, as software does not need to invalidate and flush cache lines. The lack of coherency hurts most under heavy IO stress, where the impact on network performance propagates to the storage stack as well.
1.2.7 Prefetching
If the underlying CPU supports prefetching, it can dramatically improve system performance, especially under heavy stress, by avoiding cache misses. Note that the cache needs to be invalidated if coherency is not supported in the hardware, else it will greatly hamper the performance.
1.2.8 RCU locks
Spin locks / IRQ locks are expensive compared to lightweight RCU locks. The lock/unlock mechanism of RCU is much lighter than that of spin locks, particularly on read-mostly data paths, and helps a lot in improving performance.
1.3 Stack level optimization
TCP/IP stack parameters are defined considering all the supported protocols in the network. Based on the priority of the path followed, we can choose the parameters as per our need. For example, NAS uses CIFS, NFS protocols primarily for the data transfer.
1.3.1 TCP buffer sizes
We could choose the following parameters:
net.core.rmem_max = 64K
net.core.rmem_default = 109568
While these settings provide ample space for TCP packets to queue up in memory and yield better performance, they eat up system memory, so the memory that other applications need to run smoothly must be considered. With low system memory, the following options can be tried out to check for a performance improvement. Again, these changes are highly application dependent and will not necessarily yield the performance improvement observed in our DUT.
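The receive buffer settings from section 1.3.1 could be kept in a small sysctl fragment, roughly as sketched below. The function name is hypothetical, the values are the ones given in the text (with 64K written out as 65536 bytes), and the function only prints the fragment without changing the running system.

```shell
#!/bin/sh
# Print a sysctl.conf-style fragment with the receive buffer limits.
# Nothing is applied to the running system.
tcp_buffer_conf() {
    cat <<'EOF'
# receive buffer limits, values taken from the text (64K = 65536 bytes)
net.core.rmem_max = 65536
net.core.rmem_default = 109568
EOF
}

tcp_buffer_conf
```

The output can be redirected to a file and loaded with "sysctl -p" followed by the file name, after checking that the values suit the memory available on the target.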
1.3.2 Virtual Memory (VM) parameters
This parameter decides when pdflush kicks in to flush the kernel cache out to disk. Setting this value high will spawn the pdflush threads frequently to flush the data out to the disks, which helps free buffers for the rest of the applications when the system is exposed to heavy stress. The flip side is that this may cause the system to thrash, consuming CPU cycles to free buffers more frequently and hence dropping packets.
1.3.2.2 pdflush threads
The default number of threads in the system is 2. By increasing the number to 4, we can achieve better performance for NFS. This will help to flush the data promptly and not get queued up for long, thus making space available for free buffers for other applications.
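Before changing anything, it helps to record the current writeback settings. The text does not name the exact parameter it tunes, so the list below is an assumption: it covers the standard writeback knobs under /proc/sys/vm, plus nr_pdflush_threads, which exists only on older 2.6 kernels that still use pdflush. The sketch only reads values and reports any entry that is missing.

```shell
#!/bin/sh
# Print the current value of each writeback-related VM knob, if present.
# The choice of knobs is an assumption; the article does not name them.
show_vm_writeback() {
    for name in dirty_background_ratio dirty_ratio \
                dirty_writeback_centisecs dirty_expire_centisecs \
                nr_pdflush_threads; do
        file="/proc/sys/vm/$name"
        if [ -r "$file" ]; then
            echo "vm.$name = $(cat "$file")"
        else
            echo "vm.$name = (not available on this kernel)"
        fi
    done
}

show_vm_writeback
```

Saving this output before and after each tuning step makes it easier to relate a change in benchmark results to a specific setting.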
2. Storage Optimization
Storage optimization depends heavily on the filesystem used and on the characteristics of the storage hardware path, i.e. the type of interface (an integrated SATA controller or one attached over PCI) and whether RAID is handled by a hardware or a software RAID controller. A hardware RAID controller increases the BOM dramatically, so SOHO solutions use the software RAID manager "mdadm". In addition to hardware support for RAID functionality, performance depends on the software IO path, which includes the block device layer, the device driver and the filesystem. A NAS usually runs a journaling filesystem such as ext3 or XFS. XFS is more commonly used for sequential writes or reads because of its extent-based architecture, whereas ext3 is well suited to sector-wise reads or writes that work on 512 bytes rather than big chunks. We used XFS in our system. The flip side of XFS is fragmentation, which is discussed in the following sections. The following areas were tweaked for performance improvements in the storage path:
2.1 Software RAID manager
There are two crucial parameters which can decide the performance of the RAID manager.
2.1.1. Chunk size
We should set the chunk size to 64K, which helps in mirroring (redundancy) applications, viz. RAID1 or RAID10.
2.1.2. Stripe Cache size
The stripe cache size decides the stripe size per disk. It should be chosen judiciously, as it determines how much data is written to each disk of the RAID array. With a journaled filesystem the data is written twice, which bogs down the IO if the chunk and stripe size values are high.
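The memory cost of the stripe cache deserves a quick check on a small system. According to the kernel's md documentation, the memory consumed is the page size multiplied by the number of disks and by stripe_cache_size. The sketch below applies that formula; the 4 KB page size and the 4-disk array are assumptions, and 8192 is the stripe_cache_size value used later in this article.

```shell
#!/bin/sh
# Memory used by the md stripe cache:
#   page_size * nr_disks * stripe_cache_size   (from the kernel md docs)
stripe_cache_bytes() {   # args: page_size nr_disks stripe_cache_size
    echo $(( $1 * $2 * $3 ))
}

bytes=$(stripe_cache_bytes 4096 4 8192)   # 4 KB pages and 4 disks are assumed
echo "stripe cache: $bytes bytes ($(( bytes / 1048576 )) MiB)"
```

Under these assumptions the cache takes 128 MiB, which is half the RAM of a 256 MB platform, so the value has to be weighed against the memory needs of the rest of the system.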
2.2 Memory fragmentation
Due to its architecture, XFS requires memory in multiples of extents, which leads to severe internal fragmentation if the blocks are small. XFS requests memory in multiples of 4K, and if the buddy allocator does not have enough room in its order-2 or order-3 free lists, the system may slow down until other applications release memory and the kernel can merge it back in the buddy allocator. This is seen specifically under heavy stress, after the system has been running for a couple of days. It can be improved by tweaking xfssyncd_centisecs (see below) to flush stale data to disk at a higher frequency.
2.3 External journal
The journal (or log) plays an important role in XFS performance. A journaled filesystem does not write data directly to the destination sectors; it first keeps a copy in the log and writes to the sectors later. To achieve better performance, XFS writes the log at the center of the disk, whereas the data is stored in the tracks at the outer periphery of the disk. This reduces the movement of the disk's actuator arm, thereby increasing performance. Removing the need to keep the log on the same disk as the actual data increases performance dramatically. A 64M NAND flash device to store the log would certainly help reduce these arm movements.
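An external log is chosen when the filesystem is created and has to be named again at mount time. The sketch below only prints the two commands involved, using the logdev and size suboptions of mkfs.xfs and the logdev mount option. The function name, the device names /dev/md0 and /dev/sdb1, and the mount point /mnt/nas are placeholders, not the layout of the DUT.

```shell
#!/bin/sh
# Print (do not run) the commands for an XFS filesystem with an external log.
xfs_external_log_cmds() {   # args: data_dev log_dev mount_point
    data_dev=$1
    log_dev=$2
    mnt=$3
    # create the filesystem with a 64 MB log on the separate device
    echo "mkfs.xfs -l logdev=$log_dev,size=64m $data_dev"
    # the log device must be given again on every mount
    echo "mount -t xfs -o logdev=$log_dev,noatime,nodiratime $data_dev $mnt"
}

xfs_external_log_cmds /dev/md0 /dev/sdb1 /mnt/nas
```

Printing the commands first makes it possible to review the device names before anything destructive is run, since mkfs.xfs erases the data device.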
2.4 Filesystem optimization
2.4.1. Mount options
While mounting the filesystem, we can choose the mount options logsize=64M,noatime,nodiratime. Disabling access-time updates for files and directories relieves the CPU from continuously updating the inodes of files and directories.
Besides the mount options, three XFS flush parameters were tuned:
- xfssyncd_centisecs: the frequency at which XFS flushes the data to disk. It flushes the log activity and cleans up the unlinked inodes. Setting this parameter to 500 increases the frequency of flushing the data. This is required if the hardware performance is limited, and it helps a lot when the system is exposed to constant writes.
- The frequency at which XFS cleans up the metadata. Keeping it low leads to frequent cleanup of the metadata for extents which are marked as dirty.
- The frequency at which xfsbufd flushes the dirty metadata to the disk. Setting it to 500 leads to frequent cleanup of the dirty metadata.
These three parameters lead to a significant performance improvement when the log is kept on a separate disk or flash, compared to keeping it on the same media as the data.
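The XFS flush intervals discussed in this section are exposed as sysctl entries on 2.6 kernels. The fragment sketched below uses fs.xfs.xfssyncd_centisecs for the sync interval and assumes that fs.xfs.xfsbufd_centisecs is the entry behind the xfsbufd interval; that mapping, like the function name, is an assumption, since the parameter names are not spelled out above. The metadata cleanup interval is omitted because no value is given for it. The function only prints the fragment.

```shell
#!/bin/sh
# Print a sysctl.conf-style fragment for the XFS flush intervals.
# Nothing is written to the running system.
xfs_flush_conf() {
    cat <<'EOF'
# interval of the XFS sync daemon, in centiseconds (value from the text)
fs.xfs.xfssyncd_centisecs = 500
# interval of xfsbufd, in centiseconds (value from the text;
# the sysctl name is an assumption)
fs.xfs.xfsbufd_centisecs = 500
EOF
}

xfs_flush_conf
```

Before loading such a fragment, it is worth checking which entries the running kernel actually provides under /proc/sys/fs/xfs/, as the set has changed between kernel versions.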
2.5 Random disk IO
Simultaneous writes to non-sequential sectors lead to a lot of actuator arm movement, which inherently drops performance and reduces the overall throughput of the system. Although there is very little we can do to avoid random disk IO, what we can control is fragmentation.
Any journaled filesystem will have this issue when the system runs over a long period of time or is exposed to heavy stress. XFS provides a utility named xfs_fsr which can be run periodically to reduce fragmentation on the disk, although it works only on the dirty extents which are currently not in use. In fact, random disk IO and fragmentation go hand in hand: the more fragmentation there is, the more random disk IO there will be and the lower the throughput. It is of utmost importance to keep fragmentation under control to reduce random IO.
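One way to keep fragmentation in check is to measure it and defragment only above a threshold. The "frag" command of xfs_db reports an actual and an ideal extent count, and the fragmentation factor is the difference between them relative to the actual count. The sketch below reproduces that calculation with integer arithmetic and prints an xfs_fsr command when a threshold is exceeded. The function names, extent counts, threshold and device name are hypothetical.

```shell
#!/bin/sh
# Fragmentation percentage from the "actual" and "ideal" extent counts
# reported by: xfs_db -r -c frag <device>
frag_percent() {    # args: actual ideal
    actual=$1
    ideal=$2
    if [ "$actual" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( (actual - ideal) * 100 / actual ))
}

# Print (do not run) a defragmentation command when the threshold is reached.
maybe_defrag_cmd() {    # args: actual ideal threshold device
    pct=$(frag_percent "$1" "$2")
    if [ "$pct" -ge "$3" ]; then
        echo "xfs_fsr -v $4"
    else
        echo "# fragmentation ${pct}% is below ${3}%, nothing to do"
    fi
}

maybe_defrag_cmd 1000 10 30 /dev/md0    # heavily fragmented example
maybe_defrag_cmd 100 95 30 /dev/md0     # lightly fragmented example
```

A check like this could be run from cron during idle hours, so that xfs_fsr adds its own disk load only when it is likely to pay off.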
It is easy to see that when the system is under heavy stress, the kernel spends more time on packet processing than user space does along the end-to-end path, i.e. from Ethernet device to hard disk.
After implementing the above features, we captured the performance figures on the same DUT.
For RAID creation, use '--chunk=1024'; for RAID5, use
echo 8192 > /sys/block/md0/md/stripe_cache_size
2.7.1 Around 37% degradation in performance was observed after 99% fragmentation of the filesystem, as compared to when the filesystem was 5% fragmented. Note that these figures were captured using the bonnie++ tool.
2.7.2 Having an external log gives around 28% improvement in performance compared to having the log in the same disk where the actual data is stored, on a system that was 99% fragmented.
2.7.3 Running the defragmentation utility xfs_fsr on a system that is 99% fragmented gave an additional 37% improvement in performance.
2.7.4 CIFS write performance for a 1G file on a system that was exposed to 7 days of continuous writes, increased by 133% after implementing the optimizations. Note that this is end to end performance that includes network and storage path.
2.7.5 The NFS write performance was measured using IOzone. NFS write performance for a 1G file on a system that was exposed to 7 days of continuous writes increased by 180% after implementing the optimizations. Note again that this is the end-to-end performance that includes the network and storage paths.
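To read these percentages correctly, note that an increase of 133% means the new figure is 2.33 times the old one, not 1.33 times. The sketch below applies the reported percentages to a normalized baseline of 100. The baseline is only a unit for comparison and not a measured throughput, and the function name is hypothetical.

```shell
#!/bin/sh
# Scale a baseline by a percentage change (positive = gain, negative = loss).
apply_change() {    # args: baseline percent
    echo $(( $1 * (100 + $2) / 100 ))
}

echo "CIFS write, +133%:           100 -> $(apply_change 100 133)"
echo "NFS write, +180%:            100 -> $(apply_change 100 180)"
echo "99% fragmented, -37%:        100 -> $(apply_change 100 -37)"
```

Expressed this way, the CIFS and NFS write results correspond to roughly 2.3 and 2.8 times the original throughput.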
While Linux-based commercial SMB NAS systems have the luxury of throwing more powerful hardware at their performance requirements, SOHO NAS systems have to make do with resource-constrained hardware platforms in order to meet the system cost requirements. Although Linux is a general-purpose OS that carries a lot of code to run a wide variety of applications, and is not necessarily purpose built for an embedded NAS server application, it is possible to configure and tune the OS to obtain good performance for this application. Linux has a lot of knobs that can be tweaked to improve system performance for a specific application. The challenge is in figuring out the right set of knobs to tweak. Using various profiling and benchmarking tools, the performance hotspots of a NAS server application, and the optimizations that address them, have been identified and documented.
I got introduced to Linux through the very well-known Tanenbaum vs. Torvalds debate. Till that time, all I'd heard was that Linux was just another Unix-like operating system. But looking at the immense confidence of a young college graduate who was arguing with the well-known professor, Minix writer, and network operating systems specialist got me interested in learning more about it. From that day forward, I couldn't leave Linux alone - and won't leave it in the future either. I started by understanding the internals, and the more I delved into it, the more gems I discovered in this sea of knowledge. Open Source development really motivates you to learn and help others learn!
Currently, I'm actively involved in working with the 2.6 kernel and specifically the MIPS architecture. I wanted to contribute my 2 cents to the community - whatever I have gained from understanding and working with this platform. I am hoping that my articles will be worthwhile for the readers of LG.
Raj Palani works as a Senior Manager, Software Engineering at EMC. He has been designing and developing embedded software since 1993. His involvement with Linux development spans more than a decade.