HOWTO: Multi Disk System Tuning: Technologies

6. Technologies

In order to decide how to get the most of your devices you need to know what technologies are available and their implications. As always there can be some tradeoffs with respect to speed, reliability, power, flexibility, ease of use and complexity.

Many of the techniques described below can be stacked in a number of ways to maximise performance and reliability, though at the cost of added complexity.

6.1 RAID

This is a method of increasing reliability, speed or both by using multiple disks in parallel thereby decreasing access time and increasing transfer speed. A checksum or mirroring system can be used to increase reliability. Large servers can take advantage of such a setup but it might be overkill for a single user system unless you already have a large number of disks available. See other documents and FAQs for more information.

For Linux one can set up a RAID system using either software (the md module in the kernel), a Linux compatible controller card (PCI-to-SCSI) or a SCSI-to-SCSI controller. Check the documentation for what controllers can be used. A hardware solution is usually faster, and perhaps also safer, but comes at a significant cost.

A summary of available hardware RAID solutions for Linux is available at Linux Consulting.

SCSI-to-SCSI

SCSI-to-SCSI controllers are usually implemented as complete cabinets with drives and a controller that connects to the computer with a second SCSI bus. This makes the entire cabinet of drives look like a single large, fast SCSI drive and requires no special RAID driver. The disadvantage is that the SCSI bus connecting the cabinet to the computer becomes a bottleneck.

A significant disadvantage for people with large disk farms is that there is a limit to how many SCSI entries there can be in the /dev directory. In these cases using SCSI-to-SCSI will conserve entries.

Usually they are configured via the front panel or with a terminal connected to their on-board serial interface.

Some manufacturers of such systems are CMD and Syred whose web pages describe several systems.

PCI-to-SCSI

PCI-to-SCSI controllers are, as the name suggests, connected to the high speed PCI bus and is therefore not suffering from the same bottleneck as the SCSI-to-SCSI controllers. These controllers require special drivers but you also get the means of controlling the RAID configuration over the network which simplifies management.

Currently only a few families of PCI-to-SCSI host adapters are supported under Linux.

DPT

The oldest and most mature is a range of controllers from DPT including SmartCache I/III/IV and SmartRAID I/III/IV controller families. These controllers are supported by the EATA-DMA driver in the standard kernel. This company also has an informative home page which also describes various general aspects of RAID and SCSI in addition to the product related information.

More information from the author of the DPT controller drivers (EATA* drivers) can be found at his pages on SCSI and DPT.

These are not the fastest but have a good track record of proven reliability.

Note that the maintenance tools for DPT controllers currently run under DOS/Win only so you will need a small DOS/Win partition for some of the software. This also means you have to boot the system into Windows in order to maintain your RAID system.

ICP-Vortex

A very recent addition is a range of controllers from ICP-Vortex featuring up to 5 independent channels and very fast hardware based on the i960 chip. The Linux driver was written by the company itself which shows they support Linux.

As ICP-Vortex supplies the maintenance software for Linux it is not necessary with a reboot to other operating systems for the setup and maintenance of your RAID system. This saves you also extra downtime.

Mylex DAC-960

This is one of the latest entries which is out in early beta. More information as well as drivers are available at Dandelion Digital's Linux DAC960 Page.

Compaq Smart-2 PCI Disk Array Controllers

Another very recent entry and currently in beta release is the Smart-2 driver.

IBM ServeRAID

IBM has released their driver as GPL.

Software RAID

A number of operating systems offer software RAID using ordinary disks and controllers. Cost is low and performance for raw disk IO can be very high. As this can be very CPU intensive it increases the load noticeably so if the machine is CPU bound in performance rather then IO bound you might be better off with a hardware PCI-to-RAID controller.

Real cost, performance and especially reliability of software vs. hardware RAID is a very controversial topic. Reliability on Linux systems have been very good so far.

The current software RAID project on Linux is the md system (multiple devices) which offers much more than RAID so it is described in more details later.

RAID Levels

RAID comes in many levels and flavours which I will give a brief overview of this here. Much has been written about it and the interested reader is recommended to read more about this in the Software RAID HOWTO.

RAID 0 is not redundant at all but offers the best throughput of all levels here. Data is striped across a number of drives so read and write operations take place in parallel across all drives. On the other hand if a single drive fail then everything is lost. Did I mention backups?
RAID 1 is the most primitive method of obtaining redundancy by duplicating data across all drives. Naturally this is massively wasteful but you get one substantial advantage which is fast access. The drive that access the data first wins. Transfers are not any faster than for a single drive, even though you might get some faster read transfers by using one track reading per drive. Also if you have only 2 drives this is the only method of achieving redundancy.
RAID 2 and 4 are not so common and are not covered here.
RAID 3 uses a number of disks (at least 2) to store data in a striped RAID 0 fashion. It also uses an additional redundancy disk to store the XOR sum of the data from the data disks. Should the redundancy disk fail, the system can continue to operate as if nothing happened. Should any single data disk fail the system can compute the data on this disk from the information on the redundancy disk and all remaining disks. Any double fault will bring the whole RAID set off-line. RAID 3 makes sense only with at least 2 data disks (3 disks including the redundancy disk). Theoretically there is no limit for the number of disks in the set, but the probability of a fault increases with the number of disks in the RAID set. Usually the upper limit is 5 to 7 disks in a single RAID set. Since RAID 3 stores all redundancy information on a dedicated disk and since this information has to be updated whenever a write to any data disk occurs, the overall write speed of a RAID 3 set is limited by the write speed of the redundancy disk. This, too, is a limit for the number of disks in a RAID set. The overall read speed of a RAID 3 set with all data disks up and running is that of a RAID 0 set with that number of data disks. If the set has to reconstruct data stored on a failed disk from redundant information, the performance will be severely limited: All disks in the set have to be read and XOR-ed to compute the missing information.
RAID 5 is just like RAID 3, but the redundancy information is spread on all disks of the RAID set. This improves write performance, because load is distributed more evenly between all available disks.

There are also hybrids available based on RAID 0 or 1 and one other level. Many combinations are possible but I have only seen a few referred to. These are more complex than the above mentioned RAID levels.

RAID 0/1 combines striping with duplication which gives very high transfers combined with fast seeks as well as redundancy. The disadvantage is high disk consumption as well as the above mentioned complexity.

RAID 1/5 combines the speed and redundancy benefits of RAID5 with the fast seek of RAID1. Redundancy is improved compared to RAID 0/1 but disk consumption is still substantial. Implementing such a system would involve typically more than 6 drives, perhaps even several controllers or SCSI channels.

6.2 Volume Management

Volume management is a way of overcoming the constraints of fixed sized partitions and disks while still having a control of where various parts of file space resides. With such a system you can add new disks to your system and add space from this drive to parts of the file space where needed, as well as migrating data out from a disk developing faults to other drives before catastrophic failure occurs.

The system developed by Veritas has become the defacto standard for logical volume management.

Volume management is for the time being an area where Linux is lacking.

One is the virtual partition system project VPS that will reimplement many of the volume management functions found in IBM's AIX system. Unfortunately this project is currently on hold.

Another project is the Logical Volume Manager project that is similar to a project by HP.

6.3 Linux `md` Kernel Patch

The Linux Multi Disk (md) provides a number of block level features in various stages of development.

RAID 0 (striping) and concatenation are very solid and in production quality and also RAID 4 and 5 are quite mature.

It is also possible to stack some levels, for instance mirroring (RAID 1) two pairs of drives, each pair set up as striped disks (RAID 0), which offers the speed of RAID 0 combined with the reliability of RAID 1.

In addition to RAID this system offers (in alpha stage) block level volume management and soon also translucent file space. Since this is done on the block level it can be used in combination with any file system, even for fat using Wine.

Think very carefully what drives you combine so you can operate all drives in parallel, which gives you better performance and less wear. Read more about this in the documentation that comes with md.

Unfortunately The Linux software RAID has split into two trees, the old stable versions 0.35 and 0.42 which are documented in the official Software-RAID HOWTO and the newer less stable 0.90 series which is documented in the unofficial Software RAID HOWTO which is a work in progress.

A patch for online growth of ext2fs is available in early stages and related work is taking place at the ext2fs resize project at Sourceforge.

Hint: if you cannot get it to work properly you have forgotten to set the persistent-block flag. Your best documentation is currently the source code.

6.4 Compression

Disk compression versus file compression is a hotly debated topic especially regarding the added danger of file corruption. Nevertheless there are several options available for the adventurous administrators. These take on many forms, from kernel modules and patches to extra libraries but note that most suffer various forms of limitations such as being read-only. As development takes place at neck breaking speed the specs have undoubtedly changed by the time you read this. As always: check the latest updates yourself. Here only a few references are given.

DouBle features file compression with some limitations.
Zlibc adds transparent on-the-fly decompression of files as they load.
there are many modules available for reading compressed files or partitions that are native to various other operating systems though currently most of these are read-only.
dmsdos (currently in version 0.9.2.0) offer many of the compression options available for DOS and Windows. It is not yet complete but work is ongoing and new features added regularly.
e2compr is a package that extends ext2fs with compression capabilities. It is still under testing and will therefore mainly be of interest for kernel hackers but should soon gain stability for wider use. Check the http://e2compr.memalpha.cx/e2compr/ name="e2compr homepage"> for more information. I have reports of speed and good stability which is why it is mentioned here.

6.5 ACL

Access Control List (ACL) offers finer control over file access on a user by user basis, rather than the traditional owner, group and others, as seen in directory listings (drwxr-xr-x). This is currently not available in Linux but is expected in kernel 2.3 as hooks are already in place in ext2fs.

6.6 `cachefs`

This uses part of a hard disk to cache slower media such as CD-ROM. It is available under SunOS but not yet for Linux.

6.7 Translucent or Inheriting File Systems

This is a copy-on-write system where writes go to a different system than the original source while making it look like an ordinary file space. Thus the file space inherits the original data and the translucent write back buffer can be private to each user.

There is a number of applications:

updating a live file system on CD-ROM, making it flexible, fast while also conserving space,
original skeleton files for each new user, saving space since the original data is kept in a single space and shared out,
parallel project development prototyping where every user can seemingly modify the system globally while not affecting other users.

SunOS offers this feature and this is under development for Linux. There was an old project called the Inheriting File Systems (ifs) but this project has stopped. One current project is part of the md system and offers block level translucence so it can be applied to any file system.

Sun has an informative page on translucent file system.

It should be noted that Clearcase (now owned by Rational) pioneered and popularized translucent filesystems for software configuration management by writing their own UNIX filesystem.

6.8 Physical Track Positioning

This trick used to be very important when drives were slow and small, and some file systems used to take the varying characteristics into account when placing files. Although higher overall speed, on board drive and controller caches and intelligence has reduced the effect of this.

Nevertheless there is still a little to be gained even today. As we know, "world dominance" is soon within reach but to achieve this "fast" we need to employ all the tricks we can use .

To understand the strategy we need to recall this near ancient piece of knowledge and the properties of the various track locations. This is based on the fact that transfer speeds generally increase for tracks further away from the spindle, as well as the fact that it is faster to seek to or from the central tracks than to or from the inner or outer tracks.

Most drives use disks running at constant angular velocity but use (fairly) constant data density across all tracks. This means that you will get much higher transfer rates on the outer tracks than on the inner tracks; a characteristics which fits the requirements for large libraries well.

Newer disks use a logical geometry mapping which differs from the actual physical mapping which is transparently mapped by the drive itself. This makes the estimation of the "middle" tracks a little harder.

In most cases track 0 is at the outermost track and this is the general assumption most people use. Still, it should be kept in mind that there are no guarantees this is so.

Inner

tracks are usually slow in transfer, and lying at one end of the seeking position it is also slow to seek to.

This is more suitable to the low end directories such as DOS, root and print spools.

Middle

tracks are on average faster with respect to transfers than inner tracks and being in the middle also on average faster to seek to.

This characteristics is ideal for the most demanding parts such as swap, /tmp and /var/tmp.

Outer

tracks have on average even faster transfer characteristics but like the inner tracks are at the end of the seek so statistically it is equally slow to seek to as the inner tracks.

Large files such as libraries would benefit from a place here.

Hence seek time reduction can be achieved by positioning frequently accessed tracks in the middle so that the average seek distance and therefore the seek time is short. This can be done either by using fdisk or cfdisk to make a partition on the middle tracks or by first making a file (using dd) equal to half the size of the entire disk before creating the files that are frequently accessed, after which the dummy file can be deleted. Both cases assume starting from an empty disk.

The latter trick is suitable for news spools where the empty directory structure can be placed in the middle before putting in the data files. This also helps reducing fragmentation a little.

This little trick can be used both on ordinary drives as well as RAID systems. In the latter case the calculation for centring the tracks will be different, if possible. Consult the latest RAID manual.

The speed difference this makes depends on the drives, but a 50 percent improvement is a typical value.

Disk Speed Values

The same mechanical head disk assembly (HDA) is often available with a number of interfaces (IDE, SCSI etc) and the mechanical parameters are therefore often comparable. The mechanics is today often the limiting factor but development is improving things steadily. There are two main parameters, usually quoted in milliseconds (ms):

Head movement - the speed at which the read-write head is able to move from one track to the next, called access time. If you do the mathematics and doubly integrate the seek first across all possible starting tracks and then across all possible target tracks you will find that this is equivalent of a stroke across a third of all tracks.
Rotational speed - which determines the time taken to get to the right sector, called latency.

After voice coils replaced stepper motors for the head movement the improvements seem to have levelled off and more energy is now spent (literally) at improving rotational speed. This has the secondary benefit of also improving transfer rates.

Some typical values:



                         Drive type


Access time (ms)        | Fast  Typical   Old
---------------------------------------------
Track-to-track             <1       2       8
Average seek               10      15      30
End-to-end                 10      30      70

This shows that the very high end drives offer only marginally better access times then the average drives but that the old drives based on stepper motors are significantly worse.



Rotational speed (RPM)  |  3600 | 4500 | 4800 | 5400 | 7200 | 10000           
-------------------------------------------------------------------
Latency          (ms)   |    17 |   13 | 12.5 | 11.1 |  8.3 |   6.0

As latency is the average time taken to reach a given sector, the formula is quite simply


latency (ms) = 60000 / speed (RPM)

Clearly this too is an example of diminishing returns for the efforts put into development. However, what really takes off here is the power consumption, heat and noise.

6.9 Yoke

There is also a Linux Yoke Driver available in beta which is intended to do hot-swappable transparent binding of one Linux block device to another. This means that if you bind two block devices together, say /dev/hda and /dev/loop0, writing to one device will mean also writing to the other and reading from either will yield the same result.

6.10 Stacking

One of the advantages of a layered design of an operating system is that you have the flexibility to put the pieces together in a number of ways. For instance you can cache a CD-ROM with cachefs that is a volume striped over 2 drives. This in turn can be set up translucently with a volume that is NFS mounted from another machine. RAID can be stacked in several layers to offer very fast seek and transfer in such a way that it will work if even 3 drives fail. The choices are many, limited only by imagination and, probably more importantly, money.

6.11 Recommendations

There is a near infinite number of combinations available but my recommendation is to start off with a simple setup without any fancy add-ons. Get a feel for what is needed, where the maximum performance is required, if it is access time or transfer speed that is the bottle neck, and so on. Then phase in each component in turn. As you can stack quite freely you should be able to retrofit most components in as time goes by with relatively few difficulties.

RAID is usually a good idea but make sure you have a thorough grasp of the technology and a solid back up system.

Next Previous Contents