Linux IDE-RAID Notes - I75RAID

October 2000 - April 2001
# df -H /mnt/tmp
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              1.1T   21k  1.0T   1% /mnt/tmp


This page is a work in progress, with tests pending.  For a fully working system in production, please see the results with 3ware hardware.


Given the minimum-cost/maximum-capacity goal with IBM 75GB drives, a single Terabyte RAID system has four IDE controllers, 8 channels, and 15-16 RAID drives.  With such a large file system, fsck for an ext2 file system takes a very long time, so a journaling file system like Reiserfs is recommended. Reiserfs with RAID5 is not safe under 2.2, but it is safe under 2.4.  Reiserfs is now included in kernels since 2.4.1 - necessary tools must be added separately.


Ultra ATA 100 Controllers

Promise Ultra100 Controller

Channel 0 (IDE 1) is toward the inside and channel 1 (IDE 2) is toward the outside backplate.

The previous BIOS limit of 3 Promise cards seems to have been lifted.

Suggestions for Promise
  1. Please make low-cost cards with more channels (even more than the 6-channels/6-drives on the SuperTRAK100).
  2. Please use only edge-facing connectors so that cards can more easily fit in adjacent slots (the lower bank of connectors on the SuperTRAK cards face out, not up).

SIIG Ultra ATA 100 Controller

Channel 0 (primary IDE) is toward the outside backplate and channel 1 (secondary IDE) is toward the inside.
Suggestions for SIIG
  1. Please use IDE connectors that face the edge of the card, as Promise does.
  2. Please make low-cost cards with more channels.

Motherboard issues

Carefully check performance for your motherboard and PCI IDE controller cards.  There are performance problems with common motherboards, probably related to PCI-bus implementation.  Be sure to check JBOD (Just a Bunch of Disks) performance first.  JBOD tests have a simultaneous bonnie++ process for each drive, each drive has its own ext2 file system.  I wasted a lot of time testing various kernels, patches, and RAID configurations before figuring out that there were serious hardware performance problems that no amount of software tinkering can fix.
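The JBOD test procedure described above can be sketched as follows (drive names and mount points are examples; bonnie++ must be installed and each drive's ext2 file system already mounted):

```shell
# JBOD test sketch: one simultaneous bonnie++ process per drive, each
# drive with its own ext2 file system mounted at /mnt/<drive>.
for d in hde hdf hdg hdh; do
    [ -d /mnt/$d ] || continue          # skip drives that are not mounted
    ( cd /mnt/$d && bonnie++ -u root -s 1000 -m $d ) &
done
wait                                    # let all background tests finish
```

Running the tests simultaneously is the point: it exercises the PCI bus and all controllers at once, which is what exposes the motherboard problems described below.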

Intel CC820

    - with 600MHz Pentium III and 128MB PC133 SDRAM
    820 motherboard - ATX, 133MHz FSB, PC133 SDRAM, UDMA ATA/66 (ICH)
    plus BIOS update to support disks greater than 32GB
Controllers are ordered by PCI slot from low-to-high.
This motherboard has the MTH (Memory Translator Hub) problem.  It worked fine for months, but under increasing test load, it started to hang frequently.  I finally gave up on it and called Intel for a replacement.

3/9/2001 email and phone call to Intel - tech contractor will call in 5 working days (by 3/16)
3/15/2001 call from tech contractor - motherboard on order, will call on arrival in a couple of days
3/30/2001 replacement received from tech contractor - Intel came through!

Intel VC820

    - with 600MHz Pentium III and 128MB 356MHz PC700 RDRAM
    820 motherboard - ATX, 133MHz FSB, RDRAM, UDMA ATA/66 (ICH)
Controllers are ordered by PCI slot from low-to-high.
There is a write performance problem with this motherboard and the Promise controllers - compare versus the Asus A7M266.  In contrast, the SIIG controller write performance looks good.

For bonnie++ output including CPU usage, click here.

Asus CUSL2

    - with 800MHz Intel Pentium III 133MHz FSB (EB) and 128MB PC133 SDRAM
    815E Solano 2 motherboard - ATX, 133MHz FSB, PC133 SDRAM, UDMA ATA/100, 6xPCI
Controllers are ordered by PCI slot from high-to-low.
There are serious performance problems with this motherboard and both the Promise and SIIG controllers - compare versus the Asus A7M266.  Rates around 30MBps are typical of DMA performance, rates below 5MBps are typical of PIO performance.  Given that the problem occurs with just one card, tests with multiple cards have been skipped.

For bonnie++ output including CPU usage, click here.

Asus A7M266

    - with 1GHz AMD Athlon 266MHz FSB and 128MB 266MHz RAM
    AMD761 Socket A motherboard - ATX, 266MHz FSB, PC2100 DDR DRAM, UDMA/100, 5xPCI
Controllers are ordered by PCI slot from high-to-low.
The Promise and SIIG controllers perform OK on this motherboard with the following major caveats.  It sometimes takes some tinkering to get the system to boot with 3 or 4 Promise cards - I haven't figured out any consistent behavior.  When a third SIIG card is added to this motherboard, the motherboard immediately fails BIOS boot.  One motherboard got locked permanently in this state and had to be returned for replacement.

For bonnie++ output including CPU usage, click here.

Motherboard/Controller Summary

The wide range of behavior with different motherboards and different controllers clearly shows that careful hardware JBOD testing is essential before moving on to RAID tests.
Motherboard   Promise read  Promise write  SIIG read   SIIG write
Intel CC820   OK            slow           -           -
Intel VC820   OK            slow           OK          OK
Asus CUSL2    OK            slow           slow        slow
Asus A7M266   OK            OK             OK/3-fails  OK/3-fails


While the ATA spec sets a limit of 18" on cables, most cases don't allow us to conform to the spec.  In my experience, the 24" cables work fine on both the Promise and SIIG cards.  I recently bought 36" cables, which seem to work with the Promise cards, but not with the SIIG cards.  David Christensen reports:
    I am currently using them and have found that they tend to have very hit and miss performance.  If I buy 4 sets of the cables, invariably, 1 or 2 will not work properly.  The symptoms vary from corrupt drive ID information during POST (an obviously bad cable) to odd file corruption after tens of hours of use.  The result is that you can use them, but they need to be extensively tested before production use.

System Configuration

Asus A7M266 - with 1GHz AMD Athlon 266MHz FSB and 128MB 266MHz RAM
    AMD761 Socket A motherboard - ATX, 266MHz FSB, PC2100 DDR DRAM, UDMA/100, 5xPCI
4 Promise Technology Ultra100 - PCI Ultra ATA/100 Controller Card (PDC20267)
0 SIIG Ultra ATA 100 PCI - Dual Channel Controller (CN2474 - CMD649)
16 IBM Deskstar 75GXP - 75GB EIDE Ultra-ATA/100 7200RPM 8.5ms (37MB/s)
1 IBM Deskstar 75GXP - 15GB EIDE Ultra-ATA/100 7200RPM 8.5ms (37MB/s) [boot drive]
Antec KS011BX - ATX 18-bay black tower server case w/o power supply
Antec 761345-77055-2 - 3 pin ball bearing fan - 92mm
Enlight EN-8407362 - ATX 400W single power supply for server cases [has 6 drive-power cables]
CTG (Cables To Go) 18729 - 24in int Ultra DMA/ATA ribbon 3 connector IDE 33/66 UDMA $11
controller  channel  disk    BIOS  /dev/
0           0        master  -     hde
0           0        slave   -     hdf
0           1        master  -     hdg
0           1        slave   -     hdh
1           0        master  -     hdi
1           0        slave   -     hdj
1           1        master  -     hdk
1           1        slave   -     hdl
2           0        master  -     hdm
2           0        slave   -     hdn
2           1        master  -     hdo
2           1        slave   -     hdp
3           0        master  -     hdq
3           0        slave   -     hdr
3           1        master  -     hds
3           1        slave   -     hdt


RedHat 7.1 (Wolverine) Linux - Important edits

Replace unapproved gcc with kgcc
# mv /usr/bin/gcc /usr/bin/gcc-
# ln -s /usr/bin/kgcc /usr/bin/gcc
or otherwise edit kernel Makefile and carefully specify kgcc where critical
# edit Makefile
CC      :=$(shell if which $(CROSS_COMPILE)kgcc > /dev/null 2>&1; then echo $(CROSS_COMPILE)kgcc; else echo $(CROSS_COMPILE)gcc; fi) -D__KERNEL__ -I$(HPATH)
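The compiler selection in the Makefile line above can be checked standalone with the same shell logic (a sketch; kgcc exists only on Red Hat systems with the kgcc package installed):

```shell
# Same selection logic as the kernel Makefile line above:
# prefer kgcc when present, fall back to gcc otherwise.
if which kgcc > /dev/null 2>&1; then
    CC=kgcc
else
    CC=gcc
fi
echo "selected compiler: $CC"
```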

# ed /etc/sysconfig/harddisks
# ed /etc/rc.d/rc.sysinit
disk[0]=s; disk[1]=hda; disk[2]=hdb; disk[3]=hdc; disk[4]=hdd;
disk[5]=hde; disk[6]=hdf; disk[7]=hdg; disk[8]=hdh;
disk[9]=hdi; disk[10]=hdj; disk[11]=hdk; disk[12]=hdl;
disk[13]=hdm; disk[14]=hdn; disk[15]=hdo; disk[16]=hdp;
disk[17]=hdq; disk[18]=hdr; disk[19]=hds; disk[20]=hdt;

if [ -x /sbin/hdparm ]; then
   for device in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do

Maximum RAID size - drives per set (MD_SB_DISKS)

The RedHat default maximum is 12, the hard maximum is 27, limited by the 4096-byte superblock.  The following update installs tools that support the hard maximum.

RAID tools - update required - http://people.redhat.com/mingo/raid-patches/

# tar xf raidtools-dangerous-0.90-20000116.tar
# cd raidtools-0.90
# ./configure; make all install

Maximum File-System sizes

With the default block size of 1K, ext2 is limited to 1TB.  For reiserfs, the following update is required for file systems larger than 0.5TB.

Reiserfs tools - update required - http://www.namesys.com/
# tar xf reiserfsprogs-3.x.0*.tar
# cd reiserfsprogs-3.x.0*
# ./configure; make all install

Linux 2.4.3

Linux 2.4.3 kernel (includes Reiserfs) - Kernel-HOWTO

# VER=2.4.3
# umask 002
# mkdir /usr/src/linux-$VER; cd /usr/src/linux-$VER; tar xf linux-$VER.tar; mv linux/* .; rmdir linux
# cd ..; rm /usr/src/linux; ln -s /usr/src/linux-$VER /usr/src/linux
# cd linux-$VER
# make mrproper
# make xconfig    # (remember to enable any other drivers for SCSI support, Network device support, Sound, etc)
    Code maturity level options
    y    Prompt for development and/or incomplete code/drivers
    Multi-device support (RAID and LVM)
    y    Multiple devices driver support (RAID and LVM)
    y        RAID support
    y            Linear (append) mode
    y            RAID-0 (striping) mode
    y            RAID-1 (mirroring) mode
    y            RAID-4/RAID-5 mode
    ATA/IDE/MFM/RLL support
        IDE, ATA and ATAPI Block devices
    y    Generic PCI bus-master DMA support
    y        Use PCI DMA by default when available
    y        CMD64X chipset support
    y        Intel PIIXn chipsets support
    y            PIIXn Tuning support
    y        PROMISE PDC20246/PDC20262 support
    y            Special UDMA Feature
    y        VIA82CXXX chipset support
    SCSI support
        SCSI low-level drivers
    Network device support
        Ethernet (10 or 100Mbit)
   File Systems
    y    Reiserfs support
        Network File Systems
    y    NFS file system support
    y        Provide NFSv3 client support
    y    NFS server support
    y        Provide NFSv3 server support
    y    SMB file system support (to mount Windows shares etc.)
# make dep clean bzImage modules modules_install
# sh scripts/MAKEDEV.ide
# cp arch/i386/boot/bzImage /boot/vmlinuz-$VER
# cp System.map /boot/System.map-$VER
# ed /etc/lilo.conf
# lilo    # LILO mini-HOWTO, BootPrompt-HowTo
# reboot
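The lilo.conf addition for the new kernel might look like this (a sketch; the label and root device are assumptions for a system booting from the first motherboard IDE drive):

```
image=/boot/vmlinuz-2.4.3
        label=linux-2.4.3
        root=/dev/hda1
        read-only
```

Remember to rerun lilo after every change to lilo.conf so the boot map is rebuilt.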

Disk Configuration

Standard ext2fs setup/test example

    # cfdisk /dev/hde
    # mke2fs /dev/hde1
    # mount -t ext2 /dev/hde1 /mnt/tmp
    # cd /mnt/tmp
    # bonnie++ -u root -s 1000 -m hde
    # cd /
    # umount /mnt/tmp

Reiserfs setup/test example

    # cfdisk /dev/hde
    # mkreiserfs /dev/hde1
    # mount -t reiserfs /dev/hde1 /mnt/tmp
    # cd /mnt/tmp
    # bonnie++ -u root -s 1000 -m hde
    # cd /
    # umount /mnt/tmp

RAID setup/test example

    # cfdisk /dev/hde # make partitions
    # ed /etc/raidtab # see http://ostenfeld.dk/~jakob/Software-RAID.HOWTO/
    # mkraid /dev/md0
    # cat /proc/mdstat
    # mkreiserfs /dev/md0
    # mount -t reiserfs /dev/md0 /mnt/tmp
    # cd /mnt/tmp
    # bonnie++ -u root -s 4000 -m md0
    # cd /
    # umount /mnt/tmp
    # raidstop /dev/md0
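As a sketch, an /etc/raidtab for a 16-drive RAID5 set over the drives listed above might look like the following (chunk size and parity algorithm are example choices, not recommendations - see the Software-RAID HOWTO referenced above):

```
raiddev /dev/md0
    raid-level              5
    nr-raid-disks           16
    nr-spare-disks          0
    persistent-superblock   1
    parity-algorithm        left-symmetric
    chunk-size              64
    device                  /dev/hde1
    raid-disk               0
    device                  /dev/hdf1
    raid-disk               1
    # ...continue through /dev/hdt1 as raid-disk 15
```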



For bonnie++ output including CPU usage, click here.

graph of bonnie++ block performance

For 15-drive RAID0 and 16-drive RAID5, mke2fs fails - the effective size limit for ext2 with 1K block size is 1TB.

Note that this measurement for RAID5 read performance (cyan for ext2, navy for reiserfs) is poor and is lower than write performance (yellow for ext2, orange for reiserfs).  This is unexpected.  Further tests show that this is repeatable, still with good JBOD performance.  Comparison with the 3WRAID is interesting.  Note that the Promise-based system is IDE at the driver level, while the 3ware-based system appears as SCSI at the driver level even though the disks are IDE.

Neil Brown ran some tests on a 7-drive SCSI RAID array that help to show that the overall write-performance slowness was specific to my previous IDE configuration, and probably also specific to IDE with this current configuration - see http://cgi.cse.unsw.edu.au/~neilb/wiki/index.php?LinuxRaidTest.  Note that the motherboard tests clearly show the performance problems due to certain combinations of motherboard and IDE-controller hardware.  The previous configuration suffered fundamentally from these problems.  The current configuration does not have these problems, as shown by the JBOD tests, but I'm still suspicious of the hardware as well as the software.

Disappointing software-RAID performance, especially write performance, was reported by Nils Rennebarth in the Linux 2.4 Kernel TODO List (http://linux24.sourceforge.net/).  It was also reported by several people in "Linux 2.4.0 Test2 Almost Ready for Prime Time" on Slashdot (http://slashdot.org/articles/00/06/24/1432213.shtml).

This system is still under test.

Additional Notes

Status from /proc/mdstat shows huge resync times.  Resync appears to work as advertised, as performance after resync completion appears to be the same.  To increase md resync speed, raise the md speed limits under /proc/sys/dev/raid.
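On 2.4 kernels, md exposes speed_limit_min and speed_limit_max (in KB/sec) under /proc/sys/dev/raid; a sketch of inspecting and raising them (the value written is only an example, not a recommendation):

```shell
# Show the current md resync speed limits (KB/sec), then raise the
# minimum so resync is allowed more bandwidth.  Requires root to write.
for f in /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max; do
    if [ -f "$f" ]; then
        echo "$f = $(cat $f)"
    fi
done
if [ -w /proc/sys/dev/raid/speed_limit_min ]; then
    echo 10000 > /proc/sys/dev/raid/speed_limit_min   # example value
fi
```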

Performance Summary

key & configuration   Bonnie read  Bonnie write  controller
PIO ex. I34GXP        4.1          4.3           Promise Ultra66
I75GXP I66            36.5         26.9          Intel CC820 ICH
I75GXP P100           36.5         29.4          Promise Ultra100
I75GXP P100 ReiserFS  35.8         35.4          Promise Ultra100
I34RAID               66.8         35.6          Promise Ultra66
M40RAID               46.6         35.5          mixed controllers
S18RAID               39.5         36.7          2940U2W W/LW mix
3WRAID                62.5         30.4          3ware Escalade 6800 JBOD (SW RAID5)
I75RAID-15-ext2       28.7         45.9          Promise Ultra100
I75RAID-16-reiserfs                              Promise Ultra100

Explanation for the above, in order of test:
PIO ex. I34GXP - PIO reference
I75GXP I66 - Intel PIIX4 reference
I75GXP P100 - Promise Ultra100 reference
I34RAID,  M40RAID, S18RAID - reference


    # top
    # hdparm -Tt /dev/hde # measure device reads, run at least 3x


   # Bonnie -s 500 -html -m ...


    http://www.coker.com.au/bonnie++/ - adds directory benchmarks and synchronized multiple processes
    To enable SystemV Shared Memory (for semaphores), see Documentation/Changes in kernel source
    # mkdir /dev/shm
    # ed /etc/fstab
    none                    /dev/shm                shm     defaults        0 0
    # reboot
    # bonnie -p 2 -u root
    # for i in 1 2; do bonnie -y -u root -s 500 -m ... & done
                            -------Sequential Output-------  ---Sequential Input---  --Random--
                            -Per Char-  --Block--  -Rewrite-  -Per Char-  --Block--  --Seeks--
Machine                MB   K/sec %CPU  K/sec %CPU K/sec %CPU K/sec %CPU  K/sec %CPU  /sec %CPU
PIO ex. I34GXP        500    2845 64.1   4315 50.5  2053 10.5  2743 32.0   4114  5.4  86.9  2.1
I75GXP I66            500    9602 97.2  26937 17.7 14687 19.8  9727 93.8  36482 21.2 155.8  1.7
I75GXP P100           500    9629 97.5  29428 20.5 15312 19.5  9798 94.9  36462 21.7 158.4  1.7
I75GXP P100 ReiserFS  500    7972 98.2  35377 65.1 15243 24.9  8509 93.0  35762 27.2 148.8  2.5
I34RAID               500    7251 91.9  35571 30.2 18232 35.0  8134 95.9  66774 46.8 207.6  3.0
M40RAID               500    7443 91.3  35546 29.5 17707 34.0  8251 95.4  46554 32.6 322.3  4.4
S18RAID               500    4857 98.3  39451 78.8 16078 55.2  6533 95.0  36652 35.6 495.8 11.8
3WRAID               4000   11770 85    30398 13   21990 20   11050 82    62470 49   245.1  1

Additional References

http://www.linux-ide.org/ - Linux ATA Development and Linux Disk Certification Project
http://www.linuxdiskcert.org/ - See ide. for Andre Hedrick's latest IDE performance improvements
http://www.cse.unsw.edu.au/~neilb/patches/linux/ - Neil Brown's RAID patches
http://lists.omnipotent.net/reiserfs/200008/msg00656.html - What's up with NFS and ReiserFS
http://lists.omnipotent.net/reiserfs/200008/msg00632.html - Lexa patch is applied to the 2.4 kernel
ftp://ftp.redhat.com/pub/redhat/beta/fisher/ - RedHat 7.1 beta

NoBell Home - gjm - last update 4/12/2001