Xe

See Also

Xe Lustre, Xe Errors, and Xe Production

TODO

give nodes fixed IP addresses, 'cos at the moment if dhcp fails (ie. the front-end is down for too long) it de-configures the eth interfaces (it probably shouldn't do that - I don't remember it doing it before - maybe it's an ifcfg-eth0 setting?). can still get in on the IB interfaces or over SOL and fix the problem though, so it's not super-major.
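a minimal static ifcfg sketch that would remove the dhcp dependency (the address is just an example from the 10.0.1.[1-19]/16 node range below; per-node values differ):

 # /etc/sysconfig/network-scripts/ifcfg-eth0   (example for x1)
 DEVICE=eth0
 BOOTPROTO=static
 IPADDR=10.0.1.1
 NETMASK=255.255.0.0
 ONBOOT=yes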

Events

See also Xe Errors

  • Nov 28 2007
    • put qla2xxx in x17,18,19 and attached each to 3 daisy-chained 73G tp9100's
    • using ac1,2's lower ix brick jobfs fibres and patch panels to do this
  • Sep 27 2007
    • moved x1's FC card to xe. reseated x1's memory. (a day later moved the FC card back to x1)
    • lots of iSER. various OFED versions tested with centos5.
    • trying a single xe OSS with both SAS and FC disks (not spectacular; 3 OSSs (2 FC, 1 SAS) is much snazzier)
  • Aug 31 2007 - lustre 1.6.2 and centos5 are the defaults for clients now. head node still centos4.
    • iscsi RDMA testing
    • lustre 1.6.1 bug reports
    • lustre quotas
  • Aug 18 2007 - reseated all disks in x2's tp9100 and the 2 or 3 slow disk problem seems to have gone away for now.
  • working on getting commercial package built on x86_64 with gcc4 and gfortran
    • everything except input parser seems to be working
    • ifort version still quicker though it seems. blas libs seem mostly irrelevant to speed
  • July 9 2007 - disks f and j (renamed e, i by the driver now) in the tp9100 on x2 are slow - 49MB/s whereas the rest hdparm at 72
  • July 6 2007 - replacement IB switch with full working DDR fabric in the backplane installed
    • SATA problems being worked on by SGI now. seems like vibration vs. hitachi.
    • Lustre over IB and GigE at the same time to different sets of nodes seems to work ok. quotas still to check. separate MGS and MDS/MDT tested. hitting more LBUG's than usual... got this one with the GigE/IB test:
Jul  4 21:52:30 x1 kernel: LustreError: 12049:0:(filter.c:1575:filter_iobuf_get()) ASSERTION(thread_id < filter->fo_iobuf_count) failed 
Jul  4 21:52:30 x1 kernel: LustreError: 12049:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG 
  • June 20 2007 - x3 motherboard replaced and seems ok
  • June 18 2007 - CentOS5 oneSIS image verified working, including kmod-xfs, and optional kernel.org kernel
    • also x1's sles10 restored from a dd from x2, and x6's sles10 restored from x4
    • sdb's wiped to help with the ongoing SATA testing
    • tested re-install of centos4.5 to a sdb
  • May 30 2007 - oneSIS image's rc.sysinit fixed up for centos 4.5 - mk-sysimage happy. x19 boots diskless lustre.
  • May 28 2007 - turns out our IB switch isn't anywhere near full non-blocking - it's about 1/8 bandwidth
  • May 24 2007 - CentOS 4.5
  • May 8 2007 - swap over network to iSCSI is working. 2.6.21.1 kernel, patched with netswap v12-2.6.21. I'm told that NFS in 2.6.21 isn't good, but netswap to NFS actually seemed to work fine with 2.6.21-rc3-netswap20070319.
  • May 2 2007 - CentOS 5 image built. yum upgrade failed at all(?) %post scripts, so did a PXE install using the CD installer vmlinuz and initrd over serial console instead. worked fine. minimal tidying required. x18,x19 running it at the moment. both CentOS 5 kernel or RHEL4-lustre 1.6.0 kernel (without udev) seem to work ok. local exclude list for systemimager updated.
    • running mpi bonnie++ and larger bonnie chunk tests to SAS.
  • Apr 27 2007 - sdc and sdi SAS disks on xe replaced via hotswap. so the really confusing thing now is that the disks got relabeled by the SCSI drivers somehow, so that the unique ids of the disks are now /dev/sd[c-l,w,y] instead of /dev/sd[c-n]. see the is120 section for more info.
  • Apr 24 2007 - a sync; sync; sleep 30 before rmmod'ing Lustre modules seems to have stopped the repeatable crashes with the small file bonnie++ runs. those have been running for about a week now with no crashes on OSSs (x1,x2 FC disks) or MDS (x17 SATA disk or ramdisk) or clients so far.
  • Apr 12 2007 - a run with 4 cpus on the MDS hung part way in an umount. did a cleanLustre and killed the hung ssh processes trying to do the umount and it seems to have proceeded ok from that point.
  • Apr 10 2007 - temporarily blame the 2 dying SAS disks for the xe lockups and have asked SGI for replacements. going back to lustre small file testing using FC on x1,x2 as storage. setup x1,x2 with MSI on mptbase, and MSI-X on ib_mthca. doing x18 MDS/MDT on ramdisk. lustre survived one small file run... (Apr 11) lustre on x1,x2 survived multiple small file runs. starting a 1 cpu on oss, mds set of tests to see where/if cpu power is required.
  • Apr 7 2007 - turned on msi in mptbase for 2.6.20.4 and maybe the lockups have stopped now... ?? sdc and sdi's SMART data says they're dying, so need to get them replaced.
  • Apr 4->7 2007 - xe locking up with local SAS raid tests alone.
  • Apr 3 2007 - xe is still crashing, so moved FC card and that Lustre OSS to x2. put SAS card back into xe.
  • Apr 1 2007 - lots of crashes on xe. moved login node (xe.anu.edu.au) to x7 so can ipmi reset xe when it has problems. xe's ipmi reconfigured to use channel 2 (was setup on channel 1).
    • added options mptbase mpt_msi_enable=1 and now each MPT ioc gets its own interrupt instead of ioc0, ioc1, ib_mthca sharing an interrupt. might help.
    • testing all fc drives together and separately to see if the crashes are due to a hardware problem there.
    • 1st SAS drive (sdc) says it has 6 uncorrectable errors. dd of /dev/zero to the disk didn't fix it.
    • tested msi=1 and msi-x=1 with netpipe on x2,x3 and made no difference
  • Mar 23 17:00 2007 - is120 sas JBOD attached to xe
  • Mar 23 10:20am 2007 - moved tp9100 connected to xe from the xe rack to the actest rack
    • xe was deliberately left running a lustre job whilst the tp9100 fibre was unplugged and the unit moved. there were lots of scsi errors for ~1 minute, then many many more lustre errors. after re-connecting the tp9100, logins to xe were ok, but after a while it hung (while doing a cat /proc/mdstat), although it was still spooling out more lustre errors. the fibres were reversed (50% chance of the right order) and the node stayed hung. the xe node wasn't responding to ctrl-alt-del or sysrq, so it was hard power cycled, then the raid0's were restarted, lustre mounted, and it was fine. the filesystem was intact and all the files looked ok, but processes on the first node in the MPI lustre job had died, leaving the rest orphaned. the orphaned MPI processes were killed and the job restarted. as the MPI run was started from xe, it can't reasonably have been expected to survive the reboot of xe intact.
  • approx Mar 21 11am 2007 - IB switch set to SDR (enable, config, ddr, set-fabric-to-sdr)
    • all 3(6 ports) spine chips and 2(6 ports) line chips report they're at SDR. all HBAs say they're at DDR. (enable, utilities, port-verify)
  • Mar 9 14:58:11 2007 - raised the memlock limit in limits.conf to 1g (from 128m) to see if that helps with switch crashes (maybe it did as no switch crashes since?)

Hardware

Nodes

  • 19 SGI xe210
dual Xeon 5150 @ 2.66GHz (4 cores/node)
8G ram. DDR2 667 on 1333 FSB
IB DDR (20Gbit transport, max 16Gbit data) on PCIe x8 (20Gbit)
2x e1000 gigE (one connected)
 - nodes x1,x16-x19 have the same FC card as the front-end
 - nodes x18,x19 have 1/2 the ram and 1/2 the number of cores (only 1 socket filled)
  • front-end SGI xe240
dual Xeon 5150 @ 2.66GHz (4 cores/node)
8G ram
IB DDR (20Gbit transport, max 16Gbit data) on PCIe x8 (20Gbit)
2x e1000 gigE
dual fibre channel (2Gbit each, 4Gbit total = 400MB/s) on PCI-X (133/64 = 1Gbyte/s)
dual port SAS controller (12Gbit each) on PCIe x8 (lspci says x8, but m/b manual says x4)
  • supermicro nodes (xe, xemds)
dual Xeon 5462 @ 2.8GHz (8 cores/node)
8G ram. DDR2 800 on 1600 FSB
IB DDR (20Gbit transport, max 16Gbit data) on PCIe x8 (20Gbit) built into m/b
2x e1000 gigE
xemds has a PCIe FC card

Storage

  • FC
10 JBOD tp9100's. one attached to xemds, 3 to each of x17-x19
4 FC ports each, but only 2 used, so 4Gbit
16 73GB 15k rpm disks
  • SAS (loaner, now returned)
1 infinite storage 120 attached to xe
dual controller setup. x4 SAS host interface, so 12Gbit each
12 300G 10k rpm maxtor disks

Networking

  • nodes
gigE 10.0.1.[1-19] /16           x*
BMC/IPMI/SOL 10.0.40.[1-19] /16  x*bmc
IB 192.168.1.[1-19] /16          x*ib
  • front-end
external                         xe.anu.edu.au
BMC/IPMI/SOL not setup
gigE 10.0.10.1 /16               xe
IB 192.168.10.1 /16              xeib
  • switches
    • smc 48port model 8848m
    • voltaire 288port IB DDR model 9288
10.0.20.1                        smcswitch
10.0.21.1 /16                    voltaireswitch
192.168.2.100 /16                voltaireswitch-ib

OS/Install

SLES 10 and scali cluster-something came on the box from SGI, but SLES is buy-ware and upgrading was a pain (licenses, pah!). propack was easily orphaned by any SLES upgrade (broken rpm dependencies), which could easily toast the SGI versions of OFED et al. neither SLES 10 nor scali seemed to have any way to image backends, update or install packages on them, or do simple things like push out passwd files or accounts to them. overall it wasn't enjoyable. sooo....

now the cluster triple-boots SLES10, CentOS4.4, and diskless oneSIS.

  • SLES10 is on the first disk of each node and is in grub.conf for each node
    • invoked if pxeboot fails or redirects to localdisk
  • OSCAR 5 install of CentOS 4.4 x86_64 is on the 2nd disk of each node
    • invoked via a netboot'd kernel from pxe/tftp
    • master copy of the OS image is /var/lib/systemimager/images/oscarimage-centos-4 on the front-end
    • push out updates with cpushimage oscarimage-centos-4. excludes for the rsync are in the image in /etc/systemimager/updateclient.local.exclude
  • oneSIS 2.0rc10 diskless booting with ro root over NFS and rw PBS spool dirs
    • invoked via a netboot'd kernel from pxe/tftp with root on ramdisk/NFS, but could be on ramdisk/Lustre
    • root of the OS is the /var/lib/oneSIS/centos-4 dir on the front-end
  • 2.6.20-rc4 kernel (installed to help sort out IB and swap-over-network) reveals xe's SATA disks are possibly broken with NCQ (likely no big deal). linux-ide list informed, patches tested etc. symptom is a pile of errors in dmesg. most likely these SATA drives will just be blacklisted
  • OSCAR modified for fixes, bugs, and to stop it touching the sles10 partitions
  • GigE switch's config modified to allow ganglia's multicast to work
    • in the GUI, IGMP Snooping -> IGMP Configuration -> IGMP Status
    • on the command line no ip igmp snooping
  • need to boot with selinux=0 (just disabled isn't enough) otherwise rpm %pre and %post scriptlets fail for no reason
    • could re-enable selinux fully, but OSCAR people claim there are problems in the OSCAR chroot'd image in that case.

add the various hacks and bugs (mostly OSCAR) etc. here at some stage.

oneSIS

the basics are to install the oneSIS (http://www.onesis.org/) rpm, edit /etc/sysimage.conf, run mk-sysimage to build the links and patch rc.sysinit, and then mk-initrd-oneSIS to build an initrd with network and NFS preloaded. run mk-sysimage as many times as you like - it's smart and only does each thing once. running nodes can also be updated with update-node [-r|-d], and diskful hierarchical sub-master (or just diskful) nodes can be updated with sync-node (I haven't tried that yet). a rough sketch of the sequence is below.
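roughly, the sequence looks like this (a sketch only - the image path is the centos-4 one used here, but the exact mk-sysimage/update-node arguments depend on the oneSIS version and on what's in /etc/sysimage.conf, so check their docs):

 rpm -ivh oneSIS-2.0rc10-1.noarch.rpm
 vi /etc/sysimage.conf                  # describe the image: path, distro, ram/NFS overlay dirs
 mk-sysimage /var/lib/oneSIS/centos-4   # build links and patch rc.sysinit (safe to re-run)
 mk-initrd-oneSIS                       # build an initrd with network + NFS drivers preloaded
 update-node -r x19                     # push changes out to a running diskless node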

to get CentOS 5 working with oneSIS 2.0rc10, I rsync'd over the current OSCAR CentOS5 image (which was updated via anaconda from the OSCAR installed CentOS4), then installed the oneSIS rpm - the CentOS4 rpm works ok. I made a new distro patch for as5 (http://www.cita.utoronto.ca/~rjh/wiki/Xe/redhat-el-as5.patch) (lives on the master and not in the image) and I needed to alter the master's mk-initrd-oneSIS to use mke2fs -b 4096 ... instead of just mke2fs ..., as new RHEL/CentOS kernels don't like ext2 initrd's built with 1k blocks - cvs oneSIS might create a cpio initrd which would also solve this.

I also added a

mount -t rpc_pipefs sunrpc /var/lib/nfs/rpc_pipefs

to near the end of rc.sysinit. still, NFS gives some FS-Cache warnings... not sure what to do about that as NFS seems to be working ok without it.

IPMI/SOL

Serial-over-LAN and all of normal IPMI (eg. power control) works on all nodes. not being used on the front-end. not yet secured with good passwds/auth levels. also the machine arrived with an auth problem where anyone can get into IPMI (not SOL) without a passwd, and hence reboot and check SEL logs etc.

A CVS version of ipmitool is needed to get SOL working (ipmitool-1.8.9-cvs20070110). below is pretty much how SGI set up IPMI on nodes with SLES10, and how to set it up from scratch with CentOS 4 on the left hand NIC (channel 1, eth0 in Linux with a 2.6 kernel). the right hand NIC (eth1 in Linux) is channel 2. for the front-end xe node, channel=1.
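the lan/user commands below are run on each node against its own BMC over the local interface; a sketch of getting that going first (module names are the stock OpenIPMI ones - an assumption, so adjust if your ipmi packages differ):

 modprobe ipmi_msghandler
 modprobe ipmi_devintf
 modprobe ipmi_si              # provides /dev/ipmi0 for the local BMC
 ipmitool -I open bmc info     # sanity check before the lan/user setup below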

  • in /etc/inittab
# Added for SOL
cons:1235:respawn:/sbin/agetty -h -L 115200 ttyS1
  • add ttyS1 to /etc/securetty
  • ipmitool line for SOL:
ipmitool -I lanplus -H <bmcIP> -U admin -P <passwd> -o intelplus -v sol activate
  • normal IPMI command for reboot etc.
ipmitool -H <bmcIP> -U admin chassis power reset
  • SOL setup IP and access:
ipmitool channel info <channel>
ipmitool lan set <channel> ipaddr <someIP>
ipmitool lan set <channel> netmask 255.255.0.0
ipmitool lan set <channel> auth ADMIN MD5,PASSWORD
ipmitool lan set <channel> ipsrc static
ipmitool lan set <channel> arp respond on
ipmitool lan set <channel> arp generate on
ipmitool lan set <channel> arp interval 5
ipmitool lan print <channel>
ipmitool lan set <channel> access on

  • setup a user called 'admin'
ipmitool user set name 2 admin
ipmitool user set password 2 <some passwd>
#ipmitool user priv 2 4 <channel>
ipmitool channel setaccess <channel> 2 callin=on ipmi=on link=on privilege=4
ipmitool user list <channel>
ipmitool user enable 2
  • SOL setup continues over the lanplus interface:
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol info
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set privilege-level admin
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set non-volatile-bit-rate 115.2
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set volatile-bit-rate serial
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set force-encryption true
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set enabled true
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set retry-interval 2
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol payload enable <channel> 2

New Node

a new node may need GUI BIOS turned off. the assumed (original) node serial console settings are RTS/CTS, 115k, vt100 (port b, legacy enabled should be ok). also check the processor settings - previous nodes have hardware prefetch set. dual-cache line loading is probably set on all processors. might be worth playing with these two settings.

InfiniBand

main trauma here was err, no clue how to set it up! turns out there's nothing to setup really. CentOS does it for you with an /etc/init.d/openib start, then ifup ib0.
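for ifup ib0 to work, ib0 needs an ifcfg file like the eth ones. a minimal sketch (the address is just an example from the 192.168.1.[1-19]/16 IPoIB range listed above; per-node values differ):

 # /etc/sysconfig/network-scripts/ifcfg-ib0   (example for x1)
 DEVICE=ib0
 BOOTPROTO=static
 IPADDR=192.168.1.1
 NETMASK=255.255.0.0
 ONBOOT=yes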

problems from there were:

  • users need to be able to lock pages of memory (http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages). this is set in /etc/security/limits.conf where sizes are in KB.
    • seems likely that this required size would scale with # of nodes in a job and the size of messages they're sending(?)
      • 16M is too small for 16 node mpirun N HPL (ie. 16 MPI processes with 4 goto/dgemm threads per node), 128M seems ok for this test. 128M is also ok for 16 node mpirun C which is 64 MPI processes, so leave it at 128M for now... 1G was the setting for much of the testing. 1G:
*                soft    memlock         1048576
*                hard    memlock         1048576
  • this is diverging from IB a bit, but /etc/init.d/pbs_mom also needs limits set for jobs, ie. at the top of the script
ulimit -n 32768
ulimit -l 1048576
ulimit -s unlimited
  • SGI/Intel's BIOS doesn't set MaxReadReq for PCIe correctly. wisdom online has it that OFED 1.0 stacks work well, but OFED 1.1 based stacks don't do dodgy hacks for broken BIOSes and so go slowly. the upshot is that IB with newer kernels (strangely not OFED 1.1 based AFAICT ... ???) goes slowly at the minimum MaxReadReq of 256. one workaround is to use stock CentOS 4.4 (RHEL AS4) kernels, which set MaxReadReq to 512. two other workarounds are setpci, where MaxReadReq can be set to whatever you like (see the sketch below), and the tune_pci=1 option to ib_mthca, which seems to set it to the max of 4096.
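a sketch of the setpci route, assuming a pciutils new enough to understand the CAP_EXP capability name (08:00.0 is the HCA's address on these nodes, as used by the mstflint commands further down; MaxReadReq lives in bits 14:12 of the PCIe Device Control register, and 0x5 there means 4096 bytes):

 # read the HCA's current PCIe Device Control register
 setpci -s 08:00.0 CAP_EXP+8.w
 # set MaxReadReq to 4096 bytes with a read-modify-write, leaving the other bits alone
 val=$(setpci -s 08:00.0 CAP_EXP+8.w)
 setpci -s 08:00.0 CAP_EXP+8.w=$(printf '%04x' $(( (0x$val & ~0x7000) | 0x5000 )))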

Observed settings from lspci -vvv are:

Kernel version                          OS              MaxReadReq (bytes)
2.6.16.21-0.8-smp                       sles10          4096
2.6.9-42.0.3.ELsmp                      centos4.4       512
2.6.9-42.0.3.EL_lustre.1.5.97smp        centos4.4       512
2.6.19.2                                centos4.4       128
2.6.18-1.2732.4.2.el5.OFED_1_1          centos4.4       128
2.6.20-rc4                              centos4.4       128
2.6.9-42.0.10.EL_lustre-1.6.0.1smp      centos4.5       512
anything + tune_pci=1 (*)               centos4.4/4.5   4096

(*) options ib_mthca tune_pci=1 in /etc/modprobe.conf


  • SGI says:
 I'm not certain that there is a "correct" setting. At the moment
 the XE BIOS sets the default parameter to 512 and we have tested
 HCA's using this value and performance (although not optimal) is
 acceptable. If you want optimal performance you can use the tune_pci
 option to set MaxReadReq to 4096.
 To be honest, we don't know if there are any "side effects" with
 setting MaxReadReq to 4096.
 We are still in discussion with Intel over what the right thing
 for the BIOS to do in this case.

which sounds fair enough.

tiered switch internals

with the IB switch in SDR mode, NICs in DDR, with centos4.5, kernel 2.6.9-42.0.10.EL_lustre-1.6.0.1smp I'm seeing fast/slow groupings of nodes. so it's 11.4Gbit (as reported by NPmpi) within one of these groups, and 7.4Gbit between groups:

xe,9-13
2,14-19

where 1,3-8 are currently down or in sles10.

within one of the 2 subsets above, 6 simultaneous pairwise netpipes all give 11+Gbit. across the 2 subsets I get a scattergram of between 3.5 and 6.5 Gbit instead of the expected 7.4Gbit from SDR. so the bandwidth reduction is clearly seen even when using only 6 pairwise netpipes instead of the maximum of 12 at once that could be run with this switch configuration if 24 nodes were plugged in and on. (a sketch of how the pairwise runs are launched is below.)
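the simultaneous pairwise runs can be launched with OpenMPI something like this (node pairs and output names are just examples; -o is NetPIPE's output-file option):

 mpirun -np 2 -host x12,x13 ./NPmpi -o np.x12-x13 &
 mpirun -np 2 -host x14,x15 ./NPmpi -o np.x14-x15 &
 mpirun -np 2 -host x2,x16  ./NPmpi -o np.x2-x16  &
 wait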

this means that the switch has bottlenecks in it. the below netpipe plots are mostly between x12 and x13, which are (luckily) both in the fast group.

internally the switch looks like (from enable->utilities->port-verify)

#
# Topology file: generated on Mon Apr  9 19:17:51 2007
#
Printing Chassis 1 (chassis guid 0x0008f104004011a8)

devid=0x5a32
switchguids=0x8f104004011a9 Chassis ISR9288 1 Spine 1  Chip 1
Switch  24 "S-0008f104004011a9"         # "ISR9288 Voltaire sFB-12D" smalid 4
[1] "S-0008f104003f1576"[1] width 4X speed 2.5 Gbs
[2] "S-0008f104003f1577"[1] width 4X speed 2.5 Gbs

devid=0x5a32
switchguids=0x8f104004011aa Chassis ISR9288 1 Spine 1  Chip 2
Switch  24 "S-0008f104004011aa"         # "ISR9288 Voltaire sFB-12D" smalid 5
[1] "S-0008f104003f1576"[2] width 4X speed 2.5 Gbs
[2] "S-0008f104003f1577"[2] width 4X speed 2.5 Gbs

devid=0x5a32
switchguids=0x8f104004011ab Chassis ISR9288 1 Spine 1  Chip 3
Switch  24 "S-0008f104004011ab"         # "ISR9288 Voltaire sFB-12D" smalid 1
[1] "S-0008f104003f1576"[3] width 4X speed 2.5 Gbs
[2] "S-0008f104003f1577"[3] width 4X speed 2.5 Gbs

devid=0x5a34
switchguids=0x8f104003f1576 Chassis ISR9288 1 Line  1  Chip 1
Switch  24 "S-0008f104003f1576"         # "ISR9288/ISR9096 Voltaire sLB-24D" smalid 2
[1] "S-0008f104004011a9"[1] width 4X speed 2.5 Gbs
[2] "S-0008f104004011aa"[1] width 4X speed 2.5 Gbs
[3] "S-0008f104004011ab"[1] width 4X speed 2.5 Gbs
[13][ext 6] "H-0008f10403979814"[1] width 4X speed 5.0 Gbs    - x14
[14][ext 5] "H-0008f10403979844"[1] width 4X speed 5.0 Gbs    - x15
[15][ext 4] "H-0008f10403979e0c"[1] width 4X speed 5.0 Gbs    - x16
[16][ext 18] "H-0008f10403979854"[1] width 4X speed 5.0 Gbs   - x2
[18][ext 16] "H-0008f104039798fc"[1] width 4X speed 5.0 Gbs   - x5
[19][ext 1] "H-0008f10403979818"[1] width 4X speed 5.0 Gbs    - x19
[20][ext 2] "H-0008f1040397992c"[1] width 4X speed 5.0 Gbs    - x18
[21][ext 3] "H-0008f10403979934"[1] width 4X speed 5.0 Gbs    - x17
[22][ext 13] "H-0008f10403979850"[1] width 4X speed 5.0 Gbs   - x7
[23][ext 14] "H-0008f10403979858"[1] width 4X speed 5.0 Gbs   - x6
[24][ext 15] "H-0008f10403979dc0"[1] width 4X speed 5.0 Gbs   - x4

devid=0x5a34
switchguids=0x8f104003f1577 Chassis ISR9288 1 Line  1  Chip 2
Switch  24 "S-0008f104003f1577"         # "ISR9288/ISR9096 Voltaire sLB-24D" smalid 3
[1] "S-0008f104004011a9"[2] width 4X speed 2.5 Gbs
[2] "S-0008f104004011aa"[2] width 4X speed 2.5 Gbs
[3] "S-0008f104004011ab"[2] width 4X speed 2.5 Gbs
[13][ext 12] "H-0008f1040397998c"[1] width 4X speed 5.0 Gbs    - x8
[14][ext 11] "H-0008f10403979e30"[1] width 4X speed 5.0 Gbs    - x9
[15][ext 10] "H-0008f10403979820"[1] width 4X speed 5.0 Gbs    - x10
[18][ext 22] "H-0008f10403980ee8"[1] width 4X speed 5.0 Gbs    - external SGI 1
[19][ext 7] "H-0008f1040397981c"[1] width 4X speed 5.0 Gbs     - x13
[20][ext 8] "H-0008f10403979888"[1] width 4X speed 5.0 Gbs     - x12
[21][ext 9] "H-0008f10403979834"[1] width 4X speed 5.0 Gbs     - x11
[22][ext 19] "H-0008f1040397982c"[1] width 4X speed 5.0 Gbs    - x1
[23][ext 20] "H-0008f1040397e148"[1] width 4X speed 5.0 Gbs    - xe
[24][ext 21] "H-0008f10403980d7c"[1] width 4X speed 5.0 Gbs    - external SGI 2

which to me looks like the internals are wired in the typical IB tree fashion - ie. per line chip, 12 4x DDR HBA's are going through 3 4x SDR internal uplinks (will be DDR one day)... so this really doesn't look like a full bandwidth 24port switch! more like 1/4 bw at best and 1/8 bw at the moment :-(

as of july 2007 the new DDR switch internals are:

#
# Topology file: generated on Fri Feb 11 09:55:20 2028
#
Printing Chassis 1 (chassis guid 0x0008f10400401910)

devid=0x5a37
switchguids=0x8f10400401911 Chassis ISR2012 1 Spine 1  Chip 1
Switch 24 "S-0008f10400401911"      # "ISR2012 Voltaire sFB-2012" smalid 4
[1] "S-0008f104003f2084"[1] width 4X speed 5.0 Gbs
[2] "S-0008f104003f2085"[1] width 4X speed 5.0 Gbs

devid=0x5a37
switchguids=0x8f10400401912 Chassis ISR2012 1 Spine 1  Chip 2
Switch 24 "S-0008f10400401912"      # "ISR2012 Voltaire sFB-2012" smalid 5
[1] "S-0008f104003f2084"[2] width 4X speed 5.0 Gbs
[2] "S-0008f104003f2085"[2] width 4X speed 5.0 Gbs
                         
devid=0x5a37             
switchguids=0x8f10400401913 Chassis ISR2012 1 Spine 1  Chip 3
Switch 24 "S-0008f10400401913"      # "ISR2012 Voltaire sFB-2012" smalid 1
[1] "S-0008f104003f2084"[3] width 4X speed 5.0 Gbs
[2] "S-0008f104003f2085"[3] width 4X speed 5.0 Gbs

devid=0x5a38
switchguids=0x8f104003f2084 Chassis ISR2012 1 Line  1  Chip 1
Switch 24 "S-0008f104003f2084"      # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 2
[1] "S-0008f10400401911"[1] width 4X speed 5.0 Gbs
[2] "S-0008f10400401912"[1] width 4X speed 5.0 Gbs
[3] "S-0008f10400401913"[1] width 4X speed 5.0 Gbs
[13][ext 13] "H-0008f10403979844"[1] width 4X speed 5.0 Gbs
[14][ext 14] "H-0008f10403979814"[1] width 4X speed 5.0 Gbs
[15][ext 15] "H-0008f1040397981c"[1] width 4X speed 5.0 Gbs
[16][ext 16] "H-0008f10403979858"[1] width 4X speed 5.0 Gbs
[17][ext 17] "H-0008f10403979854"[1] width 4X speed 5.0 Gbs
[18][ext 18] "H-0008f1040397982c"[1] width 4X speed 5.0 Gbs
[19][ext 19] "H-0008f10403979820"[1] width 4X speed 5.0 Gbs
[20][ext 20] "H-0008f10403979e30"[1] width 4X speed 5.0 Gbs
[21][ext 21] "H-0008f10403979888"[1] width 4X speed 5.0 Gbs
[24][ext 24] "H-0008f10403980d7c"[1] width 4X speed 5.0 Gbs

devid=0x5a38
switchguids=0x8f104003f2085 Chassis ISR2012 1 Line  1  Chip 2
Switch 24 "S-0008f104003f2085"      # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 3
[1] "S-0008f10400401911"[2] width 4X speed 5.0 Gbs
[2] "S-0008f10400401912"[2] width 4X speed 5.0 Gbs
[3] "S-0008f10400401913"[2] width 4X speed 5.0 Gbs
[13][ext 1] "H-0008f10403979818"[1] width 4X speed 5.0 Gbs
[14][ext 2] "H-0008f1040397992c"[1] width 4X speed 5.0 Gbs
[15][ext 3] "H-0008f10403979850"[1] width 4X speed 5.0 Gbs
[16][ext 4] "H-0008f10403979dc0"[1] width 4X speed 5.0 Gbs
[17][ext 5] "H-0008f104039798fc"[1] width 4X speed 5.0 Gbs
[18][ext 6] "H-0008f1040397985c"[1] width 4X speed 5.0 Gbs
[19][ext 7] "H-0008f1040397e148"[1] width 4X speed 5.0 Gbs
[20][ext 8] "H-0008f10403979934"[1] width 4X speed 5.0 Gbs
[21][ext 9] "H-0008f10403979e0c"[1] width 4X speed 5.0 Gbs
[22][ext 10] "H-0008f1040397998c"[1] width 4X speed 5.0 Gbs
[23][ext 11] "H-0008f10403980ee8"[1] width 4X speed 5.0 Gbs
[24][ext 12] "H-0008f10403979834"[1] width 4X speed 5.0 Gbs

devid=0x6274
Hca    1 "H-0008f10403980d7c"   # "SGI HCA-2"
[1] "S-0008f104003f2084"[24]  # lid 26 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979888"   # "Voltaire HCA410Ex-D"             - x12
[1] "S-0008f104003f2084"[21]  # lid 16 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979e30"   # "Voltaire HCA410Ex-D"             - x9
[1] "S-0008f104003f2084"[20]  # lid 13 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979820"   # "Voltaire HCA410Ex-D"             - x10
[1] "S-0008f104003f2084"[19]  # lid 25 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f1040397982c"   # "Voltaire HCA410Ex-D"             - x1
[1] "S-0008f104003f2084"[18]  # lid 15 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979854"   # "Voltaire HCA410Ex-D"             - x2
[1] "S-0008f104003f2084"[17]  # lid 18 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979858"   # "Voltaire HCA410Ex-D"             - x6
[1] "S-0008f104003f2084"[16]  # lid 12 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f1040397981c"   # "Voltaire HCA410Ex-D"             - x13
[1] "S-0008f104003f2084"[15]  # lid 19 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979814"   # "Voltaire HCA410Ex-D"             - x14
[1] "S-0008f104003f2084"[14]  # lid 20 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979844"   # "Voltaire HCA410Ex-D"             - x15
[1] "S-0008f104003f2084"[13]  # lid 22 lmc 0 width 4X speed 5.0 Gbs
 
Hca    1 "H-0008f10403979834"   # "Voltaire HCA410Ex-D"             - x11
[1] "S-0008f104003f2085"[24]  # lid 14 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403980ee8"   # "SGI HCA-1"
[1] "S-0008f104003f2085"[23]  # lid 27 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f1040397998c"   # "Voltaire HCA410Ex-D"             - x8
[1] "S-0008f104003f2085"[22]  # lid 10 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979e0c"   # "Voltaire HCA410Ex-D"             - x16
[1] "S-0008f104003f2085"[21]  # lid 21 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979934"   # "Voltaire HCA410Ex-D"             - x17
[1] "S-0008f104003f2085"[20]  # lid 17 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f1040397e148"   # "Voltaire HCA410Ex-D"             - xe
[1] "S-0008f104003f2085"[19]  # lid 7 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f1040397985c"   # "Voltaire HCA410Ex-D"             - x3
[1] "S-0008f104003f2085"[18]  # lid 6 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f104039798fc"   # "Voltaire HCA410Ex-D"             - x5
[1] "S-0008f104003f2085"[17]  # lid 11 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979dc0"   # "Voltaire HCA410Ex-D"             - x4
[1] "S-0008f104003f2085"[16]  # lid 9 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979850"   # "Voltaire HCA410Ex-D"             - x7
[1] "S-0008f104003f2085"[15]  # lid 8 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f1040397992c"   # "Voltaire HCA410Ex-D"             - x18
[1] "S-0008f104003f2085"[14]  # lid 23 lmc 0 width 4X speed 5.0 Gbs
Hca    1 "H-0008f10403979818"   # "Voltaire HCA410Ex-D"             - x19
[1] "S-0008f104003f2085"[13]  # lid 24 lmc 0 width 4X speed 5.0 Gbs

inter group (~6.5-9Gbit) and intra group (~11.4Gbit) bunches are clear in the below image of a large message size bunch of 8 simultaneous netpipes (kernel 2.6.9-42.0.10.EL_lustre-1.6.0.1smp, MSI-X, mthca PCI MaxReadReq 512). theoretically bandwidth should drop to 0.375 of full (8 streams sharing 3 uplinks), but > ~0.5 is seen, likely because the netpipes aren't perfectly synchronised at startup or during sending, so messages can slip through the limited backplane with better than expected performance. something like b_eff should show the division more strongly.

node groupings are currently

xe,3-5,7-8,11,16-19  (11)
1-2,6,9-10,12-15     (9)

which are easiest to obtain via the ibroute command, eg.

ibroute 4 | grep HCA | grep '001 '


netpipe

  • some OpenMPI (http://www.open-mpi.org/) and InfiniBand kernel module tuning netpipe runs (http://www.scl.ameslab.gov/netpipe/) are shown in the plot below. NPmpi linked with OpenMPI was used to generate them. the kernel is 2.6.19.2 except for the old.* curves, which are 2.6.9-42.0.3.ELsmp. all have near-as-dammit 4us latency and steps in the IB protocols at 64 bytes and again at 8-10kbytes.
    • Update: the last 2 curves now show where single data rate (SDR) has been set on the IB switch (enable, config, ddr, set-fabric-to-sdr) with 2.6.19.2 and a Lustre kernel, and actually show a higher(!!!!) rate than before. possibly a slightly updated (ofed 1.1) userland is the reason. the NICs are still DDR even though the switch is SDR, so perhaps the backplane isn't stressed yet... I should really have run an MPIThrash before and after the DDR change :-/ power cycling the switch should reset to the (busted?) DDR settings though, so that's still possible.
    • Update: CentOS4.5 curves (it ships with ofed 1.1 userland) have been added. also shown are acrossSDRchips curves which show the internal bandwidth reduction due to 4x chips running at SDR inside the switch - about 7.4 Gbit instead of 11.5 Gbit. latency is ~3.5us for the 7Gbit links and ~3.95us for the 11Gbit links, which is a little odd, as you'd think that if the slower links were going through more chips then their latency should be higher, but it's the opposite.
    • Update: an IB Verbs (NPibv) curve and a lustre 2.6.9 kernel DDR netpipe (also through the new voltaire switch) were added.
    • Update: ofed 1.2 with a 2.6.22.6 kernel seems to be the new clear winner. mostly just leave_pinned and MaxReadReq = 512 or 4096 curves are shown now (older curves here (http://www.cita.utoronto.ca/mediawiki/index.php/Image:IB_netpipe2.png)).
    • Curves added for when xe240's IB card is in the low profile (PCIe x4) slot vs. the card in its normal x8 slot.

  • OSCAR's OpenMPI needed rebuilding for IB, but that wasn't enough as the APAC mpithrash benchmark killed the OpenMPI. so we're running OpenMPI 1.1.3b3 now.
    • update OpenMPI 1.2

b_eff

b_eff v3.5 with the switch in sdr mode, and picking configs that use 4g ram on nodes:

b_eff =   4005.675 MB/s = 250.355 *  16 PEs with 4096 MB/PE on Linux x2.cluster 2.6.9-42.0.3.EL_lustre.1.5.97smp #1 SMP Fri Jan 12 17:22:43 MST 2007 x86_64
b_eff =   6365.884 MB/s =  99.467 *  64 PEs with 1024 MB/PE on Linux x2.cluster 2.6.9-42.0.3.EL_lustre.1.5.97smp #1 SMP Fri Jan 12 17:22:43 MST 2007 x86_64

and again, but with a replacement switch that works with ddr. we still have the switch backplane bottleneck though:

b_eff =   4342.455 MB/s = 271.403 *  16 PEs with 4096 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
b_eff =   7364.680 MB/s = 115.073 *  64 PEs with 1024 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64

looking at just 8 nodes on one side of the backplane (so all on the same ddr sub-switch) we get:

b_eff =   2446.235 MB/s = 305.779 *   8 PEs with 4096 MB/PE on Linux x2 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
b_eff =   4093.052 MB/s = 127.908 *  32 PEs with 1024 MB/PE on Linux x2 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64

which compares to the regular 8 node result from nodes 1-2,4-9 (half on each sub-switch) of:

b_eff =   2412.054 MB/s = 301.507 *   8 PEs with 4096 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
b_eff =   4147.746 MB/s = 129.617 *  32 PEs with 1024 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64

so basically no difference there... not sure why not.

OFED

newer OFED 1.1 (infiniband) stack doesn't rebuild via its standard build scripts, but could be worked on more... centos4.4 comes with an OFED 1.0 based stack. sles10 is 1.0 or a bit older (a beta). a recipe for installing new OFED 1.1 kernel modules into an old kernel is eg.

rpm -ivh kernel-lustre-source-2.6.9-42.0.3.EL_lustre.1.5.97.x86_64.rpm
rm -rf /lib/modules/2.6.9-42.0.3.EL_lustre.1.5.97smp/kernel/drivers/infiniband
tar xfz OFED-1.1.tgz
cd OFED-1.1/SOURCES
tar xfz openib-1.1.tgz
cd openib-1.1
./configure --with-core-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod
make
make install_modules

hmmm... although that doesn't really work as module versions are screwed up. instead let's try http://www.mail-archive.com/openib-general@openib.org/msg25052.html which (reading between the lines) means to configure OFED, then link its infiniband/ tree into the kernel sources, and then build the kernel+newIB all in one go. ie.

 ./configure --kernel-version=2.6.9-42.0.8.ELsmp.rjh.ibInTree --modules-dir=/lib/modules/2.6.9-42.0.8.ELsmp.rjh.ibInTree --kernel-sources=/home/rjh900/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9 --with-core-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod

then link this into the real kernel tree

 cd ~/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9/drivers/
 mv infiniband infiniband.old
 ln -s /home/rjh900/build/OFED-1.1/SOURCES/openib-1.1/drivers/infiniband
 # fix the include link by copying it instead
 cd infiniband
 rm include
 cp -rd ../../include .
 # link the rdma includes in so that the build can find them
 cd ~/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9/include
 ln -s ../drivers/infiniband/include/rdma rdma

then build

 cd ~/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9
 # configure and make kernel...

... and... that doesn't work either. maybe just the infiniband/Makefile needs work, but there also seem to be kernel 2.6.9 backports needed. could try this again and make sure to start from a reconfigured OFED...?? or maybe the backport patches in here will work ok: https://svn.openfabrics.org/svn/openib/gen2/branches/backport-to-2.6.9/README

Updates to a more recent ofed for rhel (http://people.redhat.com/dledford/Infiniband/openib/) userland are easy to install. it appears that the OFED userland is pretty smart and has a fairly stable API that works with multiple kernel versions. so applications don't need recompiling and sometimes just go faster when a recent enough kernel is used.

Firmware

firmware in the IB cards may be old. current (http://www.mellanox.com/support/firmware_table_IH3Lx.php) is maybe v1.2.000 and installed is 1.0.700 on backends (reported by dmesg) and 1.0.800 on the front-end. unfortunately although the cards are mellanox hardware the firmware seems to be rebadged by voltaire as /sys/class/infiniband/mthca0/board_id (https://wiki.openfabrics.org/tiki-index.php?page=MellanoxHcaFirmware) is VLT0050010001, whatever the hell that means. so I'm not sure of the best way to go about updating that.

  • firmware updated to 1.2.000 on 9 Feb 2007

lspci says

InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)

on 8x PCIe (20Gbit).

Voltaire invoice says:

 HCA 410Ex-D x8 PCI-Exp single 4x DDR port, MemFree

which makes sense. getting new firmware requires telling Voltaire (http://voltaire.com/) your email address and a serial number for which I typed in a node_guid which seemed ok.

according to the 2.6.20 kernel, firmware version 1.1.0 is current.

cat /sys/class/infiniband/mthca0/* (the two columns are xe and x1):

                  xe                    x1
 kernel           any                   any
 board_id         VLT0050010001         VLT0050010001
 fw_ver           1.0.800               1.0.700
 hca_type         MT25204               MT25204
 hw_rev           a0                    a0
 node_desc        xe HCA-1              x1 HCA-1
 node_guid        0008:f104:0397:e148   0008:f104:0397:982c
 node_type        1: CA                 1: CA
 sys_image_guid   0008:f104:0397:e14b   0008:f104:0397:982f
  • upgrade that puppy with eg.
mstflint -d 08:00.0 -i HCA410Ex-D-25204-1_2_0.img -skip_is burn

and other useful flags are -y to make it do it anyway, as well as query options

mstflint -d 08:00.0 q
mstflint -i HCA410Ex-D-25204-1_2_0.img q

and the v command to verify the running firmware and firmware image files, and ri to save the old firmware

mstflint -d 08:00.0 ri /tmp/old_firmware.img  (?)

Switch

  • the switch also seems to have an IP that's the same as a node's (enable->config->interface LOCAL->ip-address-local show is 192.168.1.3, same as x3ib)
    • this looks like a secondary address that's not used on ethernet, so probably it can be changed to anything
  • I can't ping any nodes from the switch's config interface at all. although backend nodes (not front-end) can ping voltaireswitch-ib ok.
  • switch's error log (enable->logs->event-log show) says the switch is in mixed SDR and DDR mode
  • see also the crash section below

Gigabit Ethernet

hardware is good old e1000. I installed the newest (7.3.20) drivers for the CentOS kernel. ITR=1 is the dynamic setting, but ITR=15k still seems to work best for HPL and works with both old and new e1000 drivers. 15k is the current setup.
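for reference, the 15k setting can be applied via the e1000 driver's InterruptThrottleRate module option (one comma-separated value per port) - a sketch, assuming the options go in /etc/modprobe.conf like the other module options on this page:

 # /etc/modprobe.conf
 options e1000 InterruptThrottleRate=15000,15000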

SELinux

has to be off on the front-end, and not just permissive, as rpm %post and %pre scripts fail with it in permissive mode. potentially a complete SELinux relabel would fix this, but the backend OS image in the chroot would likely have the same problem, so it's better left off for now.

final cluster config might want to have SELinux protected world-facing boxes and SELinux off on the master image node (which could also do pbs, maui, gmetad, ...).

I haven't noticed any speed differences with lustre tests and backends with or without selinux (permissive), so from that point of view it's not annoying.

HPL

does ok...

o.p64.ib.goto1.10.serial.memlock128M.e
WR11L2R1      121000   212     8     8            2145.63          5.504e+02

for 64 cores in non-threaded goto mode. a smaller memlock area (128M instead of 1G) for IB seems to help get a better score. that's 550.4 GF, or 8.6 GF/core, or 80.8% of peak.
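as a sanity check on those numbers (HPL does 2/3 N^3 flops; peak per core is 2.66GHz x 4 flops/cycle = 10.64 GF; N and the time come from the result line above):

 awk 'BEGIN{n=121000; t=2145.63; gf=2*n^3/(3*t*1e9);
      print gf " GF, " gf/64 " GF/core, " 100*gf/(64*10.64) "% of peak"}'
 # -> ~550.4 GF, ~8.6 GF/core, ~80.8% of peak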

over GigE doesn't fare nearly as well.

o.p64.threaded.eth.tuned.3
WR11L2R1      121000   200     4     4            2811.69          4.201e+02

so 420.1 GF, or 61.7% of peak. this is MUCH worse than for previous generations of cpus over GigE. perhaps HPL is now bandwidth limited over GigE.

HPL.dat is some slight variant of

HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
2            # of problems sizes (N)
60500 121000 125000 60500 8000 10700
5            # NBs
192 200 212 128 256 64 80 96 128 192 200 212 256 384 512 768 1024  NB
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
8 4 8          Ps
8 16 8         Qs
16.0         threshold
1            # of panel fact
2 1 0        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
1 2 4 8 16   NBMINs (>= 1)
1            # of panels in recursion
2 4 8 16     NDIVs
1            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1 2 3 4 5 0  BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1 2 4        DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

FC Raid

  • raid hardware is
    • tp9100's as explained above
    • PCI-X 133/64bit dual port Fibre Channel: LSI Logic / Symbios Logic FC949X Fibre Channel Adapter
    • disks [c-j] are on one fibre controller and [k-r] are on the other
    • each disk gets a nice consistent 72MB/s read from hdparm -Tt or a bonnie of
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xe               2G 61248  80 67635  14 30651   5 50743  59 73587   5 389.0   0
xe               8G 60524  79 62030  13 32270   5 51818  60 73746   5 247.1   0
  • md raid1 to 2 disks gives
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xe               2G 58362  81 63077  15 26750   4 49899  59 72982   6 551.6   0
xe               8G 57651  76 59305  14 27104   5 50656  60 72884   6 379.1   0
  • x1 has the same FC card into a tp9100 which has 16 146GB 10k rpm disks
    • Update: x1's tp9100 is now the same as xe's
    • each disk gets a not-as-fast but still consistent 66MB/s read from hdparm -Tt or a bonnie of
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x1               2G 57336  73 59827  13 30329   5 50518  58 67300   6 357.8   0
x1               8G 55870  71 59301  13 30395   5 50364  60 66941   6 233.4   0
  • using ext3 as that's what Lustre uses
  • machine booted with mem=512M to make testing quicker and more independent of VM caching


  • all the below results are from bonnie++ with 8G files to the xe JBOD with the x-axis being the raid chunk size.
    • 16disk is simply all 16 disk /dev/sd[c-r] in a raid0/5
    • 2x8disk is disks [c-j] in one raid and [k-r] in another.
    • 2x8disk.interlaced is [c-f,k-n] in one raid set and [g-j,o-r] in the other. so the disks are split over the 2 controllers.
      • this seems to get the best aggregate throughput, although the output from the testing is noisier

setra (read-ahead) doesn't affect write speeds. setting it to 16kb instead of letting linux choose it (depending upon chunk size? device size?) bumps up the read speeds at the small chunk size end of the read plots so that they're at about peak. a sketch of the array setup used for these tests is below.
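roughly how one of these md arrays is put together (mdadm syntax is standard; the level, chunk size and the 2x8disk.interlaced split are the ones described above, so adjust for the other layouts):

 # 2x8disk.interlaced raid0: half of each FC controller's disks in each array
 mdadm --create /dev/md0 --level=0 --chunk=64 --raid-devices=8 /dev/sd[c-f] /dev/sd[k-n]
 mdadm --create /dev/md1 --level=0 --chunk=64 --raid-devices=8 /dev/sd[g-j] /dev/sd[o-r]
 # read-ahead can then be tuned per device (value in 512-byte sectors; example value only)
 blockdev --setra 16384 /dev/md0
 mkfs -t ext3 /dev/md0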

  • raid0

  • raid5

  • raid6

  • raid0,5,6,10 comparison

  • scripts for this:

xe raid testing script

xe raid plot script


is120 / SAS

xe now has an is120 (http://www.sgi.com/products/storage/tech/120.html) 12-disk SAS unit attached as well. data on the SGI site is limited, but I think it's actually a version of this gizmo LSI Engenio 1333 (http://www.lsi.com/storage_home/products_home/external_raid/1333_storage_system/index.html) - the Engenio logo on the shipping box is a bit of a hint. 2 SAS cables (which are 4x speed according to the LSI docs) lead to 2 controllers (SGI calls them ESMs) on the unit, so (I think) that means there's 24Gbit to the is120. each disk is 300G maxtor 10k rpm which can read (hdparm) at 80+MB/s. So total disk bandwidth is about 7.2Gbit. 900MB/s is the max transfer speed listed on the 1333 spec sheet.

  • Interrupt Sharing
    • the SAS controller is a dual port PCIe card which lspci says is running at x8 (20Gbit). SGI's xe240 docs suggest that the low profile PCIe slot it's connected to only does 4x though, so that's a bit confusing. it's sharing an interrupt with the IB card. However even at 4x (10Gbit) that's still > 7.2Gbit of disk bandwidth so it might be ok.
      • update: with MSI/MSI-X enabled the cards aren't sharing an interrupt options mptbase mpt_msi_enable=1. but it seems likely that they are sharing a bus, so although they both claim to be x8 devices they're probably sharing an x8, so are effectively getting x4 (10Gbit).
    • simultaneous 4g dd's to the 12 sas devices run at 1m14s instead of 1m15s when a mpithrash process (xe<->x1) is running at the same time - so no significant SAS slowdown is observed. the mpithrash process normally sees 586MB/s/process, and with dd's it runs as slow as 400MB/s/process. some of this is likely competing for cpu time rather than PCIe bandwidth.
  • I don't think any of the xe crashes can be traced back to MSI/MSI-X being on or off. either dodgy SAS disks (should never happen!!! argh!!) or lustre rmmod's seem to be responsible
    • however MSI/MSI-X on fc/ib might have helped x1,x2 tp9100 small file Lustre stability, along with the sync; sync; sleep 30 thing.

12 disks * 2 IDs per disk = 24 SCSI IDs. in /dev/sd* terms, c->n are one 'lun' of each disk and o->z are the other (verified via smartd serial numbers of the disks). I'm not sure how it's wired up inside.

12 simultaneous 1M chunk 4G dd's to the raw devices (in 3 different patterns - cdefghijklmn / cpergtivkxmz / cdefghuvwxyz) give the same numbers: 655MB/s writes and 780MB/s reads. (a sketch of these runs is below.)
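roughly what one of those runs looks like (device letters follow one of the patterns above; this is destructive to the raw disks, and the aggregate rate is 12 x 4096MB / elapsed seconds):

 for d in c d e f g h i j k l m n; do
     dd if=/dev/zero of=/dev/sd$d bs=1M count=4096 &   # writes; reads are dd if=/dev/sd$d of=/dev/null
 done
 wait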

  • Update: when the SAS card is put into the full height x8 PCIe slot, the times for the 1M chunk 4G simultaneous dd's drop. once again there's no significant difference when striping across the two ids of each of the drives.
    • 2.6.21.3, 512m, cfq (approx same as above?) - 52.3s writes, 51.5s reads, so that's 940MB/s writes and 950MB/s reads
    • 2.6.9-55.0.2.EL_lustre.1.6.2smp kernel, 8g ram, deadline - 58.1s for writes and 49.7s for reads, so that's 845MB/s writes and 990MB/s reads
    • same kernel, 512m ram, deadline - 70s for writes, 48.7s for reads, so that's 700MB/s writes and 1010MB/s reads.

separate bonnie++'s to each disk look like

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xe               4G 72177  91 79360  16 37708   6 62652  66 85721   6 299.3   0

which is pretty damn good. these 10k rpm disks get better bandwidth than both of the smaller capacity 10k and 15k rpm disks in the tp9100's.

  • ext3 vs. disks
    • there's a bit of a trick though - ext3 REALLY cares about partitioning and/or alignment in a way that other filesystems don't. performance on an unpartitioned disk, eg. mkfs -t ext3 /dev/sdd, might be wildly different to that of a partitioned disk, mkfs -t ext3 /dev/sdd1, where presumably fdisk has aligned the first sector, or mkfs can otherwise read better alignment info out of the superblock or something... anyway, the upshot is below (xe, 2.6.21.3, 512m ram, cfq io sched): XFS is about the same either way, but ext3 blows on a raw disk.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
               Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
    ext3 sdd     4G           48736  10 22915   4           47912   3 196.7   0
    ext3 sdd1    4G           74748  16 35240   6           78956   5 161.6   0
    xfs sdd      4G           82271  11 37246   6           80165   6 236.2   0
    xfs sdd1     4G           81565  11 37588   6           80539   6 232.1   0
      • well, actually it's a bit more complex than that - ext3 gives fairly different results in different dirs as well as partitions. eg. / /dir1 /dir2 might be all 5MB/s different. xfs is more consistent.

in a 12-disk md raid0 it gets 430MB/s writes and 350MB/s reads with setup cdefghuvwxyz.

in a 6-disk md raid0 it gets 390MB/s writes and 230MB/s reads, although some arrangements work better than others... for instance cdefgh/uvwxyz works as advertised, but cdeijk/fghlmn has fghlmn at a reduced 320 write, 200 read.

in a 4-disk md raid0 it gets 290MB/s writes and 180MB/s reads, and again there's some weirdities. in opqr/stuv/wxyz the opqr goes a bit quicker (300, 200) but overall this seems the best config. in cdef/ghij/klmn the klmn is slow (180, 130) which is ultra-odd as these are the same disks as the o->z config. in cdqr/ghuv/klyz the ghuv is slow (200, 140).

if you attach just one of the SAS cables then all 12 disks are visible with one id each (c->n), and 12 simultaneous dd's to the block devices get 655MB/s writes and 750MB/s reads which is very similar to the 2 cable case, so even 1 host cable (12Gbit?) isn't limiting speeds of the unit. or if you upgrade to the 21st century with a 2.6.20.4 kernel then it's 670MB/s writes and 790MB/s reads.

using this new kernel, setting up raid0's and doing dd's to/from the block device - the 12-disk raid0 got writes at 675MB/s and reads at 600MB/s. writing to each device in a 2*6-disk or 3*4-disk setup simultaneously, they get the same total throughput of 670MB/s writes and 775MB/s reads. when you add a filesystem and a VM into the equation the picture isn't so pretty. separate bonnie's to ext3 get ...???

  • Update: with SAS card on PCI x8 2.6.21.3, 512m, cfq - 12-disk raid0 is 730MB/s writes, 867MB/s reads. 2*6-disk and 3*4-disk raid0's get similar writes and 964MB/s reads.

on Fri 27 Apr 2007

  • /dev/sdc and /dev/sdi were replaced with new disks via hotswap (ie. just yank them out). the confusing thing is that the SCSI system then relabeled the drives: the unique ids of the disks are now /dev/sd[c-l,w,y] instead of /dev/sd[c-n]. it looks like when the c,i disks were removed, all the remaining ids moved down a cog... so before the removal the 12 disks were /dev/sd[c-n] and /dev/sd[o-z] (same disks down the other SAS cable). after the removal the disks were /dev/sd[c-l] and /dev/sd[m-v], so that when the 2 replacement disks were added they arrived as /dev/sd[w,y] for one disk and /dev/sd[x,z] for the other.
  • this needs more looking into to make sure I got the above correct and that the behaviour is repeatable. I did reboot the machine before I worked out that the drives had changed letters (and had seen ext3 die from corruption on a bunch of raid tests as the same drive was in 2 parts of the raid set), so I'm not entirely sure that the mapping was the same before and after the reboot, but I think it probably was.
  • the disks weren't being used at the time of the disk replacement at all - it may have made a difference if they were already in a raid set... ???

bonnie++ to local SAS

the 12 disk raid was setup with various raid chunk sizes and in single 12-disk, 2x6-disk and 3x4-disk configurations. bonnie++'s semaphore sync option was used to keep the multiple bonnie++'s on xe synchronised between all the phases of the tests. a 3G file size was used with the default chunk size in bonnie++. xe was booted into 512M and the lustre kernel. a sketch of a synchronised run is below.
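a sketch of a synchronised multi-bonnie++ run - the -p/-y semaphore flags are from memory of bonnie++ 1.03, so treat them as an assumption and check the man page; the mount points and output names are just examples:

 bonnie++ -p 3                                           # create the semaphore for 3 synchronised runs
 for i in 0 1 2; do
     bonnie++ -d /mnt/md$i -s 3072 -u root -y > bonnie.md$i.out &
 done
 wait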

no attempt was made to use both SAS paths in this test. so drive IDs are all down one controller in the simplest possible way - eg. /dev/sd[c-n] (or /dev/sd[c-l,w,y] after the drive replacement)

the 3 plots shown are raid0 with the different disk arrangements, raid5 with the same, and then a comparison of raid0 and raid5 on the 2x6disk setup.

take-home points are

  • the optimal raid chunk size for each disk layout varies greatly in the raid5 read tests so needs to be chosen carefully
  • 2x6disk config with 64KB raid chunk is probably best. that sees approx raid0 write/read of 670/520 MB/s and raid5 write/read of 470/370 MB/s
  • bonnie reads from ext3 on raid are a lot slower than dd from block raid devices (from above, 12-disk write/read of 655/750)

Lustre to SAS

the 12 sas disks are arranged as 1,2 or 3 OSTs. IB, 4cpuMds, nodebug, 64K raid stripe, 1m lustre stripe. all SAS disks accessed down 1 of the 4x SAS cables (ie. /dev/sd[c-l,w,y])

so that's raid0 max write/read of 400/500 MB/s, and raid5 of 300/400 MB/s. not bad I guess for one unit, but disappointingly less than the ~1GB/s you'd think the disks were capable of. the write speeds are what drop very significantly from the local bonnie++ numbers when Lustre is added into the equation - raid0 writes dropped from 670 to 400, and raid5 writes from 470 to 300.

my fancy new MPI version of bonnie++ lets me run one bonnie++ per node (mpirun N in LAM terminology, although I'm actually using OpenMPI) or many (C). a standard non-MPI bonnie++ run with far looser synchronisation is included for comparison.

so x is a logscale now and the N curves go to 16 clients (cores, nodes), whilst the C curves go to 64 cores (16 nodes).

  • results are approx the same as the previous parallel by shell bonnie++ runs, so that's good and means I don't have to redo them all...!!!
  • write speeds definitely scale more strongly with number of nodes than they do with number of bonnie++ processes, implying that traffic is aggregated on its way from the node to the OSS, so the OSS just sees it as lots of i/o from one node... or that IB on a node is a limiting factor and a node trying to do more i/o can't make the IB run any faster.
  • read curves are so flat that you can't really see any scaling trends at all, so all that can be said is that C runs tend to trample on each other a bit and reduce the overall throughput compared to N runs.

SATA Disks

an increasing number of the SATA disks in the xe are slowing down to <45MB/s from a peak of 60+. find the remaining fast-ish sata disks with:

cexec -p hdparm -Tt /dev/sdb | grep buffered

not sure why this is. turning off all ganglia, pbs_mom, etc. might help, but then why is it always the same disks that are slow?

  • all SATA drives have firmware versions V44OA96A except for: x4 sda,sdb and x6 sda which have V44OA80A
  • read speed seems loosely correlated with the SMART metric Raw_Read_Error_Rate as the 2 slowest disks (sdb on x4,x17):
 x4:  Timing buffered disk reads:   74 MB in  3.04 seconds =  24.37 MB/sec
x17:  Timing buffered disk reads:   86 MB in  3.07 seconds =  28.01 MB/sec
x13:  Timing buffered disk reads:  124 MB in  3.05 seconds =  40.65 MB/sec
 x3:  Timing buffered disk reads:  130 MB in  3.08 seconds =  42.15 MB/sec
 ...

also have the highest Raw_Read_Error_Rate:

smartctl-a.sdb.x17.cluster:  1 Raw_Read_Error_Rate     0x000b   088   088   016    Pre-fail  Always       -       3145766
smartctl-a.sdb.x4.cluster:   1 Raw_Read_Error_Rate     0x000b   085   085   016    Pre-fail  Always       -       2425061
smartctl-a.sda.x1.cluster:   1 Raw_Read_Error_Rate     0x000b   091   091   016    Pre-fail  Always       -       1835032
smartctl-a.sdb.x7.cluster:   1 Raw_Read_Error_Rate     0x000b   091   091   016    Pre-fail  Always       -       1310757
...
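roughly how those SMART numbers are gathered and ranked across nodes (a sketch using the same cexec/smartctl tools as above; the ssh loop and output file naming are just examples matching the filenames shown):

 # quick look across all nodes
 cexec smartctl -a /dev/sdb | grep Raw_Read_Error_Rate
 # or collect to files and sort by the raw value (last field)
 for n in $(seq 1 19); do
     ssh x$n smartctl -a /dev/sdb > smartctl-a.sdb.x$n.cluster
 done
 grep Raw_Read_Error_Rate smartctl-a.sdb.x*.cluster | sort -rn -k11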

Hitachi

here's an investigation of the SATA disks in Xe backend nodes - 2x250G Hitachi Deskstar HDT722525DLA380.

hdparm sata

simple hdparm -t read tests give wildly differing and generally slow speeds. rms/stddev ->

 sda - 47.3 +/- 8.0 MB/s (min 35.0)
 sdb - 48.1 +/- 9.3 MB/s (min 23.2)

for comparison, the speeds of xe's SAS disks, ac's FC disks, and O(100's) of old IDE disks in the lc and mckenzie clusters are very uniform - stddev of +/- 0.7MB/s.

this variability is perhaps the most disturbing thing about the SATA disks in Xe.

as an experiment I put 6 of the Xe SATA disks into the is120 disk tray on Xe. here they behaved faster but still lots of scatter - 58.3 +/- 8.5 MB/s (min 43.1).

bonnie++ sata

a better estimate of disk performance/health comes from bonnie++ where the 6 SATA disks in the is120 look consistent(!) at 31MB/s writes and 67MB/s reads.

in nodes, the Hitachi disks get rms bonnie++ writes of 23MB/s and reads 49MB/s with stddev about 10. so once again, slow and variable.

as another data point I put a cheap Seagate SATA 320g 7200.10 disk into a node and it saw hdparm of 73MB/s and bonnie++ of 61MB/s writes and 70MB/s reads.

XFS is a more consistent and faster filesystem than ext3, but for these hitachi SATA disks it makes no difference to the slow speeds and huge scatter in the speeds. eg. on 5 runs over x9 to x19's sdb3 with the 2.6.21.1-netswap-v12-1 kernel, 4g bonnie++ -f:

          block writes    block reads
xfs  rms    28605.3         52097.5
     ave    25655.1         50423.5
   sigma    12749.1         13200.7
     min     7092.0         25562.0
     max    47592.0         70910.0

ext3 rms    26659.3         48304.7
     ave    23835.3         46417.6
   sigma    12032.8         13472.4
     min     6375.0         22949.0
     max    48096.0         73621.0

write cache

SGI raised the issue of write cache being enabled on the Seagate SATA disk and not enabled on the Hitachi SATA disk. for some reason (ahci BIOS?) I can't enable write cache on the Hitachi disks - hdparm -W1, sdparm --set=WCE=1, and blktool wcache on all fail to do anything. However the write cache can be toggled on the 15k rpm 73G FC 8m cache ST373453FC disks in a tp9100, on the 10k rpm 300g SAS 16m cache 8J300S0 disks in the is120, and on the 146g SAS disks in xe.
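for reference, the check/toggle attempts look roughly like this (hdparm/sdparm/blktool as named in the text; -W with no value just reports the current state):

 hdparm -W /dev/sda               # report write cache state
 hdparm -W1 /dev/sda              # try to enable it (fails on the Hitachis here)
 sdparm --get=WCE /dev/sda        # SCSI/SAT view of the same bit
 sdparm --set=WCE=1 /dev/sda
 blktool /dev/sda wcache on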

here's what happens when write cache is on and off for the is120 SAS disks and the tp9100 FC disks. there is no reason to expect that SATA disks would behave any differently. 2.6.21.1-netswap-v12-1 kernel, 512m ram, bonnie++, XFS filesystem.

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fc wc off        1G 73972  77 77667   8 39988   5 71943  72 76752   6 491.3   0
fc wc off        2G 72006  75 74534   8 38753   5 72946  71 83386   6 368.2   0
fc wc off        8G 74379  86 74023   8 37295   5 80520  79 83292   6 278.5   0

fc wc on         1G 71072  73 77028   8 36517   4 80631  80 75524   4 566.6   0
fc wc on         2G 75664  79 72598   8 36048   5 71894  70 83432   6 386.1   0
fc wc on         8G 70713  74 72784   8 34877   5 78795  81 83407   6 279.6   0

sas wc off       1G 81897  95 85798  11 39161   6 67592  75 83603   5 356.7   0
sas wc off       2G 80564  93 83328  11 36533   6 76342  85 87347   6 267.9   0
sas wc off       8G 78172  91 81222  11 34359   6 83454  92 87377   7 212.0   0

sas wc on        1G 81226  94 88301  11 39967   7 69735  77 85417   6 371.3   0
sas wc on        2G 79805  93 84472  11 35911   6 78912  88 87432   6 275.4   0
sas wc on        8G 79511  92 80507  10 34999   6 83283  92 87377   6 213.3   0

so write cache on or off makes no measurable difference to any large file tests.

                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
fc wc off       128   643   5 +++++ +++   621   3   655   5 +++++ +++   390   1
fc wc off       256   606   7 51752  69   642   3   573   6 57076  74   329   1
fc wc off       512   559  10 19932  31   602   3   524   9  6407  11   265   1

fc wc on        128  2827  20 +++++ +++  4434  18  2817  20 +++++ +++   666   2
fc wc on        256  2140  21 46312  63  2818  13  2043  20 54609  68   492   2
fc wc on        512  1528  21 13662  20  1803   8  1482  21  6348  11   393   2

sas wc off      128   724   5 +++++ +++   716   3   736   6 +++++ +++   444   2
sas wc off      256   608   7 17256  28   625   3   653   7 40237  62   297   1
sas wc off      512   630  10  9904  17   621   3   591  10  3785   7   300   1

sas wc on       128  3883  26 +++++ +++  5004  23  3833  26 +++++ +++   966   5
sas wc on       256  2514  25 23344  35  3092  16  2272  22 39073  60   601   3
sas wc on       512  1677  26  9811  17  2225  11  1570  24  3495   7   514   3

but write cache makes a large difference (several-fold faster creates and deletes) for zero-sized files - essentially a metadata load.

zcav

zcav shows per-sector speeds. outer (first) part of disk is fastest. here's what the hitachi's look like from a diskless 2.6.21.5-ql4-12123-1 kernel, 512m ram.

all these plots are bezier smoothed in gnuplot. the zcav/zcav_write line is e.g. zcav -b 50 -c 3 -u root /dev/sda
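
a sketch of how one of those curves is produced and plotted - the output file name is made up, and this assumes zcav's usual position/throughput column output, which gnuplot's bezier smoothing then tidies up:

zcav -b 50 -c 3 -u root /dev/sdb > zcav.sdb.x4
gnuplot <<'EOF'
set xlabel "position on disk"
set ylabel "MB/s"
plot "zcav.sdb.x4" using 1:2 smooth bezier with lines title "x4 sdb"
EOF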

also I hacked the zcav program (part of bonnie++) to destructively do writes to sdb. basically just s/read/write/. here are the results for 4 sas disks and for all the sdb's. one of the sdb's looks good. the others are unhappy.

for comparison here's the SAS disks on xe with a 2.6.21.3 kernel.

Update: after SGI installed rubber grommets around the 5 or 6 tiny fans in the xe210 nodes, the read and write plots now look like:

which is a definite improvement. post grommets, most disks now read ok, and more than one disk is at a decent write speed. it's now clear that around 70MB/s reads and 60MB/s writes is about the best these disks are capable of.

  • sda reads - x3, x4 disks are a bit slow at the outer edge of the platters, but possibly acceptable
  • sdb reads - x3, x4 disks are too slow. x6 is extremely slow at the inner edge
  • sdb writes - the large scatter shows that disk problems are not resolved. x4,x10,x17 are <45MB/s. outer edge of x6 is bad. x3,x8,x9 are <55MB/s. x18 is outstandingly good at 65MB/s with all the others being in the 55-64MB/s range.

Conclusions:

  • x6's sdb read shows the same drop-off at the inner edge as in the previous round of tests, so this is most likely a bad disk.
  • x18's good write performance also was the same between tests - can't see any reason why this disk is particularly good
  • x3,x4's sdb read performance was uniformly bad over the rounds of testing. x3,x4 could be a bad vibration site as sda read performance isn't great either, and x3,x4 are at the low end of the sdb write range. it's unlikely all 4 disks are bad. a simple hdparm -Tt can also see x3,x4's below normal performance.
  • overall the cluster's SATA write performance probably isn't good enough

iSCSI

iSCSI was set up with an eye to swap over iSCSI via GigE or IPoIB to files on Lustre.

i/o to iSCSI

xe mounts a lustre filesystem over IB, with x1 serving 2xRaid0 and x17 being MDS in ramdisk. 16 40g non-striped files on Lustre are created. each file was set up as an iSCSI target for one node and formatted as an ext3 filesystem (a rough sketch of the target/initiator setup follows the list below). I ran an mpi bonnie++ to all of these at the same time, and here's what the scalability looks like

  • iSCSI over GigE saturates the single gigE connection into xe with 2 clients
  • there's almost a factor of 2 improvement in total throughput when using IPoIB rather than gigE
  • IPoIB saturates at 4 to 8 clients
  • doing i/o to a striped file on Lustre gives universally worse performance (not shown)
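
the per-node target/initiator setup isn't written down above. assuming iSCSI Enterprise Target on xe and open-iscsi on the nodes (an assumption - the exact software used isn't recorded here, and all names are illustrative), it would look roughly like:

# on xe: one fileio LUN per node, backed by a 40g file on lustre (/etc/ietd.conf)
Target iqn.2007-08.cluster.xe:x18
        Lun 0 Path=/mnt/testfs/iscsi-x18,Type=fileio

# on the node (open-iscsi):
iscsiadm -m discovery -t sendtargets -p xe
iscsiadm -m node -T iqn.2007-08.cluster.xe:x18 -p xe --login
mkfs -t ext3 /dev/sda    # or whatever device the iSCSI disk appears as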


below is a bonnie++ test over iSCSI and GigE to a 40G un-striped file on Lustre. x18 configured with 512M ram, 2.6.21.1-netswap-v12 kernel.

  • Summary: wire speed writes, 60% wire speed reads, and metadata locally cached so very fast. writing to lustre or ramdisk or SAS made no major difference to this.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x18              1G 76583  98 130695  26 34794   5 73663  93 111153 7  8435   9
x18              2G 76305  97 115314  24 32200   5 73553  92 110879 7  6041   5
x18              8G 74325  96 105521  21 24097   4 48782  58  60248 4   486.7 0
x18             16G 71492  94 103929  21 24231   4 48800  58  60569 4   127.3 0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 64 78250  96 +++++ +++ 84245  99 81162 100 +++++ +++ 83393 100
                256 58099  82 446989 100 32477 44 58480  83 +++++ +++ 19130  27
                512 40941  63 34785  25 12333  20 34735  54 27901  22  5903  10

So this seems to be an interesting way to provide 'local' disk or swap to client nodes. The bottleneck is then the number of GigE lines into the 'file server' node.

Swap to iSCSI

Swapping to an iSCSI disk is pretty easy too. patch your client's 2.6.21 kernel with Peter Zijlstra's v12-2.6.21 network deadlock avoidance / netswap patches (http://programming.kicks-ass.net/kernel-patches/vm_deadlock/), fire up the iSCSI client as above, and then (assuming /dev/sda is your iSCSI disk to be used for swap) run mkswap /dev/sda ; swapon -v /dev/sda and bob's your uncle.

Patches are required as swapping over the network in a low memory situation is inherently risky. These patches reserve emergency memory for the use of the network stack and VM so that swap-related networking traffic will succeed.

Peter doesn't recommend swapping to NFS with kernel 2.6.21 as apparently the NFS in 2.6.21 is a bit busted. however my testing with swap to NFS with 2.6.21-rc3-netswap20070319 seemed ok. iSCSI performance seems a lot better than NFS anyway, so not sure why you'd use NFS...

swapping out an application via iSCSI over gigE/IPoIB to an un-striped file on Lustre works at close to 100MB/s.

fg'ing a stopped job that's been kicked entirely out to swap, and letting it swap itself back in goes at a slower ~30-40MB/s.

Fake Fast Local Disk

it seems possible to reuse most of the ideas from the above iSCSI setup to create fast 'local' disks (ie. local metadata rates) that are actually globally available, with i/o over IB. so this isn't using iSER or SRP... or any of those IB protocols that never seem to actually get implemented.

instead mount the lustre filesystem on a node (via o2ib like usual), create a large file on lustre (striped or not), make a loopback filesystem on that file, mount it on the node, and off you go... ie. create with:

mount -t lustre x17ib@o2ib:/testfs /mnt/testfs
dd if=/dev/zero of=/mnt/testfs/big40 bs=1M count=40000
losetup /dev/loop0 /mnt/testfs/big40
mkfs -t ext3 /dev/loop0
mkdir -p /mnt/yo0
mount /dev/loop0 /mnt/yo0/
chmod 1777 /mnt/yo0/
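
if a striped backing file is wanted, the striping has to be set when the file is created, e.g. with lfs setstripe before the dd. a sketch - the option-style flags here may differ on older 1.6 lfs builds, which also accept positional size/index/count arguments:

# 1M stripes across 2 OSTs, any starting index - run this before the dd so the layout is set at create time
lfs setstripe -s 1M -i -1 -c 2 /mnt/testfs/big40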

and then delete with:

umount /mnt/yo0
losetup -d /dev/loop0
umount /mnt/testfs/

once the 'disk' is set up, it can be mounted and unmounted with:

mount -o loop -t ext3 /mnt/testfs/big40 /mnt/yo0
... use
umount /mnt/yo0
losetup -d /dev/loop0

where the losetup -d is required, otherwise loopback devices are never freed and the loop device numbers just keep incrementing.

bow before the massive metadata rates on a global (well, kinda) filesystem.

  • Summary
    • all file i/o uses the local page caches just like local disk does
    • the metadata rates are the same as for local disk
    • big file i/o goes over IB to lustre and gets about 10%-20% better(!) write speeds than native lustre
    • reads are about 60% of lustre read speed
    • writes backed by striped lustre loopbacks go faster than non-striped, whilst striped reads are slower

2.6.9-42.0.10.EL_lustre-1.6.0.1smp with ext3, non-striped lustre: (machine with 8g ram)

Version  1.03       ------Sequential Create------ --------Random Create--------
x12                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 64 84228  89 +++++ +++ 86451  89 83134  88 +++++ +++ 85161  89
                128 77929  89 +++++ +++ 83985  89 75981  86 +++++ +++ 81148  89
                256 71690  87 477482 89 48520  56 70664  85 +++++ +++ 23545  29
                512 52607  69 475765 89 11511  14 48361  64 613948 89  5740   8
               1024 48351  69 462184 89  5453   8 46999  67 608503 89  3058   5

and there's zero load on OSS and MDS during the smaller runs. lustre thinks only one file is open, no lock contention. let's try non-zero sized files, but still small (16B to 100KB) so mostly metadata dominated: (machine with 8g ram)

Version  1.03       ------Sequential Create------ --------Random Create--------
x12                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
    16:100000:16/64  1827  17 +++++ +++ 17606  50  2208  21 +++++ +++ 26256  86
   128:100000:16/64  1648  16   449   2   695   1  1596  16    86   0  1855   5

on large i/o (machine rebooted with 512m ram):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x12            128M 72156  99 +++++ +++ +++++ +++ 91639  99 +++++ +++ +++++ +++
x12            256M 72683  99 485796  99 52329 11 69566  77 76212   6 441.9   0
x12            512M 72887  99 165825  31 62797 12 68619  77 79420   5 175.4   0
x12              1G 78097  98 131002  26 67751 13 63895  81 79575   5 129.9   0
x12              2G 77940  98 134448  27 66699 17 65141  83 81391   5 111.7   0
x12              4G 77697  98 134256  27 67460 14 65195  83 79830   6  99.8   0
x12              8G 77014  97 131714  26 67819 17 65392  84 79036   6  87.5   0

which is a little slow at reads, but not terrible. striping the file across the 2 lustre OSTs should improve performance whilst the metadata rates shouldn't get much worse. here's the same test with the backing file striped across 2 osts on 1 oss. (machine with 512m ram)

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x12            128M 73316  99 +++++ +++ +++++ +++ 91707  99 +++++ +++ +++++ +++
x12            256M 72631  99 427854  90 51752   9 60759  70 64470   4 538.9   0
x12            512M 72423  99 246045  53 56322  13 57555  67 63500   4 184.4   0
x12              1G 71383  97 235585  46 58490  16 59704  70 65358   5 129.1   0
x12              2G 71894  98 228749  48 56504  15 58417  68 63242   4 109.6   0
x12              4G 78045  98 219182  46 57612  16 59031  80 64611   5  98.8   0
x12              8G 70677  98 219016  45 57535  15 59420  69 64285   4  86.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 64 81564  97 +++++ +++ 84624  95 80576  96 +++++ +++ 85554  99
                128 73489  95 +++++ +++ 84933  99 72583  93 +++++ +++ 81601  99
                256 65628  92 148107 82 53049  75 62658  88 161134 99 36261  55
                512 56676  86 112911 80 13359  20 49657  74 87099  71 10655  17
               1024 44838  73  3913   3  3190   5 45386  73  2348   2   979   2

so that's much faster big writes, but slower reads. not sure I understand that. metadata appears slower, but that's just because the node has 512m of ram and not the 8g in the above test.


taking off the loopback layer and doing i/o to lustre instead, metadata rates are exceedingly low in comparison (and have been analysed in other sections), so I won't bother to show them. but big file i/o is: (machine with 512m ram)

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
 non-striped:
x12            128M 82909  99 124302  67 90004  99 89981  99 170597 100 +++++ +++ 
x12            256M 83084  99 142505  78 91169  99 90176  99 173159  99 +++++ +++
x12            512M 82526  99 110843  60 70100  97 71913  97 113839  99  1503  10
x12              1G 74414  99 130111  72 71537  99 70845  99 113637  99 764.6   7
x12              2G 74760  99 121605  68 71180  99 70960  99 112843  99 592.0   6
x12              4G 74436  99 122900  67 70235  98 71656  99 113886  99 517.6   6
x12              8G 73548  99 121984  67 69810  99 70903  99 114317  99 483.9   5
 striped:
x12            128M 82165  99 178758  99 87978  99 89914 100 169667 100 +++++ +++
x12            256M 82609  99 177694  99 88034  99 90425  99 170166 100 +++++ +++
x12            512M 81175  98 162853  91 68440  98 72959  99 108344  99  1898  14
x12              1G 74927  99 175545  93 69474  99 70921  99 108706  99 772.2   7
x12              2G 74235  99 180464  97 69759  99 71413  99 110044  99 576.8   6
x12              4G 74590  99 183391  98 68781  98 71322  99 109869  99 500.3   5
x12              8G 73554  99 181112  99 68721  99 70639  99 108354  99 483.7   5

and taking off another layer again, the raw disk (actually a raw fc md 8-disk raid0) performance is: (machine with 512m ram)

Version  1.03       ------Sequential Create------ --------Random Create--------
x2                  -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 64 83309  98 +++++ +++ 79531  91 83088  97 +++++ +++ 86490 100
                128 77609  98 +++++ +++ 85271  99 75764  95 +++++ +++ 82647  99
                256 60415  83 307286 99 42521  56 62434  87 254417 98 20749  30
                512 54449  82 109030 74 12853  20 49841  76 72260  59  7031  12
               1024 47961  78  7217   5  6037  11 41890  68  4536   4  3928   8
or with 8g of ram:
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                 64 81876  98 +++++ +++ 86616 100 82924  98 +++++ +++ 84999 100
                128 76866  98 +++++ +++ 84830  99 69894  88 +++++ +++ 81624  99
                256 68219  92 475896 100 41685 53 65176  88 +++++ +++ 20646  28
                512 61219  89 474143 100 12885 19 48597  71 613828  99 6788  11
               1024 50352  78 466672  99  7069 11 43718  69 602088 100 4008   8

which is the same metadata rates as going via the loopback device.

large local file i/o (without loopback and lustre) is faster overall, and in particular at the small file end, indicating that the loopback device chews up some ram (or maybe it's caching pages twice?) so the memory left for caching is reduced.

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x2             128M 73095  98 +++++ +++ +++++ +++ 90316  99 +++++ +++ +++++ +++
x2             256M 72417  99 497033  99 140256 21 91994 99 +++++ +++ +++++ +++
x2             512M 71746  98 287194  63 78749 12 82802  92 209694 16  4843   3
x2               1G 77918  98 196828  41 84407 14 74542  95 176751 12  1117   1
x2               2G 77364  97 197818  39 82139 12 74955  96 178797 13 755.3   1
x2               4G 76654  96 192318  40 80824 13 75171  96 177510 13 594.5   1
x2               8G 77415  98 190082  40 83212 13 75069  97 177902 13 486.4   1


as there's been significant work done on loopback in more recent kernels, it's probably worth trying one out. so over the loopback to lustre again with a patchless 2.6.21 kernel (2.6.21.5-ql4-12123 with 512m ram):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
 non-striped:
x12            128M 79648  98 +++++ +++ +++++ +++ 91266  98 +++++ +++ +++++ +++
x12            256M 80577  99 121651  22 46328   8 79087 86 149902  11 561.9   0
x12            512M 79214  99 125270  25 58903  10 76333 82 169140   9 179.2   0
x12              1G 78711  97 104556  21 63247  11 83092 91 167043  12 133.2   0
x12              2G 71558  89 104218  19 59100  10 86652 95 169350  12 111.8   0
x12              4G 43307  53  21161   4 56703   9 65743 71 169863   9 100.7   0
x12              8G 72987  95  68835  13 57292  10 82749 97 168610  12  89.8   0
 striped:
x12            128M 77649  95 195880  37 +++++ +++ 92029  99 +++++ +++ +++++ +++
x12            256M 79538  98 207232  39 96931  16 73016  80 296909  25 600.7   0
x12            512M 79120  98 225059  42 84653  14 84748  94 299972  24 184.4   0
x12              1G 79311  99  43567   9 86541  14 80913  89 280136  20 129.7   0
x12              2G 77980  97 169843  33 80579  15 82554  91 298408  24 109.6   0
x12              4G 78255  97 162985  31 82264  14 86364  93 308963  23  99.1   0
x12              8G 36145  45 130442  25 73408  12 87381  96 288672  22  88.0   0
 non-striped metadata:
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                64 78374  95 +++++ +++ 84748  99 79149  98 +++++ +++ 83973  99
               128 37261  48 +++++ +++ 78452  98 70158  90 +++++ +++ 79655  99
               256 56365  80 94341  56 17367  25 51654  74 168267 100 12201  19
               512 40524  62 24899  17  8750  13 53291  82 14114  11  4729   8
              1024  (kaboom - lustre or loopback screwed up - Expected 1048576 files but only got 1048577)
              1024 29138  47  3879   3  3392   5 26584  43  3337   3   962   2
or with 8g of ram:
                16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                64 76667  97 +++++ +++ 83189  99 77471  95 +++++ +++ 85477  99
               128 73380  97 +++++ +++ 80099  99 67046  89 +++++ +++ 79076 100
               256 64384  91 501567 99 55722  75 63134  89 +++++ +++ 23581  33
               512 61282  92 493828 99 12559  18 60344  89 656636 100 6185   9
              1024 54117  85 473407 99  5653   8 53991  84 630141 100 3209   5
striped metadata:
Version  1.03       ------Sequential Create------ --------Random Create--------
x12                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                64 77812  96 +++++ +++ 84960 100 78260  97 +++++ +++ 87045 100
               128 71789  96 +++++ +++ 82914  99 70483  92 +++++ +++ 80925 100
               256 50736  73 115843 68 18204  26 48132  69 163490 100 12450  19
               512 46153  70 35308  26 16001  24 33450  51 21343  17  8670  14
              1024 27756  45  3687   2  4622   7 24760  40  3380   3  1278   2

so that is a LOT better than the 2.6.9 kernel at large striped reads (was 64MB/s, now 280MB/s), but worse at writes (was 220MB/s, now an erratic 130-160MB/s).

and using xfs as the filesystem instead of ext3 we see with a 512m ram, patchless 2.6.21.5-ql4-12123, striped lustre file:

 Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
 Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x12            128M 86319 100 +++++ +++ +++++ +++ 91543 100 +++++ +++ +++++ +++
x12            256M 80123  92 212284  21 31501   3 69843  93 294587  22 642.0   0
x12            512M 85736  99 180648  18 47326   6 88659  98 288020  21 179.7   0
x12              1G 40173  46 179287  16 27014   3 85657  93 305233  19 128.8   0
x12              2G 10855  12 174763  17 28946   3 91699  99 305303  17 113.2   0
x12              4G 84964  98 160715  16 30050   4 90834  98 305141  19 107.3   0
x12              8G 83637  98 159883  16 27043   3 86765  99 299432  17 101.7   0
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16   896   3 +++++ +++ 12925  42 14595  60 +++++ +++  6062  27
                32 13683  53 +++++ +++ 12498  36 12821  51 +++++ +++  5623  22
                64  8602  39 +++++ +++ 12863  46  7234  40 +++++ +++  5510  22
               128  6471  42 141294 91 11343  48  3846  26 142412 90  4427  20
               256  4114  39 21777  32  5110  25  3825  37 39300  61  2293  11
               512  3071  46 13083  22  5486  28  2973  45  4545   8  1905  10
              1024  2232  57 12239  19  4382  20  2396  61   471   1   232   1

which is better again at reads - now 300MB/s - and consistent writes at 160MB/s. however the small file performance isn't stellar with XFS. the creates are especially slow, but all the metadata ops are roughly 10x slower than ext3.

ext3 with 512m and an all 2.6.20.15-lustre-1.6.0.1-rc1-ql6 lustre setup. 2 raid0 osts on x1, mds on x17 like usual

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
x12            128M 78757  98 +++++ +++ +++++ +++ 91907  99 +++++ +++ +++++ +++
x12            256M 66811  99 363591  76 92605  27 70352  80 299893  21 526.1   0
x12            512M 77062  97 225773  48 98259  28 78223  86 294959  23 119.7   0
x12              1G 77779  98 210867  46 87436  22 86318  96 263253  22 103.1   0
x12              2G 77835  98 188822  42 57098  11 86446  95 277524  20  93.1   0
x12              4G 67827  98 47465  10 84544  19 90536  98 293536  16  71.8   0
x12              8G 22369  28 75100  16 78600  19 86958  96 309570  20  70.2   0
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
                64 75602  93 +++++ +++ 84400  99 76720  96 +++++ +++ 85669  99
               128 70588  95 +++++ +++ 78892  99 70081  93 +++++ +++ 79402  99
               256 54633  78 166117  99 17705  26 63806  94 171662  99 13109  20
               512 44551  70 21755  16 13657  21 36127  56 15343  13  8132  13
              1024 28977  47  3201   2  4040   6 31971  52  3292   3  1174   2
              2048 18197  31   501   0   301   0 19888  34   337   0   164   0

Later: I can't remember if this last run was to a striped lustre file or not - I presume it was. assuming that, then patched 2.6.20 vs. patchless 2.6.21 is very similar, as you'd expect if metadata isn't critical.

and if you want to save the time of building scratch disks of various sizes, then a bunch of (say) 20g loopback files can be pre-created, multiple loop devices set up on a node, and the loop0, loop1, ... raided together with md or with lvm to make a bigger scratch disk (see the sketch below). layers upon layers upon layers...
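
a sketch of that layering - the paths, sizes and md device number are made up:

# two pre-created 20g files on lustre become one ~40g raid0 scratch disk
losetup /dev/loop0 /mnt/testfs/scratch-20g-a
losetup /dev/loop1 /mnt/testfs/scratch-20g-b
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/loop0 /dev/loop1
mkfs -t ext2 /dev/md1
mkdir -p /mnt/scratch
mount /dev/md1 /mnt/scratch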

Blas and Lapack

-O3 compiler options used throughout.

blas

g77 + mkl 9.0 fails 5 of the netlib blas1 tests. goto 1.15 passes them all. netlib's reference blas implementation (3.1.1) passes them all when compiled with gfortran or g77, but fails 10 tests when compiled with ifort 10.0.017

lapack

gfortran compiled lapack gets 269 failures in 11 tests, which is the best of them.

ifort compiled lapack gets 529 failures over 25 tests, mostly in dgd,sgd,dgg,sgg.

gfortran/g77 with goto blas 1.16 (includes a fixed core2 dgemv) gets 271 failures in 13 tests, so that's close to minimum.

g77 + mkl blas hangs whilst running the lapack linear tests, having failed ~20% of them up to that point. if the hung test programs are killed to allow the next test to run, then eventually (after ~10 kills) it gets 103796 failures in 131 tests, though it still hasn't run all the tests.

ifort + mkl gets 535 failures across tests, so a lot better than g77 + mkl.

Apps

  • tests 3,6 are nearly totally independent of i/o speeds, so they are left out of some tests
  • kernel.org kernels and 2.6.9-55.EL_lustre-1.6.1smp have their default MaxReadReq of 128 unless otherwise stated. kernel 2.6.9-42.0.10.EL_lustre-1.6.0.1smp has a default of 512
oss/ost   mds/mgs   client
1         11        13
2         10        12
xe        16        14


  • 1OSS, 2OST, ramdisk MDS, 16-disk fc raid0 over IB, intel10 mkl9, all kernels 2.6.9-42.0.10.EL_lustre-1.6.0.1smp, 1M lustre stripe
test time
e 4:31:07.31 (x1), 4:33:48.52 (x2)
3 1:24:57.73 (x1), 1:24:53.04 (x2)
4 3:27:49.92 (x1), 3:29:54.27 (x2)
6 5:55:19.48 (x1), 5:55:16.59 (x2)


  • same but to 1OSS, 2OST 12-disk sas on xe (dd write @ 237 MB/s)
test time
e 4:33:47.51, 4:39:27.98
3 1:24:47.28, 1:24:58.46
4 3:28:53.57, 3:28:57.37
6 5:54:29.81, 5:54:49.06


  • same as top but to loopback ext2 fs backed by 50g file on Lustre
test time
e 4:33:59.11
3 1:24:10.35
4 4:02:27.64
6 5:52:16.88


  • to xe sas raid0 (same as 2nd top) but over gigE
test time
e 5:16:37.91, 5:10:40.35
3 1:25:42.83
4 4:02:21.51, 4:02:17.60
6 -


  • same as above (xe sas raid0 gigE) but to loopback ext2 fs backed by 50g file on Lustre
test time
e 4:43:32.08, 4:43:32.41
3 1:24:11.87
4 4:14:11.53, 4:12:47.80
6 5:52:05.80


  • same as above but with /proc/sys/vm/dirty_expire_centisecs set to 180000 (30mins) (default is 3000 = 30s)
test time
e 4:40:24.84
3 -
4 4:13:14.29
6 -


  • same as 2nd top (xe) but to raid5


  • same as top, but to raid5 (dd write @ 162 MB/s)
test time
e 4:53:54.43, 4:54:18.03
3 1:25:10.82, 1:25:04.53
4 3:36:27.48, 3:35:36.38
6 5:55:48.45, 5:55:22.82


  • same as above (ie. fc x1 raid5), but to loopback ext2 fs backed by 50g file on Lustre
test time
e 4:42:43.06
3 1:24:06.09
4 4:10:00.02
6 5:51:46.69


  • same as top, but to raid5 over GigE
test time
e 5:28:37.98
3 1:25:33.21
4 4:08:19.37
6 5:57:55.91


  • same as top, but to 50g ext2 filesystem on loopback to raid5 over GigE
test time
e 4:54:47.09
3 1:24:07.01
4 4:22:03.44
6 5:51:30.15
  • same as top, but kernel 2.6.9-55.EL_lustre-1.6.1smp and to raid5 over gigE
test time
e 5:01:50.45
3 -
4 3:45:56.26
6 -


  • same as above, but to loopback ext2 fs backed by 50g file on Lustre
test time
e 4:38:21.30
3 -
4 3:56:46.48
6 -


  • same as top, but kernel 2.6.9-55.EL_lustre-1.6.1smp and to raid5
test time
e 4:29:52.33
3 -
4 3:11:00.40
6 -


  • same as above, but to loopback ext2 fs backed by 50g file on Lustre
test time
e 4:26:22.23
3 -
4 3:31:43.03
6 -


  • same as top, but to raid5 and patchless 2.6.22.4 lustre 1.6.1 client kernel. 2.6.9-55.EL_lustre-1.6.1smp on MDS/OSS
test time
e 4:30:39.39
3 1:23:16.40
4 3:11:38.30
6 5:48:39.95


  • same as above, but to loopback ext2 or xfs fs backed by 50g file on Lustre
test time
e 4:25:22.50 (ext2), 4:26:26.62 (xfs)
3 1:23:29.13
4 3:26:22.63 (ext2), 3:19:09.60 (xfs)
6 5:49:26.41


  • same as top, but over gigE and to raid5 and patchless 2.6.22.4 lustre 1.6.1 client kernel. 2.6.9-55.EL_lustre-1.6.1smp on MDS/OSS
test time
e 4:59:14.24
3 -
4 3:41:55.50
6 -


  • same as above, but to loopback ext2 or xfs fs backed by 50g file on Lustre
test time
e 4:45:59.82 (xfs)
3 -
4 3:50:12.86 (xfs)
6 -


  • same as above, (gigE, raid5, 2.6.22.4 client, 2.6.9-55.EL_lustre-1.6.1smp on MDS/OSS) except with a cpu or io intensive job running on the OSS. to native lustre on the client, or to loopback XFS.
test   io           io, loopback XFS   cpu          cpu, loopback XFS
e      5:08:25.90   4:55:59.05         5:03:00.11   4:48:51.24
4      3:48:16.44   3:59:07.66         3:45:29.66   3:53:53.29


  • runs to lustre filesystem (via the lo network interface) on the OSS whilst the above were running on the client
test time
e 4:49:36.42, 4:54:34.17, 4:39:47.22
3 1:33:14.57, 1:32:03.50, 1:28:15.36
4 3:36:49.01, 3:30:16.65
6 6:06:00.62, 6:18:43.35


  • same as top, but patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel
test time
e 4:14:10.67
3 1:23:39.51
4 3:07:49.81
6 5:50:56.24


  • same as above, but MaxReadReq=4096 (ie mthca tune_pci=1)
test time
e 4:15:22.96
3 -
4 3:06:39.27
6 -


  • same as top, but patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel and to loopback ext2, ext3 and XFS filesystems
test time
e 4:12:53.76 (ext2), 4:19:37.79 (ext3), 4:11:29.38 (xfs)
3 -
4 3:16:35.76 (ext2), 3:20:52.34 (ext3), 3:10:11.53 (xfs)
6 -


  • same as above, but MaxReadReq=4096 (ie mthca tune_pci=1) and just the usual ext2
test time
e 4:11:44.77
3 -
4 3:16:26.09
6 -


  • same as top, but 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 modern kernel over gigE
  • same as top, but patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel over gigE and to loopback ext2 filesystem
  • same as top, but to raid5 and 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 modern kernel
    • kaboom
  • same as top, but to raid5 and 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 modern kernel and to loopback ext2, ext3, xfs


  • same as 2nd (ie. xe raid0 ib), but client has a patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel. MDS and OSS still have 2.6.9 lustre kernel
test time
e 4:20:16.52
3 -
4 3:07:38.24
6 -


  • same as above but to ext2, xfs loopback
test time
e 4:13:21.20 (ext2), 4:12:36.92 (xfs)
3 -
4 3:18:30.49 (ext2), 3:12:53.61 (xfs)
6 -


  • same as 2nd top (xe sas raid0 ib), but patchless 2.6.22-ql6-rc1 client kernel. OSS/MDS are still 2.6.9-42.0.10.EL_lustre-1.6.0.1smp
test time
e 4:18:24.25
3 -
4 3:07:32.42
6 -


  • same as above, but to loopback ext2 fs backed by 50g file on Lustre
test time
e 4:13:09.91
3 -
4 3:18:48.82
6 -


  • same as top, but goto1.16 instead of mkl's blas
  • same as top, but gfortran and goto1.16


  • same as top, except 2OSS, 4OST, x1 with 2*8-disk raid0, x2 with 2*7-disks raid0
test time
e 4:30:08.17
3 1:25:10.58
4 3:32:46.33
6 5:57:10.02


  • 3oss 15ost centos5.1, r5, exe rebuilt for centos5.1 with intel 10.1 compilers, mkl9, running on 'x1' 2.6.18-53.1.13.el5-lustre1.6.4.2rjh to Lustre /short
test time
e 4:13:42.93
3 1:24:10.57
4 3:06:05.56
6 5:49:03.54


  • as above, but to jobfs (jobfs files on lustre are 4-way striped)
test time
e 4:26:16.21
3 1:24:15.35
4 3:28:09.87
6 5:50:28.39


  • when doing 11 of the above at once on 11 different nodes, we see
test   time to jobfs        time to native lustre
4      3:51.9 +/- 5.2mins   3:26.8 +/- 4.4mins

so jobfs is still slower even when many are being run at once. so this looks like a completely non-metadata dominated test.

in another run over the full 4 tests (e,3,4,6), jobfs uses 1036510 MB (15:14:03 walltime, 18.9mb/s ave) and native uses 1132872 MB (14:34:13 walltime, 21.6mb/s ave). so jobfs moves ~10% less data due to caching.


  • 3oss 15ost centos5.1, r5, intel compilers (intel 10 runtime, but same exe as one of those above), mkl, running on 'xe' which is 2.8ghz supermicro node, 2.6.18-53.1.4.el5-lustre1.6.4.1rjh, to Lustre
test time
e 3:48:21.98
3 1:07:57.60
4 2:28:56.08
6 3:25:00.61

which is a HUGE speedup for all of them...


  • as above except MDS is now 8-core 2.8GHz node
test time
e 3:48:35.04, 3:49:11.58
3 1:07:58.65, 1:11:13.23
4 2:29:29.68, 2:30:46.26
6 3:25:00.68, 3:28:51.40


  • as above, but to jobfs ext2
test time
e 3:57:20.69
3 1:07:55.16
4 seek error, seek error
6 3:25:19.06


  • as above (jobfs ext2) but with a 2.6.23.14 kernel and patchless 1.6.4.1 lustre
    • so a newer kernel's loop doesn't seem to help... must be a loop problem, or a loop/lustre interaction problem?
test time
e 3:55:08.79
3 1:07:53.62
4 seek error
6 3:26:44.92


  • as above (jobfs ext2) but with ext2 rebuilt so it fits into loop0(!). 2.6.18-53.1.4.el5-lustre1.6.4.1rjh kernel
    • so loop is slower than native lustre. I wonder if the application is actually metadata intensive at all?
test time
e 3:57:30.98
3 1:07:47.47
4 2:50:55.49
6 3:25:20.21


  • as above, but ext3 loop. 2.6.18-53.1.4.el5-lustre1.6.4.2rjh kernel
test time
e 4:06:16.49
3 1:08:08.58
4 2:54:30.53
6 3:25:49.27


  • 3oss 15ost centos5.1, 4coreMDS, exe rebuilt for centos5.1 with PGI 7.1-2 compilers, mkl9, running on a compute node 2.6.18-53.1.13.el5-lustre1.6.4.2rjh to 4-way striped Lustre /short
test time
e 4:11:02.76
3 1:37:59.21
4 3:06:15.12
6 6:29:33.26

so e and 4 (disk bound) are the same as intel, whilst 3 and 6 (compute bound) are 16.6% and 11.2% slower than the intel versions.

  • as above, but with ATLAS libs instead of mkl9
test time
e
3 1:50:13.02
4
6 6:55:01.55

which on 3,6 is 12% and 7% slower than pgi+mkl, and they're 31% and 19% slower than intel+mkl.


  • on ac to /fast - best of 2 runs. jobfs is usually slower. default install - intel8, scsl
test time
e 6:48:51.09
3 1:48:56.07
4 4:15:13.91
6 8:44:14.43


  • on ac to jobfs, rebuilt with intel8 scsl
test time
e 7:13:59.74
3 1:50:18.46
4 4:59:19.80
6 9:11:11.56


  • on ac to jobfs, intel10 scsl
test time
e 6:56:56.47
3 1:50:35.27
4 4:16:07.25
6 crash & >11hours


  • on ac sles10 node to /fast with intel10 mkl9
test time
e crash
3 crash
4 crash
6 12:00:19.77

Raid1 Boot

transfer from a 1-disk system to a raid1 system is pretty easy. partition up the spare disk as you want.

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
mkfs -t ext3 /dev/md0
mkdir -p /mnt/newroot
mount /dev/md0 /mnt/newroot
rsync -avxHP / /mnt/newroot/

edit the old fstab to point to the new partitions (eg. LABEL=/ ==> /dev/md0) and re-make the initrd image so that the raid1 module is included. edit the old grub.conf so that it has root=/dev/md0. reboot and you should come up on the new raid1 root disk. then clone the new disk's partition table back onto the previous root disk with

sfdisk -d /dev/sdb | sfdisk /dev/sda

then

mdadm --add /dev/md0 /dev/sda1

and wait for the rebuild. fix up grub so that there's a boot loader installed on each disk

# grub
grub> root (hd0,0)
grub> setup (hd0)
grub> root (hd1,0)
grub> setup (hd1)
grub> quit

and have default=0/fallback=1 entries in grub.conf pointing at (hd0,0) and (hd1,0)
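
for reference, the initrd and grub.conf pieces above might look something like this on a centos box - the kernel version is a placeholder, and the kernel/initrd paths depend on whether /boot is its own partition:

# rebuild the initrd so the raid1 module is included
mkinitrd --with=raid1 -f /boot/initrd-2.6.9-55.ELsmp.img 2.6.9-55.ELsmp

# grub.conf: boot off disk 0 by default, fall back to disk 1
default=0
fallback=1
title CentOS (raid1, disk 0)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-55.ELsmp ro root=/dev/md0
        initrd /initrd-2.6.9-55.ELsmp.img
title CentOS (raid1, disk 1)
        root (hd1,0)
        kernel /vmlinuz-2.6.9-55.ELsmp ro root=/dev/md0
        initrd /initrd-2.6.9-55.ELsmp.img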

Other

not strictly Xe related, but LD_PRELOAD hacks that try to replace st_blksize with a sane/larger value all fail when a fopen() or open() followed by fread() is done. an fstat() is called from within the fread() (glibc? kernel?) which gets the filesystem's default st_blksize value, and no amount of __xstat/__fxstat/... trickery seems able to override that internal(?) fstat. sigh

Lustre's default st_blksize on a 1m striped fs is 2m. cxfs reports either something huge (a bug from many raid st_blksize's accumulating) or 16k.
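
a quick way to see the st_blksize an application will be handed (GNU stat's %o format is the st_blksize / optimal I/O size field; the path is just an example):

stat -c '%n st_blksize=%o' /mnt/testfs/somefile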

Errors and Hardware problems

See Xe Errors