Xe
From CITA Computing
See Also
Xe Lustre and Xe Errors and Xe Production
TODO
give nodes fixed ip addresses, because at the moment if dhcp fails (eg. the front-end is down for too long) it de-configures the eth interfaces. it probably shouldn't do that, and I don't remember it doing it before, so maybe it's an ifcfg-eth0 setting? can still get in on the IB interfaces or over SoL and fix the problem though, so it's not super-major.
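a static config could look something like the following. this is a sketch only: gen_ifcfg is a made-up helper, and it assumes the gigE scheme from the Networking section (10.0.1.N with a /16 netmask for node xN). whether a dhcp failure really is what de-configures eth0 hasn't been confirmed.

```shell
# print a static ifcfg-eth0 for backend node N (hypothetical helper)
# assumes gigE addresses are 10.0.1.N with a /16 netmask
gen_ifcfg() {
    n=$1
    cat <<EOF
DEVICE=eth0
BOOTPROTO=static
IPADDR=10.0.1.$n
NETMASK=255.255.0.0
ONBOOT=yes
EOF
}

gen_ifcfg 5
```

on CentOS the output would live in /etc/sysconfig/network-scripts/ifcfg-eth0 on each node (or in the per-node part of the image).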
Events
See also Xe Errors
- Nov 28 2007
- put qla2xxx in x17,18,19 and attached each to 3 daisy-chained 73G tp9100's
- using ac1,2's lower ix brick jobfs fibres and patch panels to do this
- Sep 27 2007
- moved x1's FC card to xe. reseated x1's memory. (a day later moved the FC card back to x1)
- lots of iSER. various OFED versions tested with centos5.
- trying single xe OSS with both SAS and FC disks (not spectacular, 3 OSS's (2 FC, 1 SAS) were much snazzier)
- Aug 31 2007 - lustre 1.6.2 and centos5 are the defaults for clients now. head node still centos4.
- iscsi RDMA testing
- lustre 1.6.1 bug reports
- lustre quotas
- Aug 18 2007 - reseated all disks in x2's tp9100 and the 2 or 3 slow disk problem seems to have gone away for now.
- working on getting commercial package built on x86_64 with gcc4 and gfortran
- everything except input parser seems to be working
- ifort version still quicker though it seems. blas libs seem mostly irrelevant to speed
- July 9 2007 - disks f and j (renamed e, i by the driver now) in the tp9100 on x2 are slow - 49MB/s whereas the rest hdparm at 72MB/s
- July 6 2007 - replacement IB switch with full working DDR fabric in the backplane installed
- SATA problems being worked on by SGI now. seems like vibration vs. hitachi.
- Lustre over IB and GigE at the same time to different sets of nodes seems to work ok. quotas still to check. separate MGS and MDS/MDT tested. hitting more LBUG's than usual... got this one with the GigE/IB test:
Jul 4 21:52:30 x1 kernel: LustreError: 12049:0:(filter.c:1575:filter_iobuf_get()) ASSERTION(thread_id < filter->fo_iobuf_count) failed
Jul 4 21:52:30 x1 kernel: LustreError: 12049:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG
- June 20 2007 - x3 motherboard replaced and seems ok
- June 18 2007 - CentOS5 oneSIS image verified working, including kmod-xfs, and optional kernel.org kernel
- also x1's sles10 restored from a dd from x2, and x6's sles10 restored from x4
- sdb's wiped to help with the ongoing SATA testing
- tested re-install of centos4.5 to a sdb
- May 30 2007 - oneSIS image's rc.sysinit fixed up for centos 4.5 - mk-sysimage happy. x19 boots diskless lustre.
- May 28 2007 - turns out our IB switch isn't anywhere near full non-blocking - is about 1/8 bandwidth
- May 24 2007 - CentOS 4.5
- May 8 2007 - swap over network to iSCSI is working. 2.6.21.1 kernel, patched with netswap v12-2.6.21. I'm told that NFS in 2.6.21 isn't good, but netswap to NFS actually seemed to work fine with 2.6.21-rc3-netswap20070319.
- May 2 2007 - CentOS 5 image built. yum upgrade failed at all(?) %post scripts, so did a PXE install using the CD installer vmlinuz and initrd over serial console instead. worked fine. minimal tidying required. x18,x19 running it at the moment. both CentOS 5 kernel or RHEL4-lustre 1.6.0 kernel (without udev) seem to work ok. local exclude list for systemimager updated.
- running mpi bonnie++ and larger bonnie chunk tests to SAS.
- Apr 27 2007 - sdc and sdi SAS disks on xe replaced via hotswap. so the really confusing thing now is that the disks got relabeled by the SCSI drivers somehow, so that the unique ids of the disks are now /dev/sd[c-l,w,y] instead of /dev/sd[c-n]. see the is120 section for more info.
- Apr 24 2007 - a sync; sync; sleep 30 before rmmod'ing Lustre modules seems to have stopped the repeatable crashes with the small file bonnie++ runs. those have been running for about a week now with no crashes on OSSs (x1,x2 FC disks) or MDS (x17 SATA disk or ramdisk) or clients so far.
- Apr 12 2007 - a run with 4 cpus on the MDS hung part way in an umount. did a cleanLustre and killed the hung ssh processes trying to do the umount and it seems to have proceeded ok from that point.
- Apr 10 2007 - temporarily blame the 2 dying SAS disks for the xe lockups and have asked SGI for replacements. going back to lustre small file testing using FC on x1,x2 as storage. setup x1,x2 with MSI on mptbase, and MSI-X on ib_mthca. doing x18 MDS/MDT on ramdisk. lustre survived one small file run... (Apr 11) lustre on x1,x2 survived multiple small file runs. starting a 1 cpu on oss, mds set of tests to see where/if cpu power is required.
- Apr 7 2007 - turned on msi in mptbase for 2.6.20.4 and maybe the lockups have stopped now... ?? sdc and sdi's SMART data says they're dying, so need to get them replaced.
- Apr 4->7 2007 - xe locking up with local SAS raid tests alone.
- Apr 3 2007 - xe is still crashing, so moved FC card and that Lustre OSS to x2. put SAS card back into xe.
- Apr 1 2007 - lots of crashes on xe. moved login node (xe.anu.edu.au) to x7 so can ipmi reset xe when it has problems. xe's ipmi reconfigured to use channel 2 (was setup on channel 1).
- added options mptbase mpt_msi_enable=1 and now each MPT ioc gets its own interrupt instead of ioc0, ioc1, ib_mthca sharing an interrupt. might help.
- testing all fc drives together and separately to see if the crashes are due to a hardware problem there.
- 1st SAS drive (sdc) says it has 6 uncorrectable errors. dd of /dev/zero to the disk didn't fix it.
- tested msi=1 and msi-x=1 with netpipe on x2,x3 and made no difference
- Mar 23 17:00 2007 - is120 sas JBOD attached to xe
- Mar 23 10:20am 2007 - moved tp9100 connected to xe from the xe rack to the actest rack
- xe was deliberately left running a lustre job whilst the tp9100 fibre was unplugged and the unit moved. lots of scsi errors for ~1 minute, then many many more lustre errors occurred. after re-connecting the tp9100, logins to xe were ok, but after a while it hung (while doing a cat /proc/mdstat) although it was still spooling out more lustre errors. the fibres were reversed (50% chance of the right order) and the node stayed hung. the xe node wasn't responding to ctrl-alt-del or sysrq, so it was hard power cycled, then raid0's restarted, lustre mounted and it was fine. the filesystem was intact, and all the files looked ok, but processes on the first node in the MPI lustre job had died leaving the rest orphaned. orphaned MPI processes were killed and the job restarted. as the MPI run was started from xe, it can't reasonably have been expected to survive the reboot of xe intact.
- approx Mar 21 11am 2007 - IB switch set to SDR (enable, config, ddr, set-fabric-to-sdr)
- all 3(6 ports) spine chips and 2(6 ports) line chips report they're at SDR. all HBAs say they're at DDR. (enable, utilities, port-verify)
- Mar 9 14:58:11 2007 - raised the memlock limit in limits.conf to 1g (from 128m) to see if that helps with switch crashes (maybe it did as no switch crashes since?)
Hardware
Nodes
- 19 SGI xe210
dual Xeon 5150 @ 2.66GHz (4 cores/node)
8G ram. DDR2 667 on 1333 FSB
IB DDR (20Gbit transport, max 16Gbit data) on PCIe x8 (20Gbit)
2x e1000 gigE (one connected)
- nodes x1,x16-x19 have the same FC card as the front-end
- nodes x18,x19 have 1/2 the ram and 1/2 the number of cores (only 1 socket filled)
- front-end SGI xe240
dual Xeon 5150 @ 2.66GHz (4 cores/node)
8G ram
IB DDR (20Gbit transport, max 16Gbit data) on PCIe x8 (20Gbit)
2x e1000 gigE
dual fibre channel (2Gbit each, 4Gbit total = 400MB/s) on PCI-X (133/64 = 1Gbyte/s)
dual port SAS controller (12Gbit each) on PCIe x8 (lspci says x8, but m/b manual says x4)
- supermicro nodes (xe, xemds)
dual Xeon 5462 @ 2.8GHz (8 cores/node)
8G ram. DDR2 800 on 1600 FSB
IB DDR (20Gbit transport, max 16Gbit data) on PCIe x8 (20Gbit) built into m/b
2x e1000 gigE
xemds has a PCIe FC card
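the "20Gbit transport, max 16Gbit data" figures above follow from 8b/10b encoding on the link: 8 data bits for every 10 bits on the wire. a small sketch of that arithmetic (data_gbit is a made-up helper, and treating the PCIe slots as gen1 2.5Gbit lanes is an assumption):

```shell
# usable data rate in Gbit/s for a link using 8b/10b encoding
# args: lane count, per-lane signalling rate in Gbit/s
data_gbit() {
    awk -v lanes="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", lanes * rate * 8 / 10 }'
}

data_gbit 4 5.0   # 4x DDR IB
data_gbit 4 2.5   # 4x SDR IB
data_gbit 8 2.5   # PCIe x8, assuming gen1 lanes
```

so a 4x DDR HCA in an x8 slot has about as much PCIe data bandwidth as the IB link can carry, with no headroom to spare.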
Storage
- FC
10 JBOD tp9100's. one attached to xemds, 3 to each of x17-x19
4 FC ports each, but only 2 used, so 4Gbit
16 73GB 15k rpm disks
- SAS (loaner, now returned)
1 infinite storage 120 attached to xe
dual controller setup. x4 SAS host interface, so 12Gbit each
12 300G 10k rpm maxtor disks
Networking
- nodes
gigE 10.0.1.[1-19] /16 x*
BMC/IPMI/SOL 10.0.40.[1-19] /16 x*bmc
IB 192.168.1.[1-19] /16 x*ib
- front-end
external xe.anu.edu.au
BMC/IPMI/SOL not setup
gigE 10.0.10.1 /16 xe
IB 192.168.10.1 /16 xeib
- switches
- smc 48port model 8848m
- voltaire 288port IB DDR model 9288
10.0.20.1 smcswitch
10.0.21.1 /16 voltaireswitch
192.168.2.100 /16 voltaireswitch-ib
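the backend addressing above is regular enough that /etc/hosts entries can be generated with a loop. a minimal sketch, assuming the x*, x*bmc and x*ib names map one-to-one onto the last octet (gen_hosts is a made-up helper, not one of the cluster's install scripts):

```shell
# print /etc/hosts style lines for backend nodes 1..N (default 19)
gen_hosts() {
    last=${1:-19}
    i=1
    while [ "$i" -le "$last" ]; do
        printf '10.0.1.%d\tx%d\n' "$i" "$i"        # gigE
        printf '10.0.40.%d\tx%dbmc\n' "$i" "$i"    # BMC/IPMI/SOL
        printf '192.168.1.%d\tx%dib\n' "$i" "$i"   # IB
        i=$(( i + 1 ))
    done
}

gen_hosts 19
```

the front-end and switch entries are one-offs and are easier to add by hand.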
OS/Install
SLES 10 and scali cluster-something came on the box from SGI, but SLES is buy-ware and upgrading was a pain (licenses, pah!). propack was easily orphaned by any SLES upgrade (broken rpm dependencies) which could easily toast the SGI versions of OFED et al. neither SLES 10 nor scali seemed to have any way to image backends, or update or install packages on them, or indeed to do simple things like push out passwd files or accounts to them. overall it wasn't enjoyable. sooo....
now the cluster triple-boots SLES10, CentOS4.4, and diskless oneSIS.
- SLES10 is on the first disk of each node and is in grub.conf for each node
- invoked if pxeboot fails or redirects to localdisk
- OSCAR 5 install of CentOS 4.4 x86_64 is on the 2nd disk of each node
- invoked via a netboot'd kernel from pxe/tftp
- master copy of the OS image is /var/lib/systemimager/images/oscarimage-centos-4 on the front-end
- push out updates with cpushimage oscarimage-centos-4. excludes for the rsync are in the image in /etc/systemimager/updateclient.local.exclude
- oneSIS 2.0rc10 diskless booting with ro root over NFS and rw PBS spool dirs
- invoked via a netboot'd kernel from pxe/tftp with root on ramdisk/NFS, but could be on ramdisk/Lustre
- root of the OS is the /var/lib/oneSIS/centos-4 dir on the front-end
- 2.6.20-rc4 kernel (installed to help sort out IB and swap-over-network) reveals xe's SATA disks are possibly broken with NCQ (likely no big deal). linux-ide list informed, patches tested etc. symptom is a pile of errors in dmesg. most likely these SATA drives will just be blacklisted
- OSCAR modified for fixes, bugs, and to stop it touching the sles10 partitions
- GigE switch's config modified to allow ganglia's multicast to work
- in the GUI, IGMP Snooping -> IGMP Configuration -> IGMP Status
- on the command line no ip igmp snooping
- need to boot with selinux=0 (just disabled isn't enough) otherwise rpm %pre and %post scriptlets fail for no reason
- could re-enable selinux fully, but OSCAR people claim there are problems in the OSCAR chroot'd image in that case.
add the various hacks and bugs (mostly OSCAR) etc. here at some stage.
oneSIS
the basics are to install the oneSIS (http://www.onesis.org/) rpm, edit /etc/sysimage.conf, run mk-sysimage to build the links and patch rc.sysinit, and then mk-initrd-oneSIS to build an initrd with network and NFS preloaded. run mk-sysimage as many times as you like - it's smart and only does each thing once. running nodes can also be updated with update-node [-r|-d] and diskful hierarchical sub-master (or just diskful) nodes can be updated with sync-node (I haven't tried that yet).
to get CentOS 5 working with oneSIS 2.0rc10, I rsync'd over the current OSCAR CentOS5 image (which was updated via anaconda from the OSCAR installed CentOS4), then installed the oneSIS rpm - the CentOS4 rpm works ok. I made a new distro patch for as5 (http://www.cita.utoronto.ca/~rjh/wiki/Xe/redhat-el-as5.patch) (lives on the master and not in the image) and I needed to alter the master's mk-initrd-oneSIS to use mke2fs -b 4096 ... instead of just mke2fs ..., as new RHEL/CentOS kernels don't like ext2 initrd's built with 1k blocks - cvs oneSIS might create a cpio initrd which would also solve this.
I also added a
mount -t rpc_pipefs sunrpc /var/lib/nfs/rpc_pipefs
to near the end of rc.sysinit. still, NFS gives some FS-Cache warnings... not sure what to do about that as NFS seems to be working ok without it.
IPMI/SOL
Serial-over-LAN and all of normal IPMI (eg. power control) works on all nodes. not being used on the front-end. not yet secured with good passwds/auth levels. also the machine arrived with an auth problem where anyone can get into IPMI (not SOL) without a passwd, and hence reboot and check SEL logs etc.
A cvs version of ipmitool (ipmitool-1.8.9-cvs20070110) is needed to get SOL working. Below is pretty much how SGI setup IPMI on nodes with SLES10. this is how to set it up from scratch with CentOS 4 on the Left hand NIC (channel 1, eth0 in Linux (2.6 kernel)). the Right hand NIC (eth1 in Linux) is channel 2. for the front-end xe node, channel=1.
- in /etc/inittab
# Added for SOL
cons:1235:respawn:/sbin/agetty -h -L 115200 ttyS1
- add ttyS1 to /etc/securetty
- ipmitool line for SOL:
ipmitool -I lanplus -H <bmcIP> -U admin -P <passwd> -o intelplus -v sol activate
- normal IPMI command for reboot etc.
ipmitool -H <bmcIP> -U admin chassis power reset
- SOL setup IP and access:
ipmitool channel info <channel>
ipmitool lan set <channel> ipaddr <someIP>
ipmitool lan set <channel> netmask 255.255.0.0
ipmitool lan set <channel> auth ADMIN MD5,PASSWORD
ipmitool lan set <channel> ipsrc static
ipmitool lan set <channel> arp respond on
ipmitool lan set <channel> arp generate on
ipmitool lan set <channel> arp interval 5
ipmitool lan print <channel>
ipmitool lan set <channel> access on
- setup a user called 'admin'
ipmitool user set name 2 admin
ipmitool user set password 2 <some passwd>
#ipmitool user priv 2 4 <channel>
ipmitool channel setaccess <channel> 2 callin=on ipmi=on link=on privilege=4
ipmitool user list <channel>
ipmitool user enable 2
- SOL setup continues over the lanplus interface:
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol info
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set privilege-level admin
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set non-volatile-bit-rate 115.2
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set volatile-bit-rate serial
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set force-encryption true
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set enabled true
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol set retry-interval 2
ipmitool -I lanplus -H <someIP> -U admin -P <some passwd> -v -o intelplus sol payload enable <channel> 2
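once the BMCs are on the network, the same normal IPMI command can be run against every backend in a loop. a sketch only: bmc_all is a made-up wrapper, it assumes the 10.0.40.[1-19] BMC addresses and the 'admin' user from above, and by default it just prints the commands instead of running them.

```shell
# print (or, with RUN=1, execute) one ipmitool command per backend BMC
# usage: bmc_all <ipmitool subcommand...>
bmc_all() {
    i=1
    while [ "$i" -le 19 ]; do
        cmd="ipmitool -H 10.0.40.$i -U admin $*"
        if [ "${RUN:-0}" = 1 ]; then
            $cmd            # ipmitool will prompt for the passwd
        else
            echo "$cmd"     # dry run
        fi
        i=$(( i + 1 ))
    done
}

bmc_all chassis power status
```

keeping the dry run as the default is deliberate, as a typo'd chassis power reset across 19 nodes would be unpleasant.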
New Node
a new node may need GUI BIOS turned off. also assumed (original) node serial console settings are RTS/CTS, 115k, vt100 (port b, legacy enabled should be ok). also check on the processor settings - previous nodes have hardware prefetch set. dual-cache line loading is probably set on all processors. might be worth playing with these two settings.
InfiniBand
main trauma here was err, no clue how to set it up! turns out there's nothing to setup really. CentOS does it for you with an /etc/init.d/openib start, then ifup ib0.
problems from there were:
- users need to be able to lock pages of memory (http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages). this is set in /etc/security/limits.conf where sizes are in KB.
- seems likely that this required size would scale with # of nodes in a job and the size of messages they're sending(?)
- 16M is too small for 16 node mpirun N HPL (ie. 16 MPI processes with 4 goto/dgemm threads per node), 128M seems ok for this test. 128M is also ok for 16 node mpirun C which is 64 MPI processes, so leave it at 128M for now... 1G was the setting for much of the testing. 1G:
* soft memlock 1048576
* hard memlock 1048576
- this is diverging from IB a bit, but /etc/init.d/pbs_mom also needs limits set for jobs. ie. at the top of the script
ulimit -n 32768
ulimit -l 1048576
ulimit -s unlimited
- SGI/Intel's BIOS doesn't set MaxReadReq for PCIe correctly. wisdom online has it that OFED 1.0 stacks work well, but OFED 1.1 based stacks don't do dodgy hacks for broken BIOS's and so go slowly. the upshot is that IB with newer kernels (strangely not OFED 1.1 based AFAICT ... ???) goes slowly at the minimum MaxReadReq of 128. one workaround is to use stock CentOS 4.4 (RHEL AS4) kernels which set MaxReadReq to 512. 2 other workarounds are setpci, where MaxReadReq can be set to whatever you like, and the tune_pci=1 option to ib_mthca which seems to set it to the max of 4096.
Observed settings from lspci -vvv are:
| Kernel version | OS | MaxReadReq(bytes) |
|---|---|---|
| 2.6.16.21-0.8-smp | sles10 | 4096 |
| 2.6.9-42.0.3.ELsmp | centos4.4 | 512 |
| 2.6.9-42.0.3.EL_lustre.1.5.97smp | centos4.4 | 512 |
| 2.6.19.2 | centos4.4 | 128 |
| 2.6.18-1.2732.4.2.el5.OFED_1_1 | centos4.4 | 128 |
| 2.6.20-rc4 | centos4.4 | 128 |
| 2.6.9-42.0.10.EL_lustre-1.6.0.1smp | centos4.5 | 512 |
| anything + tune_pci=1(*) | centos4.4/4.5 | 4096 |
(*) options ib_mthca tune_pci=1 in /etc/modprobe.conf
- SGI says:
I'm not certain that there is a "correct" setting. At the moment the XE BIOS sets the default parameter to 512 and we have tested HCA's using this value and performance (although not optimal) is acceptable. If you want optimal performance you can use the tune_pci option to set MaxReadReq to 4096. To be honest, we don't know if there are any "side effects" with setting MaxReadReq to 4096. We are still in discussion with Intel over what the right thing for the BIOS to do in this case.
which sounds fair enough.
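for the setpci route it helps to know the encoding: MaxReadReq is bits 14:12 of the PCIe Device Control register, and the size in bytes is 128 << n. the sketch below only does that decode/encode on a register word. the helper names are made up, the example words are illustrative and not read from these cards, and how the word gets read and written back (setpci offset, lspci -xxx) is left out as it depends on the device.

```shell
# decode MaxReadReq in bytes from a Device Control word (hex ok, eg. 0x2810)
mrrs_bytes() {
    devctl=$(( $1 ))
    echo $(( 128 << ((devctl >> 12) & 7) ))
}

# print the Device Control word with MaxReadReq changed to <bytes>
# valid sizes are 128, 256, 512, 1024, 2048, 4096
mrrs_set() {
    devctl=$(( $1 ))
    bytes=$2
    code=0
    while [ $(( 128 << code )) -lt "$bytes" ]; do
        code=$(( code + 1 ))
    done
    printf '0x%04x\n' $(( (devctl & ~0x7000) | (code << 12) ))
}

mrrs_bytes 0x2810        # example word with 512 byte MaxReadReq
mrrs_set 0x2810 4096     # same word with MaxReadReq raised to 4096
```

only bits 14:12 are touched, so the rest of the register (MaxPayload etc.) is carried over unchanged.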
tiered switch internals
with the IB switch in SDR mode, NICs in DDR, with centos4.5, kernel 2.6.9-42.0.10.EL_lustre-1.6.0.1smp I'm seeing fast/slow groupings of nodes. so it's 11.4Gbit (as reported by NPmpi) within one of these groups, and 7.4Gbit between groups:
xe,9-13
2,14-19
where 1,3-8 are currently down or in sles10.
within one of the 2 subsets above, 6 simultaneous pairwise netpipes all give 11+Gbit. across the 2 subsets I get a scattergram of between 3.5 and 6.5 Gbit instead of the expected 7.4Gbit from SDR. so bandwidth reduction is clearly seen even when using 6 pairwise netpipes instead of maximum 12 at once that could be run with this switch configuration if 24 nodes were plugged in and on.
this means that the switch has bottlenecks in it. the below netpipe plots are mostly between x12 and x13 which are (luckily) all in the fast regime.
internally the switch looks like (from enable->utilities->port-verify)
#
# Topology file: generated on Mon Apr 9 19:17:51 2007
#
Printing Chassis 1 (chassis guid 0x0008f104004011a8)
devid=0x5a32
switchguids=0x8f104004011a9 Chassis ISR9288 1 Spine 1 Chip 1
Switch 24 "S-0008f104004011a9" # "ISR9288 Voltaire sFB-12D" smalid 4
[1] "S-0008f104003f1576"[1] width 4X speed 2.5 Gbs
[2] "S-0008f104003f1577"[1] width 4X speed 2.5 Gbs
devid=0x5a32
switchguids=0x8f104004011aa Chassis ISR9288 1 Spine 1 Chip 2
Switch 24 "S-0008f104004011aa" # "ISR9288 Voltaire sFB-12D" smalid 5
[1] "S-0008f104003f1576"[2] width 4X speed 2.5 Gbs
[2] "S-0008f104003f1577"[2] width 4X speed 2.5 Gbs
devid=0x5a32
switchguids=0x8f104004011ab Chassis ISR9288 1 Spine 1 Chip 3
Switch 24 "S-0008f104004011ab" # "ISR9288 Voltaire sFB-12D" smalid 1
[1] "S-0008f104003f1576"[3] width 4X speed 2.5 Gbs
[2] "S-0008f104003f1577"[3] width 4X speed 2.5 Gbs
devid=0x5a34
switchguids=0x8f104003f1576 Chassis ISR9288 1 Line 1 Chip 1
Switch 24 "S-0008f104003f1576" # "ISR9288/ISR9096 Voltaire sLB-24D" smalid 2
[1] "S-0008f104004011a9"[1] width 4X speed 2.5 Gbs
[2] "S-0008f104004011aa"[1] width 4X speed 2.5 Gbs
[3] "S-0008f104004011ab"[1] width 4X speed 2.5 Gbs
[13][ext 6] "H-0008f10403979814"[1] width 4X speed 5.0 Gbs - x14
[14][ext 5] "H-0008f10403979844"[1] width 4X speed 5.0 Gbs - x15
[15][ext 4] "H-0008f10403979e0c"[1] width 4X speed 5.0 Gbs - x16
[16][ext 18] "H-0008f10403979854"[1] width 4X speed 5.0 Gbs - x2
[18][ext 16] "H-0008f104039798fc"[1] width 4X speed 5.0 Gbs - x5
[19][ext 1] "H-0008f10403979818"[1] width 4X speed 5.0 Gbs - x19
[20][ext 2] "H-0008f1040397992c"[1] width 4X speed 5.0 Gbs - x18
[21][ext 3] "H-0008f10403979934"[1] width 4X speed 5.0 Gbs - x17
[22][ext 13] "H-0008f10403979850"[1] width 4X speed 5.0 Gbs - x7
[23][ext 14] "H-0008f10403979858"[1] width 4X speed 5.0 Gbs - x6
[24][ext 15] "H-0008f10403979dc0"[1] width 4X speed 5.0 Gbs - x4
devid=0x5a34
switchguids=0x8f104003f1577 Chassis ISR9288 1 Line 1 Chip 2
Switch 24 "S-0008f104003f1577" # "ISR9288/ISR9096 Voltaire sLB-24D" smalid 3
[1] "S-0008f104004011a9"[2] width 4X speed 2.5 Gbs
[2] "S-0008f104004011aa"[2] width 4X speed 2.5 Gbs
[3] "S-0008f104004011ab"[2] width 4X speed 2.5 Gbs
[13][ext 12] "H-0008f1040397998c"[1] width 4X speed 5.0 Gbs - x8
[14][ext 11] "H-0008f10403979e30"[1] width 4X speed 5.0 Gbs - x9
[15][ext 10] "H-0008f10403979820"[1] width 4X speed 5.0 Gbs - x10
[18][ext 22] "H-0008f10403980ee8"[1] width 4X speed 5.0 Gbs - external SGI 1
[19][ext 7] "H-0008f1040397981c"[1] width 4X speed 5.0 Gbs - x13
[20][ext 8] "H-0008f10403979888"[1] width 4X speed 5.0 Gbs - x12
[21][ext 9] "H-0008f10403979834"[1] width 4X speed 5.0 Gbs - x11
[22][ext 19] "H-0008f1040397982c"[1] width 4X speed 5.0 Gbs - x1
[23][ext 20] "H-0008f1040397e148"[1] width 4X speed 5.0 Gbs - xe
[24][ext 21] "H-0008f10403980d7c"[1] width 4X speed 5.0 Gbs - external SGI 2
which to me looks like the internals are wired in the typical IB tree fashion.
which looks a lot like 12 4x DDR HBA's are going through 3 4x SDR internal uplinks (will be DDR one day)... so this really doesn't look like a full bandwidth 24port switch! more like 1/4 bw at best and 1/8 bw at the moment :-(
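the 1/8 and 1/4 figures are just uplink bandwidth over HBA bandwidth for one line chip. a quick sketch of that sum, using signalling rates of 10Gbit for a 4x SDR link and 20Gbit for 4x DDR (backplane_ratio is a made-up helper):

```shell
# fraction of full bandwidth available through one line chip's uplinks
# args: n_hbas hba_gbit n_uplinks uplink_gbit
backplane_ratio() {
    awk -v h="$1" -v hs="$2" -v u="$3" -v us="$4" \
        'BEGIN { printf "%.3f\n", (u * us) / (h * hs) }'
}

backplane_ratio 12 20 3 10   # 12 DDR HBAs through 3 SDR uplinks
backplane_ratio 12 20 3 20   # same, once the uplinks run at DDR
```

this assumes all 12 external ports of a line chip are populated and talking across the backplane at once, which is the worst case.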
as of july 2007 the new DDR switch internals are:
#
# Topology file: generated on Fri Feb 11 09:55:20 2028
#
Printing Chassis 1 (chassis guid 0x0008f10400401910)
devid=0x5a37
switchguids=0x8f10400401911 Chassis ISR2012 1 Spine 1 Chip 1
Switch 24 "S-0008f10400401911" # "ISR2012 Voltaire sFB-2012" smalid 4
[1] "S-0008f104003f2084"[1] width 4X speed 5.0 Gbs
[2] "S-0008f104003f2085"[1] width 4X speed 5.0 Gbs
devid=0x5a37
switchguids=0x8f10400401912 Chassis ISR2012 1 Spine 1 Chip 2
Switch 24 "S-0008f10400401912" # "ISR2012 Voltaire sFB-2012" smalid 5
[1] "S-0008f104003f2084"[2] width 4X speed 5.0 Gbs
[2] "S-0008f104003f2085"[2] width 4X speed 5.0 Gbs
devid=0x5a37
switchguids=0x8f10400401913 Chassis ISR2012 1 Spine 1 Chip 3
Switch 24 "S-0008f10400401913" # "ISR2012 Voltaire sFB-2012" smalid 1
[1] "S-0008f104003f2084"[3] width 4X speed 5.0 Gbs
[2] "S-0008f104003f2085"[3] width 4X speed 5.0 Gbs
devid=0x5a38
switchguids=0x8f104003f2084 Chassis ISR2012 1 Line 1 Chip 1
Switch 24 "S-0008f104003f2084" # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 2
[1] "S-0008f10400401911"[1] width 4X speed 5.0 Gbs
[2] "S-0008f10400401912"[1] width 4X speed 5.0 Gbs
[3] "S-0008f10400401913"[1] width 4X speed 5.0 Gbs
[13][ext 13] "H-0008f10403979844"[1] width 4X speed 5.0 Gbs
[14][ext 14] "H-0008f10403979814"[1] width 4X speed 5.0 Gbs
[15][ext 15] "H-0008f1040397981c"[1] width 4X speed 5.0 Gbs
[16][ext 16] "H-0008f10403979858"[1] width 4X speed 5.0 Gbs
[17][ext 17] "H-0008f10403979854"[1] width 4X speed 5.0 Gbs
[18][ext 18] "H-0008f1040397982c"[1] width 4X speed 5.0 Gbs
[19][ext 19] "H-0008f10403979820"[1] width 4X speed 5.0 Gbs
[20][ext 20] "H-0008f10403979e30"[1] width 4X speed 5.0 Gbs
[21][ext 21] "H-0008f10403979888"[1] width 4X speed 5.0 Gbs
[24][ext 24] "H-0008f10403980d7c"[1] width 4X speed 5.0 Gbs
devid=0x5a38
switchguids=0x8f104003f2085 Chassis ISR2012 1 Line 1 Chip 2
Switch 24 "S-0008f104003f2085" # "ISR2012/ISR2004 Voltaire sLB-2024" smalid 3
[1] "S-0008f10400401911"[2] width 4X speed 5.0 Gbs
[2] "S-0008f10400401912"[2] width 4X speed 5.0 Gbs
[3] "S-0008f10400401913"[2] width 4X speed 5.0 Gbs
[13][ext 1] "H-0008f10403979818"[1] width 4X speed 5.0 Gbs
[14][ext 2] "H-0008f1040397992c"[1] width 4X speed 5.0 Gbs
[15][ext 3] "H-0008f10403979850"[1] width 4X speed 5.0 Gbs
[16][ext 4] "H-0008f10403979dc0"[1] width 4X speed 5.0 Gbs
[17][ext 5] "H-0008f104039798fc"[1] width 4X speed 5.0 Gbs
[18][ext 6] "H-0008f1040397985c"[1] width 4X speed 5.0 Gbs
[19][ext 7] "H-0008f1040397e148"[1] width 4X speed 5.0 Gbs
[20][ext 8] "H-0008f10403979934"[1] width 4X speed 5.0 Gbs
[21][ext 9] "H-0008f10403979e0c"[1] width 4X speed 5.0 Gbs
[22][ext 10] "H-0008f1040397998c"[1] width 4X speed 5.0 Gbs
[23][ext 11] "H-0008f10403980ee8"[1] width 4X speed 5.0 Gbs
[24][ext 12] "H-0008f10403979834"[1] width 4X speed 5.0 Gbs
devid=0x6274
Hca 1 "H-0008f10403980d7c" # "SGI HCA-2"
[1] "S-0008f104003f2084"[24] # lid 26 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979888" # "Voltaire HCA410Ex-D" - x12
[1] "S-0008f104003f2084"[21] # lid 16 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979e30" # "Voltaire HCA410Ex-D" - x9
[1] "S-0008f104003f2084"[20] # lid 13 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979820" # "Voltaire HCA410Ex-D" - x10
[1] "S-0008f104003f2084"[19] # lid 25 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f1040397982c" # "Voltaire HCA410Ex-D" - x1
[1] "S-0008f104003f2084"[18] # lid 15 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979854" # "Voltaire HCA410Ex-D" - x2
[1] "S-0008f104003f2084"[17] # lid 18 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979858" # "Voltaire HCA410Ex-D" - x6
[1] "S-0008f104003f2084"[16] # lid 12 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f1040397981c" # "Voltaire HCA410Ex-D" - x13
[1] "S-0008f104003f2084"[15] # lid 19 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979814" # "Voltaire HCA410Ex-D" - x14
[1] "S-0008f104003f2084"[14] # lid 20 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979844" # "Voltaire HCA410Ex-D" - x15
[1] "S-0008f104003f2084"[13] # lid 22 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979834" # "Voltaire HCA410Ex-D" - x11
[1] "S-0008f104003f2085"[24] # lid 14 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403980ee8" # "SGI HCA-1"
[1] "S-0008f104003f2085"[23] # lid 27 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f1040397998c" # "Voltaire HCA410Ex-D" - x8
[1] "S-0008f104003f2085"[22] # lid 10 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979e0c" # "Voltaire HCA410Ex-D" - x16
[1] "S-0008f104003f2085"[21] # lid 21 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979934" # "Voltaire HCA410Ex-D" - x17
[1] "S-0008f104003f2085"[20] # lid 17 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f1040397e148" # "Voltaire HCA410Ex-D" - xe
[1] "S-0008f104003f2085"[19] # lid 7 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f1040397985c" # "Voltaire HCA410Ex-D" - x3
[1] "S-0008f104003f2085"[18] # lid 6 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f104039798fc" # "Voltaire HCA410Ex-D" - x5
[1] "S-0008f104003f2085"[17] # lid 11 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979dc0" # "Voltaire HCA410Ex-D" - x4
[1] "S-0008f104003f2085"[16] # lid 9 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979850" # "Voltaire HCA410Ex-D" - x7
[1] "S-0008f104003f2085"[15] # lid 8 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f1040397992c" # "Voltaire HCA410Ex-D" - x18
[1] "S-0008f104003f2085"[14] # lid 23 lmc 0 width 4X speed 5.0 Gbs
Hca 1 "H-0008f10403979818" # "Voltaire HCA410Ex-D" - x19
[1] "S-0008f104003f2085"[13] # lid 24 lmc 0 width 4X speed 5.0 Gbs
inter group (~6.5-9Gbit) and intra group (~11.4Gbit) bunches are clear in the below image of a large message size bunch of 8 simultaneous netpipes (kernel 2.6.9-42.0.10.EL_lustre-1.6.0.1smp, MSI-X, mthca PCI MaxReadReq 512). theoretically there should be a 0.375 (8:3) bandwidth reduction, but > ~0.5 is seen. likely because netpipe's aren't perfectly synchronised at startup or during sending so messages can pass through the limited backplane with better than expected performance. something like a b_eff should see the division more strongly.
node groupings are currently
xe,3-5,7-8,11,16-19 (11)
1-2,6,9-10,12-15 (9)
which is easiest to obtain via the ibroute command eg.
ibroute 4 | grep HCA | grep '001 '
netpipe
- netpipe (http://www.scl.ameslab.gov/netpipe/) curves for some OpenMPI (http://www.open-mpi.org/) and InfiniBand kernel module tunings are shown in the plot below. NPmpi linked with OpenMPI was used to generate them. kernel is 2.6.19.2 except for the old.* curves which are 2.6.9-42.0.3.ELsmp. All have near-as-dammit 4us latency and steps in the IB protocols at 64bytes and again at 8-10kbytes.
- Updated: the last 2 curves now show where single data rate (SDR) has been set on the IB switch (enable, config, ddr, set-fabric-to-sdr) with 2.6.19.2 and a Lustre kernel, and actually show a higher(!!!!) rate than before. Possibly a slightly updated (ofed 1.1) userland is the reason. NICs are still DDR even though the switch is SDR, so perhaps the backplane isn't stressed yet... I should really have run a MPIThrash before and after the DDR change :-/ power cycling the switch should reset to the (busted?) DDR settings though, so still possible.
- Updated: CentOS4.5 curves (it ships with ofed 1.1 userland) have been added. also shown are acrossSDRchips curves which show the internal bandwidth reduction due to 4x chips running at SDR inside the switch. So about 7.4 Gbit instead of 11.5 Gbit. latency is ~3.5us for the 7Gbit links, and ~3.95us for the 11Gbit links, which is a little odd as you'd think that if the slower links were going through more chips then their latency should be higher, but it's the opposite.
- Updated: an IB Verbs (NPibv) curve, and a lustre 2.6.9 kernel ddr netpipe also through the new voltaire switch, were added.
- Updated: ofed1.2 with a 2.6.22.6 kernel seems to be the new clear winner. mostly just leave_pinned and MaxReadReq = 512 or 4096 curves are shown now (older curves here (http://www.cita.utoronto.ca/mediawiki/index.php/Image:IB_netpipe2.png)).
- Curves added for when xe240's IB card is in the low profile (PCIe x4) slot vs. the card in its normal x8 slot.
- OSCAR's OpenMPI needed rebuilding for IB, but that wasn't enough as the APAC mpithrash benchmark killed the OpenMPI. so we're running OpenMPI 1.1.3b3 now.
- update OpenMPI 1.2
b_eff
b_eff v3.5 with the switch in sdr mode, and picking configs that use 4g ram on nodes:
b_eff = 4005.675 MB/s = 250.355 * 16 PEs with 4096 MB/PE on Linux x2.cluster 2.6.9-42.0.3.EL_lustre.1.5.97smp #1 SMP Fri Jan 12 17:22:43 MST 2007 x86_64
b_eff = 6365.884 MB/s = 99.467 * 64 PEs with 1024 MB/PE on Linux x2.cluster 2.6.9-42.0.3.EL_lustre.1.5.97smp #1 SMP Fri Jan 12 17:22:43 MST 2007 x86_64
and again, but with a replacement switch that works with ddr. we still have the switch backplane bottleneck though:
b_eff = 4342.455 MB/s = 271.403 * 16 PEs with 4096 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
b_eff = 7364.680 MB/s = 115.073 * 64 PEs with 1024 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
looking at just 8 nodes on one side of the backplane (so all on the same ddr sub-switch) we get:
b_eff = 2446.235 MB/s = 305.779 * 8 PEs with 4096 MB/PE on Linux x2 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
b_eff = 4093.052 MB/s = 127.908 * 32 PEs with 1024 MB/PE on Linux x2 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
which compares to the regular 8 node result from nodes 1-2,4-9 (half on each sub-switch) of:
b_eff = 2412.054 MB/s = 301.507 * 8 PEs with 4096 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
b_eff = 4147.746 MB/s = 129.617 * 32 PEs with 1024 MB/PE on Linux x1 2.6.22.6 #1 SMP Sat Sep 1 23:31:53 EST 2007 x86_64
so basically no difference there... not sure why not.
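to put a number on "no difference", here is the percentage change of the same-sub-switch result relative to the split result, using the b_eff totals quoted above (pct_diff is a made-up helper):

```shell
# percentage difference of a relative to b
pct_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * (a - b) / b }'
}

pct_diff 2446.235 2412.054   # 8 PEs: same sub-switch vs split
pct_diff 4093.052 4147.746   # 32 PEs: same sub-switch vs split
```

the two cases differ by under 2% and in opposite directions, which could well be within run-to-run noise, although no repeat runs are recorded here to confirm that.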
OFED
the newer OFED 1.1 (infiniband) stack doesn't rebuild via its standard build scripts, but could be worked on more... centos4.4 comes with an OFED 1.0 based stack. sles10 is 1.0 or a bit older (a beta). a recipe for installing new OFED 1.1 kernel modules into an old kernel is eg.
rpm -ivh kernel-lustre-source-2.6.9-42.0.3.EL_lustre.1.5.97.x86_64.rpm
rm -rf /lib/modules/2.6.9-42.0.3.EL_lustre.1.5.97smp/kernel/drivers/infiniband
tar xfz OFED-1.1.tgz
cd OFED-1.1/SOURCES
tar xfz openib-1.1.tgz
cd openib-1.1
./configure --with-core-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod
make
make install_modules
hmmm... although that doesn't really work as module versions are screwed up. instead let's try http://www.mail-archive.com/openib-general@openib.org/msg25052.html which (reading between the lines) means to configure OFED, then link its infiniband/ tree into the kernel sources, and then build the kernel+newIB all in one go. ie.
./configure --kernel-version=2.6.9-42.0.8.ELsmp.rjh.ibInTree --modules-dir=/lib/modules/2.6.9-42.0.8.ELsmp.rjh.ibInTree --kernel-sources=/home/rjh900/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9 --with-core-mod --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod
then link this into the real kernel tree
cd ~/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9/drivers/
mv infiniband infiniband.old
ln -s /home/rjh900/build/OFED-1.1/SOURCES/openib-1.1/drivers/infiniband
# fix the include link by copying it instead
cd infiniband
rm include
cp -rd ../../include .
# link the rdma includes in so that the build can find them
cd ~/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9/include
ln -s ../drivers/infiniband/include/rdma rdma
then build
cd ~/rpmBuild/BUILD/kernel-2.6.9/linux-2.6.9
# configure and make kernel...
... and... that doesn't work either. maybe just the infiniband/Makefile needs work, but also there seem to be kernel 2.6.9 backports needed. could try this again and make sure to start from a reconfigured OFED...?? or maybe the backport patches in here will work ok: https://svn.openfabrics.org/svn/openib/gen2/branches/backport-to-2.6.9/README
Updates to a more recent ofed for rhel (http://people.redhat.com/dledford/Infiniband/openib/) userland are easy to install. it appears that the OFED userland is pretty smart and has a fairly stable API that works with multiple kernel versions. so applications don't need recompiling and sometimes just go faster when a recent enough kernel is used.
Firmware
firmware in the IB cards may be old. current (http://www.mellanox.com/support/firmware_table_IH3Lx.php) is maybe v1.2.000 and installed is 1.0.700 on backends (reported by dmesg) and 1.0.800 on the front-end. unfortunately although the cards are mellanox hardware the firmware seems to be rebadged by voltaire as /sys/class/infiniband/mthca0/board_id (https://wiki.openfabrics.org/tiki-index.php?page=MellanoxHcaFirmware) is VLT0050010001, whatever the hell that means. so I'm not sure of the best way to go about updating that.
- firmware updated to 1.2.000 on 9 Feb 2007
lspci says
InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
on 8x PCIe (20Gbit).
Voltaire invoice says:
HCA 410Ex-D x8 PCI-Exp single 4x DDR port, MemFree
which makes sense. getting new firmware requires telling Voltaire (http://voltaire.com/) your email address and a serial number for which I typed in a node_guid which seemed ok.
according to 2.6.20 kernel, firmware version 1.1.0 is current.
cat /sys/class/infiniband/mthca0/*
| attribute | xe (front-end) | x1 (backend) |
| kernel | any | any |
| board_id | VLT0050010001 | VLT0050010001 |
| fw_ver | 1.0.800 | 1.0.700 |
| hca_type | MT25204 | MT25204 |
| hw_rev | a0 | a0 |
| node_desc | xe HCA-1 | x1 HCA-1 |
| node_guid | 0008:f104:0397:e148 | 0008:f104:0397:982c |
| node_type | 1: CA | 1: CA |
| sys_image_guid | 0008:f104:0397:e14b | 0008:f104:0397:982f |
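a plain cat of the sysfs directory prints the values without their names, so the table above had to be labelled by hand. here is a minimal sketch that prints name/value pairs instead. it assumes the HCA shows up as mthca0 as it does above, and it simply skips subdirectories and anything it can't read.

```python
import os


def read_hca_attrs(sysfs_dir="/sys/class/infiniband/mthca0"):
    """Return {attribute name: value} for the regular files in an HCA sysfs dir."""
    attrs = {}
    for name in sorted(os.listdir(sysfs_dir)):
        path = os.path.join(sysfs_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories such as ports/
        try:
            with open(path) as f:
                attrs[name] = f.read().strip()
        except (IOError, OSError):
            continue  # some sysfs entries are not readable
    return attrs


# only try the real device if it exists on this machine
if os.path.isdir("/sys/class/infiniband/mthca0"):
    for name, value in sorted(read_hca_attrs().items()):
        print("%-16s %s" % (name, value))
```

running this on each node (eg. via cexec) and diffing the fw_ver lines is one way to confirm that a firmware update reached every card.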
- upgrade that puppy with eg.
mstflint -d 08:00.0 -i HCA410Ex-D-25204-1_2_0.img -skip_is burn
and other useful flags are -y to make it do it anyway, as well as query options
mstflint -d 08:00.0 q
mstflint -i HCA410Ex-D-25204-1_2_0.img q
and v to verify the firmware, either running on the card or in an image file. and to save the old firmware:
mstflint -d 08:00.0 ri /tmp/old_firmware.img (?)
Switch
- the switch also seems to have an IP that's the same as a node's (enable->config->interface LOCAL->ip-address-local show is 192.168.1.3, same as x3ib)
- this looks like a secondary address that's not used on ethernet, so probably it can be changed to anything
- I can't ping any nodes from the switch's config interface at all. although backend nodes (not front-end) can ping voltaireswitch-ib ok.
- switch's error log (enable->logs->event-log show) says the switch is in mixed SDR and DDR mode
- see also the crash section below
Gigabit Ethernet
hardware is good old e1000. I installed newest (7.3.20) drivers for the CentOS kernel. ITR=1 is the dynamic setting, but ITR=15k still seems to work best for HPL and works with old and new e1000 drivers. 15k is the current setup.
SELinux
has to be off on the front-end and not just permissive as rpm %post and %pre scripts fail with it in permissive mode. potentially a complete SELinux relabel would fix this, but the backend OS image in the chroot would likely have the same problem, so it's better off for now.
final cluster config might want to have SELinux protected world-facing boxes and SELinux off on the master image node (which could also do pbs, maui, gmetad, ...).
I haven't noticed any speed differences with lustre tests and backends with or without selinux (permissive), so from that point of view it's not annoying.
HPL
does ok...
o.p64.ib.goto1.10.serial.memlock128M.e WR11L2R1 121000 212 8 8 2145.63 5.504e+02
for 64 cores in non-threaded goto mode. a smaller memlock area (128M instead of 1G) for IB seems to help get a better score. that's 550.4 GF, or 8.6 GF/core, or 80.8% of peak.
over GigE doesn't fare nearly as well.
o.p64.threaded.eth.tuned.3 WR11L2R1 121000 200 4 4 2811.69 4.201e+02
so 420.1 GF, or 61.7% of peak. this is MUCH worse than for previous generations of cpus over GigE. perhaps HPL is now bandwidth limited over GigE.
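the percent-of-peak figures can be reproduced with a few lines of arithmetic. the clock speed and flops per cycle below are assumptions (they aren't stated in these notes); 2.66GHz and 4 double precision flops/cycle/core were picked because they reproduce the 80.8% and 61.7% quoted above.

```python
CORES = 64
CLOCK_GHZ = 2.66        # assumption: not stated in these notes
FLOPS_PER_CYCLE = 4     # assumption: double precision flops per cycle per core


def peak_gflops(cores=CORES, clock_ghz=CLOCK_GHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GF for the whole run."""
    return cores * clock_ghz * flops_per_cycle


def hpl_efficiency(gflops, **kwargs):
    """Fraction of theoretical peak achieved by an HPL result given in GF."""
    return gflops / peak_gflops(**kwargs)


# HPL results quoted above: 5.504e+02 GF over IB and 4.201e+02 GF over GigE
for label, gf in (("IB", 550.4), ("GigE", 420.1)):
    print("%-4s %6.1f GF  %4.1f GF/core  %4.1f%% of peak"
          % (label, gf, gf / CORES, 100 * hpl_efficiency(gf)))
```

if the real clock or flops/cycle differ, change the two constants and the percentages follow.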
HPL.dat is some slight variant of
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
2 # of problems sizes (N)
60500 121000 125000 60500 8000 10700
5 # NBs
192 200 212 128 256 64 80 96 128 192 200 212 256 384 512 768 1024 NB
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
8 4 8 Ps
8 16 8 Qs
16.0 threshold
1 # of panel fact
2 1 0 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
1 2 4 8 16 NBMINs (>= 1)
1 # of panels in recursion
2 4 8 16 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 2 3 4 5 0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 2 4 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
FC Raid
- raid hardware is
- tp9100's as explained above
- PCI-X 133/64bit dual port Fibre Channel: LSI Logic / Symbios Logic FC949X Fibre Channel Adapter
- disks [c-j] are on one fibre controller and [k-r] are on the other
- each disk gets a nice consistent 72MB/s read from hdparm -Tt or a bonnie of
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
xe 2G 61248 80 67635 14 30651 5 50743 59 73587 5 389.0 0
xe 8G 60524 79 62030 13 32270 5 51818 60 73746 5 247.1 0
- md raid1 to 2 disks gives
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
xe 2G 58362 81 63077 15 26750 4 49899 59 72982 6 551.6 0
xe 8G 57651 76 59305 14 27104 5 50656 60 72884 6 379.1 0
- x1 has the same FC card into a tp9100 which has 16 146GB 10k rpm disks
- Update: x1's tp9100 is now the same as xe's
- each disk gets a not-as-fast but still consistent 66MB/s read from hdparm -Tt or a bonnie of
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
x1 2G 57336 73 59827 13 30329 5 50518 58 67300 6 357.8 0
x1 8G 55870 71 59301 13 30395 5 50364 60 66941 6 233.4 0
- using ext3 as that's what Lustre uses
- machine booted with mem=512M to make testing quicker and more independent of VM caching
- all the below results are from bonnie++ with 8G files to the xe JBOD with the x-axis being the raid chunk size.
- 16disk is simply all 16 disks /dev/sd[c-r] in a raid0/5
- 2x8disk is disks [c-j] in one raid and [k-r] in another.
- 2x8disk.interlaced is [c-f,k-n] in one raid set and [g-j,o-r] in the other. so the disks are split over the 2 controllers.
- this seems to get the best aggregate throughput, although the output from the testing is noisier
setra doesn't affect write speeds. setting it to 16kb instead of letting linux choose it (depending upon chunk size? device size?) bumps up the read speeds at the small chunk size end of the read plots so that they're at about peak.
- raid0
- raid5
- raid6
- raid0,5,6,10 comparison
- scripts for this:
is120 / SAS
xe now has an is120 (http://www.sgi.com/products/storage/tech/120.html) 12-disk SAS unit attached as well. data on the SGI site is limited, but I think it's actually a version of this gizmo LSI Engenio 1333 (http://www.lsi.com/storage_home/products_home/external_raid/1333_storage_system/index.html) - the Engenio logo on the shipping box is a bit of a hint. 2 SAS cables (which are 4x speed according to the LSI docs) lead to 2 controllers (SGI calls them ESMs) on the unit, so (I think) that means there's 24Gbit to the is120. each disk is 300G maxtor 10k rpm which can read (hdparm) at 80+MB/s. So total disk bandwidth is about 7.2Gbit. 900MB/s is the max transfer speed listed on the 1333 spec sheet.
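the bandwidth bookkeeping above can be sketched in a few lines. the 3Gbit per SAS lane figure is an assumption (first generation SAS, which matches the 24Gbit total guessed at above), and decimal units are used throughout. note that 12 disks at exactly 80MB/s come to about 7.7Gbit, while the 7.2Gbit figure corresponds to the 900MB/s on the 1333 spec sheet.

```python
def mbytes_to_gbit(mb_per_s):
    """Convert MB/s to Gbit/s using decimal units (1 MB = 10**6 bytes)."""
    return mb_per_s * 8 / 1000.0


DISKS = 12
PER_DISK_MB_S = 80        # hdparm read figure quoted above
SPEC_SHEET_MB_S = 900     # max transfer speed from the 1333 spec sheet
SAS_LANE_GBIT = 3.0       # assumption: first generation SAS, 3 Gbit/s per lane
LANES_PER_CABLE = 4
CABLES = 2

disk_total = mbytes_to_gbit(DISKS * PER_DISK_MB_S)
spec_max = mbytes_to_gbit(SPEC_SHEET_MB_S)
link_total = SAS_LANE_GBIT * LANES_PER_CABLE * CABLES

print("sum of disks : %.2f Gbit/s" % disk_total)
print("spec sheet   : %.2f Gbit/s" % spec_max)
print("2 SAS cables : %.2f Gbit/s" % link_total)
```

either way the host links have plenty of headroom over what the disks can deliver, so the cables shouldn't be the bottleneck.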
- Interrupt Sharing
- the SAS controller is a dual port PCIe card which lspci says is running at x8 (20Gbit). SGI's xe240 docs suggest that the low profile PCIe slot it's connected to only does 4x though, so that's a bit confusing. it's sharing an interrupt with the IB card. However even at 4x (10Gbit) that's still > 7.2Gbit of disk bandwidth so it might be ok.
- update: with MSI/MSI-X enabled (module option: options mptbase mpt_msi_enable=1) the cards aren't sharing an interrupt. but it seems likely that they are sharing a bus, so although they both claim to be x8 devices they're probably sharing an x8, so are effectively getting x4 (10Gbit).
- simultaneous 4g dd's to the 12 sas devices run at 1m14s instead of 1m15s when a mpithrash process (xe<->x1) is running at the same time - so no significant SAS slowdown is observed. the mpithrash process normally sees 586MB/s/process, and with dd's it runs as slow as 400MB/s/process. some of this is likely competing for cpu time rather than PCIe bandwidth.
- I don't think any of the xe crashes can be traced back to MSI/MSI-X being on or off. either dodgy SAS disks (should never happen!!! argh!!) or lustre rmmod's seem to be responsible
- however MSI/MSI-X on fc/ib might have helped x1,x2 tp9100 small file Lustre stability, along with the sync; sync; sleep 30 thing.
12 disks * 2 ids per device = 24 SCSI ids. in /dev/sd* terms, c->n are one 'lun' of each disk and o->z are the other (verified via smartd serial numbers of disks). I'm not sure how it's wired up inside.
12 simultaneous 1M chunk 4G dd's to the raw devices (in 3 different patterns - cdefghijklmn / cpergtivkxmz / cdefghuvwxyz) all give the same numbers: 655MB/s writes and 780MB/s reads.
- Update: when the SAS card is put onto the full height x8 PCIe slot, the times for 1M chunk 4G simultaneous dd's drop. once again there's no significant difference when striping across the two ids of each of the drives.
- 2.6.21.3, 512m, cfq (approx same as above?) - 52.3s writes, 51.5s reads, so that's 940MB/s writes and 950MB/s reads
- 2.6.9-55.0.2.EL_lustre.1.6.2smp kernel, 8g ram, deadline - 58.1s for writes and 49.7s for reads, so that's 845MB/s writes and 990MB/s reads
- same kernel, 512m ram, deadline - 70s for writes, 48.7s for reads, so that's 700MB/s writes and 1010MB/s reads.
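the MB/s figures in the list above are just total data moved divided by wall time. a minimal sketch, assuming "4G" means 4096 chunks of 1M per dd (which reproduces the quoted numbers to within rounding):

```python
def aggregate_mb_per_s(n_streams, mb_per_stream, seconds):
    """Aggregate throughput of n_streams simultaneous transfers finishing in `seconds`."""
    return n_streams * mb_per_stream / float(seconds)


N_DISKS = 12
MB_PER_DD = 4096  # assumption: "4G" means 4096 x 1M chunks

# (setup, write seconds, read seconds) as quoted in the list above
runs = [
    ("2.6.21.3, 512m, cfq", 52.3, 51.5),
    ("lustre 1.6.2 kernel, 8g, deadline", 58.1, 49.7),
    ("lustre 1.6.2 kernel, 512m, deadline", 70.0, 48.7),
]

for setup, write_s, read_s in runs:
    print("%-38s writes %4.0f MB/s  reads %4.0f MB/s" % (
        setup,
        aggregate_mb_per_s(N_DISKS, MB_PER_DD, write_s),
        aggregate_mb_per_s(N_DISKS, MB_PER_DD, read_s),
    ))
```

the same function works for the 6-disk and 4-disk cases by changing n_streams.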
separate bonnie++'s to each disk look like
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
xe 4G 72177 91 79360 16 37708 6 62652 66 85721 6 299.3 0
which is pretty damn good. these 10k rpm disks get better bandwidth than both of the smaller capacity 10k and 15k rpm disks in the tp9100's.
- ext3 vs. disks
- there's a bit of a trick though - ext3 REALLY cares about partitioning and/or alignment in a way that other filesystems don't. So performance on an unpartitioned disk eg. mkfs -t ext3 /dev/sdd might be wildly different to that of a partitioned disk mkfs -t ext3 /dev/sdd1 where presumably fdisk has aligned the first sector, or mkfs can otherwise read better alignment info out of the superblock or something... anyway, the upshot is below (xe, 2.6.21.3, 512m ram, cfq io sched): XFS is about the same either way, but ext3 blows on a raw disk.
bonnie++ Version 1.03 (no Per Chr results recorded):
| fs | device | Size | Block write K/sec | %CP | Rewrite K/sec | %CP | Block read K/sec | %CP | Seeks /sec | %CP |
| ext3 | sdd | 4G | 48736 | 10 | 22915 | 4 | 47912 | 3 | 196.7 | 0 |
| ext3 | sdd1 | 4G | 74748 | 16 | 35240 | 6 | 78956 | 5 | 161.6 | 0 |
| xfs | sdd | 4G | 82271 | 11 | 37246 | 6 | 80165 | 6 | 236.2 | 0 |
| xfs | sdd1 | 4G | 81565 | 11 | 37588 | 6 | 80539 | 6 | 232.1 | 0 |
- well, actually it's a bit more complex than that - ext3 gives fairly different results in different dirs as well as partitions. eg. / /dir1 /dir2 might be all 5MB/s different. xfs is more consistent.
in a 12-disk md raid0 it gets 430MB/s writes and 350MB/s reads with setup cdefghuvwxyz.
in a 6-disk md raid0 it gets 390MB/s writes and 230MB/s reads, although some arrangements work better than others... for instance cdefgh/uvwxyz works as advertised, but cdeijk/fghlmn has fghlmn at a reduced 320 write, 200 read.
in a 4-disk md raid0 it gets 290MB/s writes and 180MB/s reads, and again there's some weirdities. in opqr/stuv/wxyz the opqr goes a bit quicker (300, 200) but overall this seems the best config. in cdef/ghij/klmn the klmn is slow (180, 130) which is ultra-odd as these are the same disks as the o->z config. in cdqr/ghuv/klyz the ghuv is slow (200, 140).
if you attach just one of the SAS cables then all 12 disks are visible with one id each (c->n), and 12 simultaneous dd's to the block devices get 655MB/s writes and 750MB/s reads which is very similar to the 2 cable case, so even 1 host cable (12Gbit?) isn't limiting speeds of the unit. or if you upgrade to the 21st century with a 2.6.20.4 kernel then it's 670MB/s writes and 790MB/s reads.
using this new kernel and setting up raid0's and doing dd's to/from the block device - 12-disk writes at 675MB/s and reads at 600MB/s. writing to each device in a 2*6-disk or 3*4-disk setup simultaneously, they each get the same total throughput of 670MB/s writes and 775MB/s reads. when you add a filesystem and a VM into the equation the picture isn't so pretty. separate bonnie's to ext3 get ...???
- Update: with SAS card on PCI x8 2.6.21.3, 512m, cfq - 12-disk raid0 is 730MB/s writes, 867MB/s reads. 2*6-disk and 3*4-disk raid0's get similar writes and 964MB/s reads.
on Fri 27 Apr 2007
- /dev/sdc and /dev/sdi were replaced with new disks via hotswap (ie. just yank 'em out). the confusing thing is that the SCSI system then relabeled the drives. the unique ids of the disks are now /dev/sd[c-l,w,y] instead of /dev/sd[c-n]. it looks like when the c,i disks were removed, the remaining ids all moved down a cog... so before the removal the 12 disks were /dev/sd[c-n] and /dev/sd[o-z] (same disks down the other SAS cable). after the removal the disks were /dev/sd[c-l] and /dev/sd[m-v]. so when the 2 replacement disks were added in they arrived as /dev/sd[w,y] for one disk and /dev/sd[x,z] for the other disk.
- this needs more looking into to make sure I got the above correct and that the behaviour is repeatable. I did reboot the machine before I worked out that the drives had changed letters (and had seen ext3 die from corruption on a bunch of raid tests as the same drive was in 2 parts of the raid set), so I'm not entirely sure that the mapping was the same before and after the reboot, but I think it probably was.
- the disks weren't being used at the time of the disk replacement at all - it may have made a difference if they were already in a raid set... ???
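one way to avoid putting the same physical drive into a raid set twice after a relabel is to group the /dev/sd* names by disk serial number before building anything. a minimal sketch, assuming the device-to-serial mapping has already been collected (eg. from smartd or smartctl output); the serial numbers below are made up for illustration.

```python
from collections import defaultdict


def group_by_serial(serial_of):
    """Map each serial number to the sorted list of device names that report it."""
    groups = defaultdict(list)
    for dev, serial in serial_of.items():
        groups[serial].append(dev)
    return dict((serial, sorted(devs)) for serial, devs in groups.items())


def one_path_per_disk(serial_of):
    """Pick a single device name for each physical disk."""
    return sorted(devs[0] for devs in group_by_serial(serial_of).values())


def duplicated_disks(raid_members, serial_of):
    """Return serials of physical disks that appear more than once in raid_members."""
    seen = defaultdict(int)
    for dev in raid_members:
        seen[serial_of[dev]] += 1
    return sorted(serial for serial, count in seen.items() if count > 1)


# hypothetical serials: each disk is visible once down each SAS cable
serial_of = {"sdc": "SER-A", "sdo": "SER-A", "sdd": "SER-B", "sdp": "SER-B"}

print(one_path_per_disk(serial_of))
print(duplicated_disks(["sdc", "sdd", "sdo"], serial_of))
```

running duplicated_disks on the member list before mdadm --create would have caught the ext3 corruption case described above.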
bonnie++ to local SAS
the 12 disk raid was setup with various raid chunk sizes and in a single 12-disk, 2 6-disk and 3 4-disk configurations. the semaphore sync option was used to bonnie++ to keep the multiple bonnie++'s on xe synchronised between all the phases of the tests. 3G file size was used with default chunk size in bonnie++. xe was booted into 512M and the lustre kernel.
no attempt was made to use both SAS paths in this test. so drive IDs are all down one controller in the simplest possible way - eg. /dev/sd[c-n] (or /dev/sd[c-l,w,y] after the drive replacement)
the 3 plots shown are raid0 with the different disk arrangements, raid5 with the same, and then a comparison of raid0 and raid5 on the 2x6disk setup.
take-home points are
- the optimal raid chunk size for each disk layout varies greatly in the raid5 read tests so needs to be chosen carefully
- 2x6disk config with 64KB raid chunk is probably best. that sees approx raid0 write/read of 670/520 MB/s and raid5 write/read of 470/370 MB/s
- bonnie reads from ext3 on raid are a lot slower than dd from block raid devices (from above, 12-disk write/read of 655/750)
Lustre to SAS
the 12 sas disks are arranged as 1,2 or 3 OSTs. IB, 4cpuMds, nodebug, 64K raid stripe, 1m lustre stripe. all SAS disks accessed down 1 of the 4x SAS cables (ie. /dev/sd[c-l,w,y])
so that's raid0 max write/read of 400/500 MB/s, and raid5 of 300/400 MB/s. not bad I guess for one unit, but disappointingly less than the ~1GB/s you'd think the disks were capable of. the write speeds are what drop very significantly from the local bonnie++ numbers when Lustre is added into the equation - raid0 writes dropped from 670 to 400, and raid5 writes from 470 to 300.
my fancy new MPI version of bonnie++ lets me run one bonnie++ per node (mpirun N in LAM terminology, although I'm actually using OpenMPI) or many (C). a standard non-MPI bonnie++ run with far looser synchronisation is included for comparison.
so x is a logscale now and the N curves go to 16 clients (cores, nodes), whilst the C curves go to 64 cores (16 nodes).
- results are approx the same as the previous parallel by shell bonnie++ runs, so that's good and means I don't have to redo them all...!!!
- write speeds definitely scale more strongly with number of nodes than they do with number of bonnie++ processes, implying that traffic is aggregated on its way from the node to the OSS so the OSS just sees it as lots of i/o from one node... or that IB on a node is a limiting factor and that a node trying to do more i/o can't make the IB run any faster.
- read curves are so flat that you can't really see any scaling trends at all, so all that can be said is that C runs tend to trample on each other a bit and reduce the overall throughput compared to N runs.
SATA Disks
an increasing number of the SATA disks in the xe are slowing down to <45MB/s from a peak of 60+. find the remaining fast-ish sata disks with:
cexec -p hdparm -Tt /dev/sdb | grep buffered
not sure why this is. turning off all ganglia, pbs_mom, etc. might help, but then why is it always the same disks that are slow?
- all SATA drives have firmware versions V44OA96A except for: x4 sda,sdb and x6 sda which have V44OA80A
- read speed seems loosely correlated with the SMART metric Raw_Read_Error_Rate as the 2 slowest disks (sdb on x4,x17):
x4: Timing buffered disk reads: 74 MB in 3.04 seconds = 24.37 MB/sec
x17: Timing buffered disk reads: 86 MB in 3.07 seconds = 28.01 MB/sec
x13: Timing buffered disk reads: 124 MB in 3.05 seconds = 40.65 MB/sec
x3: Timing buffered disk reads: 130 MB in 3.08 seconds = 42.15 MB/sec
...
also have the highest Raw_Read_Error_Rate:
smartctl-a.sdb.x17.cluster: 1 Raw_Read_Error_Rate 0x000b 088 088 016 Pre-fail Always - 3145766
smartctl-a.sdb.x4.cluster: 1 Raw_Read_Error_Rate 0x000b 085 085 016 Pre-fail Always - 2425061
smartctl-a.sda.x1.cluster: 1 Raw_Read_Error_Rate 0x000b 091 091 016 Pre-fail Always - 1835032
smartctl-a.sdb.x7.cluster: 1 Raw_Read_Error_Rate 0x000b 091 091 016 Pre-fail Always - 1310757
...
Hitachi
here's an investigation of the SATA disks in Xe backend nodes - 2x250G Hitachi Deskstar HDT722525DLA380.
hdparm sata
simple hdparm -t read tests give wildly differing and generally slow speeds. rms/stddev ->
sda - 47.3 +/- 8.0 MB/s (min 35.0)
sdb - 48.1 +/- 9.3 MB/s (min 23.2)
for comparison, the speeds of xe's SAS disks, ac's FC disks, and O(100's) of old IDE disks in the lc and mckenzie clusters are very uniform - stddev of +/- 0.7MB/s.
this variability is perhaps the most disturbing thing about the SATA disks in Xe.
as an experiment I put 6 of the Xe SATA disks into the is120 disk tray on Xe. here they were faster but still showed lots of scatter - 58.3 +/- 8.5 MB/s (min 43.1).
bonnie++ sata
a better estimate of disk performance/health comes from bonnie++ where the 6 SATA disks in the is120 look consistent(!) at 31MB/s writes and 67MB/s reads.
in nodes, the Hitachi disks get rms bonnie++ writes of 23MB/s and reads 49MB/s with stddev about 10. so once again, slow and variable.
as another data point I put a cheap Seagate SATA 320g 7200.10 disk into a node and it saw hdparm of 73MB/s and bonnie++ of 61MB/s writes and 70MB/s reads.
XFS is a more consistent and faster filesystem than ext3, but for these hitachi SATA disks it makes no difference to the slow speeds and huge scatter in the speeds. eg. on 5 runs over x9 to x19's sdb3 with 2.6.21.1-netswap-v12-1 kernel, 4g bonnie++ -f:
block writes block reads
xfs rms 28605.3 52097.5
ave 25655.1 50423.5
sigma 12749.1 13200.7
min 7092.0 25562.0
max 47592.0 70910.0
ext3 rms 26659.3 48304.7
ave 23835.3 46417.6
sigma 12032.8 13472.4
min 6375.0 22949.0
max 48096.0 73621.0
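the notes don't say exactly how rms, ave and sigma were computed, so here is a minimal sketch under the usual definitions (rms = root mean square, sigma = sample standard deviation). with those definitions rms^2 = ave^2 + sigma^2 * (n-1)/n, and the table values agree with that to within a fraction of a percent. the sample list is made up for illustration.

```python
import math


def summarize(samples):
    """Return rms, ave, sigma (sample stddev), min and max of a list of numbers."""
    n = len(samples)
    ave = sum(samples) / float(n)
    rms = math.sqrt(sum(x * x for x in samples) / float(n))
    var = sum((x - ave) ** 2 for x in samples) / float(n - 1)  # sample variance
    return {
        "rms": rms,
        "ave": ave,
        "sigma": math.sqrt(var),
        "min": min(samples),
        "max": max(samples),
    }


# hypothetical block write speeds in K/sec, one per bonnie++ run
speeds = [7092.0, 21500.0, 26000.0, 31000.0, 47592.0]
for key, value in sorted(summarize(speeds).items()):
    print("%-5s %10.1f" % (key, value))
```

a large sigma relative to ave is the signature of the problem here: healthy disk populations elsewhere in these notes sit at a stddev of under 1MB/s.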
write cache
SGI raised the issue of write cache being enabled on the Seagate SATA disk and not enabled on the Hitachi SATA disk. for some reason (ahci BIOS?) I can't enable write cache on the Hitachi disks - hdparm -W1, sdparm --set=WCE=1, and blktool wcache on all fail to do anything. However the write cache can be toggled on the 15k rpm 73G FC 8m cache ST373453FC disks in a tp9100, on the 10k rpm 300g SAS 16m cache 8J300S0 disks in the is120, and on the 146g SAS disks in xe.
here's what happens when write cache is on and off for the is120 SAS disks and the tp9100 FC disks. there is no reason to expect that SATA disks would behave any differently. 2.6.21.1-netswap-v12-1 kernel, 512m ram, bonnie++, XFS filesystem.
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
fc wc off 1G 73972 77 77667 8 39988 5 71943 72 76752 6 491.3 0
fc wc off 2G 72006 75 74534 8 38753 5 72946 71 83386 6 368.2 0
fc wc off 8G 74379 86 74023 8 37295 5 80520 79 83292 6 278.5 0
fc wc on 1G 71072 73 77028 8 36517 4 80631 80 75524 4 566.6 0
fc wc on 2G 75664 79 72598 8 36048 5 71894 70 83432 6 386.1 0
fc wc on 8G 70713 74 72784 8 34877 5 78795 81 83407 6 279.6 0
sas wc off 1G 81897 95 85798 11 39161 6 67592 75 83603 5 356.7 0
sas wc off 2G 80564 93 83328 11 36533 6 76342 85 87347 6 267.9 0
sas wc off 8G 78172 91 81222 11 34359 6 83454 92 87377 7 212.0 0
sas wc on 1G 81226 94 88301 11 39967 7 69735 77 85417 6 371.3 0
sas wc on 2G 79805 93 84472 11 35911 6 78912 88 87432 6 275.4 0
sas wc on 8G 79511 92 80507 10 34999 6 83283 92 87377 6 213.3 0
so write cache on or off makes no measurable difference to any large file tests.
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
fc wc off 128 643 5 +++++ +++ 621 3 655 5 +++++ +++ 390 1
fc wc off 256 606 7 51752 69 642 3 573 6 57076 74 329 1
fc wc off 512 559 10 19932 31 602 3 524 9 6407 11 265 1
fc wc on 128 2827 20 +++++ +++ 4434 18 2817 20 +++++ +++ 666 2
fc wc on 256 2140 21 46312 63 2818 13 2043 20 54609 68 492 2
fc wc on 512 1528 21 13662 20 1803 8 1482 21 6348 11 393 2
sas wc off 128 724 5 +++++ +++ 716 3 736 6 +++++ +++ 444 2
sas wc off 256 608 7 17256 28 625 3 653 7 40237 62 297 1
sas wc off 512 630 10 9904 17 621 3 591 10 3785 7 300 1
sas wc on 128 3883 26 +++++ +++ 5004 23 3833 26 +++++ +++ 966 5
sas wc on 256 2514 25 23344 35 3092 16 2272 22 39073 60 601 3
sas wc on 512 1677 26 9811 17 2225 11 1570 24 3495 7 514 3
but write cache makes a big difference for zero-sized files - essentially a metadata load. creates are roughly 3-5x faster with it on, sequential deletes 3-7x, and random deletes 1.5-2x.
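the size of the metadata effect is easier to see as a ratio. a minimal sketch using the sequential create and delete rates (files/sec) copied from the table above, keyed by disk type and the bonnie++ files column:

```python
def speedup(off, on):
    """How many times faster the rate is with write cache on."""
    return on / float(off)


# (disk type, files column) -> (rate with write cache off, rate with it on)
seq_create = {
    ("fc", 128): (643, 2827), ("fc", 256): (606, 2140), ("fc", 512): (559, 1528),
    ("sas", 128): (724, 3883), ("sas", 256): (608, 2514), ("sas", 512): (630, 1677),
}
seq_delete = {
    ("fc", 128): (621, 4434), ("fc", 256): (642, 2818), ("fc", 512): (602, 1803),
    ("sas", 128): (716, 5004), ("sas", 256): (625, 3092), ("sas", 512): (621, 2225),
}

for label, table in (("create", seq_create), ("delete", seq_delete)):
    for key in sorted(table):
        off, on = table[key]
        print("%-6s %-3s %4d  %.1fx" % (label, key[0], key[1], speedup(off, on)))
```

the ratio shrinks as the file count grows, which suggests the cache helps most while the working set of metadata updates is small.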
zcav
zcav shows per-sector speeds. outer (first) part of disk is fastest. here's what the hitachi's look like from a diskless 2.6.21.5-ql4-12123-1 kernel, 512m ram.
all these plots are bezier smoothed in gnuplot. the zcav/zcav_write line is e.g. zcav -b 50 -c 3 -u root /dev/sda
also I hacked the zcav program (part of bonnie++) to destructively do writes to sdb. basically just s/read/write/. here are the results for 4 sas disks and for all the sdb's. one of the sdb's looks good. the others are unhappy.
for comparison here's the SAS disks on xe with a 2.6.21.3 kernel.
Update: after SGI installed rubber grommets around the 5 or 6 tiny fans in the xe210 nodes, the read and write plots now look like:
which is a definite improvement. post grommets, most disks now read ok, and more than one disk is at a decent write speed. it's now clear that around 70MB/s reads and 60MB/s writes is about the best these disks are capable of.
- sda reads - x3, x4 disks are a bit slow at the outer edge of the platters, but possibly acceptable
- sdb reads - x3, x4 disks are too slow. x6 is extremely slow at the inner edge
- sdb writes - the large scatter shows that disk problems are not resolved. x4,x10,x17 are <45MB/s. outer edge of x6 is bad. x3,x8,x9 are <55MB/s. x18 is outstandingly good at 65MB/s with all the others being in the 55-64MB/s range.
Conclusions:
- x6's sdb read shows the same drop-off at the inner edge as in the previous round of tests, so this is most likely a bad disk.
- x18's good write performance also was the same between tests - can't see any reason why this disk is particularly good
- x3,x4's sdb read performance was uniformly bad over the rounds of testing. x3,x4 could be a bad vibration site as sda read performance isn't great either, and x3,x4 are at the low end of the sdb write range. it's unlikely all 4 disks are bad. a simple hdparm -Tt can also see x3,x4's below normal performance.
- overall the cluster's SATA write performance probably isn't good enough
iSCSI
iSCSI was setup with an eye to swap over iSCSI via GigE or IPoIB to files on Lustre.
i/o to iSCSI
xe mounts a lustre filesystem over IB, with x1 serving 2xRaid0 and x17 being MDS in ramdisk. 16 40g non-striped files on Lustre are created. each file was setup as an iSCSI target for one node, and setup as an ext3 filesystem. I ran an mpi bonnie++ to all of these at the same time, and here's what the scalability looks like
- iSCSI over GigE saturates the single gigE connection into xe with 2 clients
- there's almost a factor of 2 improvement in total throughput when using IPoIB rather than gigE
- IPoIB saturates at 4 to 8 clients
- doing i/o to a striped file on Lustre gives universally worse performance (not shown)
below is a bonnie++ test over iSCSI and GigE to 40G un-striped file on Lustre. x18 configured with 512M ram, 2.6.21.1-netswap-v12 kernel.
- Summary: wire speed writes, 60% wire speed reads, and metadata locally cached so very fast. writing to lustre or ramdisk or SAS made no major difference to this.
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
x18 1G 76583 98 130695 26 34794 5 73663 93 111153 7 8435 9
x18 2G 76305 97 115314 24 32200 5 73553 92 110879 7 6041 5
x18 8G 74325 96 105521 21 24097 4 48782 58 60248 4 486.7 0
x18 16G 71492 94 103929 21 24231 4 48800 58 60569 4 127.3 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 78250 96 +++++ +++ 84245 99 81162 100 +++++ +++ 83393 100
256 58099 82 446989 100 32477 44 58480 83 +++++ +++ 19130 27
512 40941 63 34785 25 12333 20 34735 54 27901 22 5903 10
So this seems to be an interesting way to provide 'local' disk or swap to client nodes. The bottleneck is then the number of GigE lines into the 'file server' node.
Swap to iSCSI
Swapping to an iSCSI disk is pretty easy too. patch your client's 2.6.21 kernel with Peter Zijlstra's v12-2.6.21 network deadlock avoidance / netswap patches (http://programming.kicks-ass.net/kernel-patches/vm_deadlock/), and then fire up the iSCSI client as above, and then (assuming /dev/sda is your iSCSI disk to be used for swap) mkswap /dev/sda ; swapon -v /dev/sda and bob's your uncle.
Patches are required as swapping over the network in a low memory situation is inherently risky. These patches reserve emergency memory for the use of the network stack and VM so that swap-related networking traffic will succeed.
Peter doesn't recommend swapping to NFS with kernel 2.6.21 as apparently the NFS in 2.6.21 is a bit busted. however my testing with swap to NFS with 2.6.21-rc3-netswap20070319 seemed ok. iSCSI seems to be lots better performance than NFS though, so not sure why you'd use NFS...
swapping out an application via iSCSI over gigE/IPoIB to an un-striped file on Lustre works at close to 100MB/s.
fg'ing a stopped job that's been kicked entirely out to swap, and letting it swap itself back in goes at a slower ~30-40MB/s.
Fake Fast Local Disk
it seems possible to reuse most of the ideas from the above iSCSI setup to create fast local disk (ie. local metadata rates) that are actually globally available disks with i/o over IB. so this isn't using iSER or SRP... or any of those IB protocols that never seem to actually get implemented.
instead mount the lustre filesystem on a node (via o2ib like usual), create a large file on lustre (striped or not), make a loopback filesystem on that file, mount it on the node, and off you go... ie. create with:
mount -t lustre x17ib@o2ib:/testfs /mnt/testfs
dd if=/dev/zero of=/mnt/testfs/big40 bs=1M count=40000
losetup /dev/loop0 /mnt/testfs/big40
mkfs -t ext3 /dev/loop0
mount /dev/loop0 /mnt/yo0/
chmod 1777 /mnt/yo0/
and then delete with:
umount /mnt/yo0
losetup -d /dev/loop0
umount /mnt/testfs/
once the 'disk' is setup, then it can be mounted and unmounted with:
mount -o loop -t ext3 /mnt/testfs/big40 /mnt/yo0
... use
umount /mnt/yo0
losetup -d /dev/loop0
where the losetup -d is required otherwise loopback devices are never free'd and increment upwards.
bow before the massive metadata rates on a global (well, kinda) filesystem.
- Summary
- all file i/o uses the local page caches just like local disk does
- the metadata rates are the same as for local disk
- big file i/o goes over IB to lustre and gets about 10%-20% better(!) write speeds than native lustre
- reads are about 60% of lustre read speed
- writes backed by striped lustre loopbacks go faster than non-striped, whilst striped reads are slower
2.6.9-42.0.10.EL_lustre-1.6.0.1smp with ext3, non-striped lustre: (machine with 8g ram)
Version 1.03 ------Sequential Create------ --------Random Create--------
x12 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 84228 89 +++++ +++ 86451 89 83134 88 +++++ +++ 85161 89
128 77929 89 +++++ +++ 83985 89 75981 86 +++++ +++ 81148 89
256 71690 87 477482 89 48520 56 70664 85 +++++ +++ 23545 29
512 52607 69 475765 89 11511 14 48361 64 613948 89 5740 8
1024 48351 69 462184 89 5453 8 46999 67 608503 89 3058 5
and there's zero load on OSS and MDS during the smaller runs. lustre thinks only one file is open, no lock contention. let's try non-zero sized files, but still small (16B to 100KB) so mostly metadata dominated: (machine with 8g ram)
Version 1.03 ------Sequential Create------ --------Random Create--------
x12 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16:100000:16/64 1827 17 +++++ +++ 17606 50 2208 21 +++++ +++ 26256 86
128:100000:16/64 1648 16 449 2 695 1 1596 16 86 0 1855 5
on large i/o (machine rebooted with 512m ram):
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
x12 128M 72156 99 +++++ +++ +++++ +++ 91639 99 +++++ +++ +++++ +++
x12 256M 72683 99 485796 99 52329 11 69566 77 76212 6 441.9 0
x12 512M 72887 99 165825 31 62797 12 68619 77 79420 5 175.4 0
x12 1G 78097 98 131002 26 67751 13 63895 81 79575 5 129.9 0
x12 2G 77940 98 134448 27 66699 17 65141 83 81391 5 111.7 0
x12 4G 77697 98 134256 27 67460 14 65195 83 79830 6 99.8 0
x12 8G 77014 97 131714 26 67819 17 65392 84 79036 6 87.5 0
which is a little slow at reads, but not terrible. striping the file across the 2 lustre OSTs we should see performance improve whilst the metadata rates shouldn't be much worse?? here's striped across 2 osts on 1 oss. (machine with 512m ram)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
x12 128M 73316 99 +++++ +++ +++++ +++ 91707 99 +++++ +++ +++++ +++
x12 256M 72631 99 427854 90 51752 9 60759 70 64470 4 538.9 0
x12 512M 72423 99 246045 53 56322 13 57555 67 63500 4 184.4 0
x12 1G 71383 97 235585 46 58490 16 59704 70 65358 5 129.1 0
x12 2G 71894 98 228749 48 56504 15 58417 68 63242 4 109.6 0
x12 4G 78045 98 219182 46 57612 16 59031 80 64611 5 98.8 0
x12 8G 70677 98 219016 45 57535 15 59420 69 64285 4 86.7 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 81564 97 +++++ +++ 84624 95 80576 96 +++++ +++ 85554 99
128 73489 95 +++++ +++ 84933 99 72583 93 +++++ +++ 81601 99
256 65628 92 148107 82 53049 75 62658 88 161134 99 36261 55
512 56676 86 112911 80 13359 20 49657 74 87099 71 10655 17
1024 44838 73 3913 3 3190 5 45386 73 2348 2 979 2
so that's much faster big writes, but slower reads. not sure I understand that. metadata appears slower, but that's just because the node has 512m of ram and not the 8g in the above test.
taking off the loopback layer and doing i/o to lustre instead, metadata rates are exceedingly low in comparison (and have been analysed in other sections), so I won't bother to show them. but big file i/o is: (machine with 512m ram)
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
non-striped:
x12 128M 82909 99 124302 67 90004 99 89981 99 170597 100 +++++ +++
x12 256M 83084 99 142505 78 91169 99 90176 99 173159 99 +++++ +++
x12 512M 82526 99 110843 60 70100 97 71913 97 113839 99 1503 10
x12 1G 74414 99 130111 72 71537 99 70845 99 113637 99 764.6 7
x12 2G 74760 99 121605 68 71180 99 70960 99 112843 99 592.0 6
x12 4G 74436 99 122900 67 70235 98 71656 99 113886 99 517.6 6
x12 8G 73548 99 121984 67 69810 99 70903 99 114317 99 483.9 5
striped:
x12 128M 82165 99 178758 99 87978 99 89914 100 169667 100 +++++ +++
x12 256M 82609 99 177694 99 88034 99 90425 99 170166 100 +++++ +++
x12 512M 81175 98 162853 91 68440 98 72959 99 108344 99 1898 14
x12 1G 74927 99 175545 93 69474 99 70921 99 108706 99 772.2 7
x12 2G 74235 99 180464 97 69759 99 71413 99 110044 99 576.8 6
x12 4G 74590 99 183391 98 68781 98 71322 99 109869 99 500.3 5
x12 8G 73554 99 181112 99 68721 99 70639 99 108354 99 483.7 5
and taking off another layer again the raw disk (actually raw fc md 8-disk raid0) performance is: (machine with 512m ram)
Version 1.03 ------Sequential Create------ --------Random Create--------
x2 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 83309 98 +++++ +++ 79531 91 83088 97 +++++ +++ 86490 100
128 77609 98 +++++ +++ 85271 99 75764 95 +++++ +++ 82647 99
256 60415 83 307286 99 42521 56 62434 87 254417 98 20749 30
512 54449 82 109030 74 12853 20 49841 76 72260 59 7031 12
1024 47961 78 7217 5 6037 11 41890 68 4536 4 3928 8
or with 8g of ram:
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 81876 98 +++++ +++ 86616 100 82924 98 +++++ +++ 84999 100
128 76866 98 +++++ +++ 84830 99 69894 88 +++++ +++ 81624 99
256 68219 92 475896 100 41685 53 65176 88 +++++ +++ 20646 28
512 61219 89 474143 100 12885 19 48597 71 613828 99 6788 11
1024 50352 78 466672 99 7069 11 43718 69 602088 100 4008 8
which is the same metadata rates as going via the loopback device.
large local file i/o (without loopback and lustre) is faster overall, and in particular at the small file end, which suggests that the loopback device is chewing up some ram (or maybe it's using the pagecache twice?) and so the memory left for caching is reduced.
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
x2 128M 73095 98 +++++ +++ +++++ +++ 90316 99 +++++ +++ +++++ +++
x2 256M 72417 99 497033 99 140256 21 91994 99 +++++ +++ +++++ +++
x2 512M 71746 98 287194 63 78749 12 82802 92 209694 16 4843 3
x2 1G 77918 98 196828 41 84407 14 74542 95 176751 12 1117 1
x2 2G 77364 97 197818 39 82139 12 74955 96 178797 13 755.3 1
x2 4G 76654 96 192318 40 80824 13 75171 96 177510 13 594.5 1
x2 8G 77415 98 190082 40 83212 13 75069 97 177902 13 486.4 1
as there's been significant work done on loopback in more recent kernels, it's probably worth trying one out. so over the loopback to lustre again with a patchless 2.6.21 kernel (2.6.21.5-ql4-12123 with 512m ram):
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
non-striped:
x12 128M 79648 98 +++++ +++ +++++ +++ 91266 98 +++++ +++ +++++ +++
x12 256M 80577 99 121651 22 46328 8 79087 86 149902 11 561.9 0
x12 512M 79214 99 125270 25 58903 10 76333 82 169140 9 179.2 0
x12 1G 78711 97 104556 21 63247 11 83092 91 167043 12 133.2 0
x12 2G 71558 89 104218 19 59100 10 86652 95 169350 12 111.8 0
x12 4G 43307 53 21161 4 56703 9 65743 71 169863 9 100.7 0
x12 8G 72987 95 68835 13 57292 10 82749 97 168610 12 89.8 0
striped:
x12 128M 77649 95 195880 37 +++++ +++ 92029 99 +++++ +++ +++++ +++
x12 256M 79538 98 207232 39 96931 16 73016 80 296909 25 600.7 0
x12 512M 79120 98 225059 42 84653 14 84748 94 299972 24 184.4 0
x12 1G 79311 99 43567 9 86541 14 80913 89 280136 20 129.7 0
x12 2G 77980 97 169843 33 80579 15 82554 91 298408 24 109.6 0
x12 4G 78255 97 162985 31 82264 14 86364 93 308963 23 99.1 0
x12 8G 36145 45 130442 25 73408 12 87381 96 288672 22 88.0 0
non-striped metadata:
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 78374 95 +++++ +++ 84748 99 79149 98 +++++ +++ 83973 99
128 37261 48 +++++ +++ 78452 98 70158 90 +++++ +++ 79655 99
256 56365 80 94341 56 17367 25 51654 74 168267 100 12201 19
512 40524 62 24899 17 8750 13 53291 82 14114 11 4729 8
1024 (kaboom - lustre or loopback screwed up - Expected 1048576 files but only got 1048577)
1024 29138 47 3879 3 3392 5 26584 43 3337 3 962 2
or with 8g of ram:
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 76667 97 +++++ +++ 83189 99 77471 95 +++++ +++ 85477 99
128 73380 97 +++++ +++ 80099 99 67046 89 +++++ +++ 79076 100
256 64384 91 501567 99 55722 75 63134 89 +++++ +++ 23581 33
512 61282 92 493828 99 12559 18 60344 89 656636 100 6185 9
1024 54117 85 473407 99 5653 8 53991 84 630141 100 3209 5
striped metadata:
Version 1.03 ------Sequential Create------ --------Random Create--------
x12 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 77812 96 +++++ +++ 84960 100 78260 97 +++++ +++ 87045 100
128 71789 96 +++++ +++ 82914 99 70483 92 +++++ +++ 80925 100
256 50736 73 115843 68 18204 26 48132 69 163490 100 12450 19
512 46153 70 35308 26 16001 24 33450 51 21343 17 8670 14
1024 27756 45 3687 2 4622 7 24760 40 3380 3 1278 2
so that is a LOT better than the 2.6.9 kernel at large striped reads (was 64MB/s, now 280MB/s), but worse at writes (was 220MB/s, now an erratic 130-160MB/s).
and using xfs as the filesystem instead of ext3 we see with a 512m ram, patchless 2.6.21.5-ql4-12123, striped lustre file:
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
x12 128M 86319 100 +++++ +++ +++++ +++ 91543 100 +++++ +++ +++++ +++
x12 256M 80123 92 212284 21 31501 3 69843 93 294587 22 642.0 0
x12 512M 85736 99 180648 18 47326 6 88659 98 288020 21 179.7 0
x12 1G 40173 46 179287 16 27014 3 85657 93 305233 19 128.8 0
x12 2G 10855 12 174763 17 28946 3 91699 99 305303 17 113.2 0
x12 4G 84964 98 160715 16 30050 4 90834 98 305141 19 107.3 0
x12 8G 83637 98 159883 16 27043 3 86765 99 299432 17 101.7 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 896 3 +++++ +++ 12925 42 14595 60 +++++ +++ 6062 27
32 13683 53 +++++ +++ 12498 36 12821 51 +++++ +++ 5623 22
64 8602 39 +++++ +++ 12863 46 7234 40 +++++ +++ 5510 22
128 6471 42 141294 91 11343 48 3846 26 142412 90 4427 20
256 4114 39 21777 32 5110 25 3825 37 39300 61 2293 11
512 3071 46 13083 22 5486 28 2973 45 4545 8 1905 10
1024 2232 57 12239 19 4382 20 2396 61 471 1 232 1
which is better again at reads - now 300MB/s - with consistent writes at 160MB/s. however the small file performance isn't stellar with XFS. the creates are especially slow, but all the metadata operations are about 10x slower than ext3.
ext3 with 512m and an all 2.6.20.15-lustre-1.6.0.1-rc1-ql6 lustre setup. 2 raid0 osts on x1, mds on x17 like usual
Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
x12 128M 78757 98 +++++ +++ +++++ +++ 91907 99 +++++ +++ +++++ +++
x12 256M 66811 99 363591 76 92605 27 70352 80 299893 21 526.1 0
x12 512M 77062 97 225773 48 98259 28 78223 86 294959 23 119.7 0
x12 1G 77779 98 210867 46 87436 22 86318 96 263253 22 103.1 0
x12 2G 77835 98 188822 42 57098 11 86446 95 277524 20 93.1 0
x12 4G 67827 98 47465 10 84544 19 90536 98 293536 16 71.8 0
x12 8G 22369 28 75100 16 78600 19 86958 96 309570 20 70.2 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
32 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
64 75602 93 +++++ +++ 84400 99 76720 96 +++++ +++ 85669 99
128 70588 95 +++++ +++ 78892 99 70081 93 +++++ +++ 79402 99
256 54633 78 166117 99 17705 26 63806 94 171662 99 13109 20
512 44551 70 21755 16 13657 21 36127 56 15343 13 8132 13
1024 28977 47 3201 2 4040 6 31971 52 3292 3 1174 2
2048 18197 31 501 0 301 0 19888 34 337 0 164 0
Later: I can't remember if this last run was to a striped lustre file or not - I presume it was. assuming it was, patched 2.6.20 vs. patchless 2.6.21 is very similar, as you'd expect if metadata wasn't critical.
and if you want to save the time of building scratch disks of various sizes then a bunch of say, 20g loopback files can be pre-created, and then multiple loop devices created on a node, and the loop0, loop1, ... raided together with md or with lvm to make a bigger scratch disk. layers upon layers upon layers...
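the idea of raiding several pre-created loopback files together could look roughly like this with md raid0. it's a sketch under assumptions rather than something from the tests above: `run`, `DRY_RUN` and `scratch_create` are hypothetical names, the backing file names and /dev/md0 are placeholders, and it assumes loop0, loop1, ... are free. in the default dry-run mode it only prints the commands.

```shell
#!/bin/sh
# sketch: stripe several pre-made loopback files on lustre into one scratch disk.
# DRY_RUN, run() and scratch_create() are hypothetical names; the file names
# and /dev/md0 are placeholders for illustration.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi; }

scratch_create() {   # scratch_create MD_DEVICE BACKING_FILE...
    md=$1; shift
    n=0; devs=""
    for f in "$@"; do
        # attach each backing file to the next loop device (assumed free)
        run losetup "/dev/loop$n" "$f" || return 1
        devs="$devs /dev/loop$n"
        n=$((n + 1))
    done
    # raid0 across all the loop devices; $devs is deliberately unquoted
    run mdadm --create "$md" --level=0 --raid-devices="$n" $devs
}

scratch_create /dev/md0 /mnt/testfs/loop20g.0 /mnt/testfs/loop20g.1 /mnt/testfs/loop20g.2
```

after that /dev/md0 would get a mkfs and a mount just like a single loop device does.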
Blas and Lapack
-O3 compiler options used throughout.
blas
g77 + mkl 9.0 fails 5 of the netlib blas1 tests. goto 1.15 passes them all. netlib's reference blas implementation (3.1.1) passes them all when compiled with gfortran or g77, but fails 10 tests when compiled with ifort 10.0.017
lapack
gfortran compiled lapack gets 269 failures in 11 tests, which is the best of them.
ifort compiled lapack gets 529 failures over 25 tests, mostly in dgd,sgd,dgg,sgg.
gfortran/g77 with goto blas 1.16 (includes a fixed core2 dgemv) gets 271 failures in 13 tests, so that's close to minimum.
g77 + mkl blas hangs whilst running lapack linear tests and failed ~20% of the linear tests up to this point. if testing programs are killed to allow the next test to run, then eventually (after ~10 kills) it gets 103796 failures in 131 tests, but hasn't run all the tests.
ifort + mkl gets 535 failures across tests, so a lot better than g77 + mkl.
Apps
- tests 3,6 are nearly totally independent of i/o speeds, so they are left out of some tests
- kernel.org kernels and 2.6.9-55.EL_lustre-1.6.1smp have their default MaxReadReq of 128 unless otherwise stated. kernel 2.6.9-42.0.10.EL_lustre-1.6.0.1smp has a default of 512
| oss/ost | mds/mgs | client |
|---|---|---|
| 1 | 11 | 13 |
| 2 | 10 | 12 |
| xe | 16 | 14 |
- 1OSS, 2OST, ramdisk MDS, 16-disk fc raid0 over IB, intel10 mkl9, all kernels 2.6.9-42.0.10.EL_lustre-1.6.0.1smp, 1M lustre stripe
| test | time |
|---|---|
| e | 4:31:07.31 (x1), 4:33:48.52 (x2) |
| 3 | 1:24:57.73 (x1), 1:24:53.04 (x2) |
| 4 | 3:27:49.92 (x1), 3:29:54.27 (x2) |
| 6 | 5:55:19.48 (x1), 5:55:16.59 (x2) |
- same but to 1OSS, 2OST 12-disk sas on xe (dd write @ 237 MB/s)
| test | time |
|---|---|
| e | 4:33:47.51, 4:39:27.98 |
| 3 | 1:24:47.28, 1:24:58.46 |
| 4 | 3:28:53.57, 3:28:57.37 |
| 6 | 5:54:29.81, 5:54:49.06 |
- same as top but to loopback ext2 fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:33:59.11 |
| 3 | 1:24:10.35 |
| 4 | 4:02:27.64 |
| 6 | 5:52:16.88 |
- to xe sas raid0 (same as 2nd top) but over gigE
| test | time |
|---|---|
| e | 5:16:37.91, 5:10:40.35 |
| 3 | 1:25:42.83 |
| 4 | 4:02:21.51, 4:02:17.60 |
| 6 | - |
- same as above (xe sas raid0 gigE) but to loopback ext2 fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:43:32.08, 4:43:32.41 |
| 3 | 1:24:11.87 |
| 4 | 4:14:11.53, 4:12:47.80 |
| 6 | 5:52:05.80 |
- same as above but with /proc/sys/vm/dirty_expire_centisecs set to 180000 (30mins) (default is 3000 = 30s)
| test | time |
|---|---|
| e | 4:40:24.84 |
| 3 | - |
| 4 | 4:13:14.29 |
| 6 | - |
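for reference, the dirty_expire_centisecs value used in the run above is just the delay in hundredths of a second. a small sketch of the conversion and the matching sysctl command follows; `mins_to_centisecs` is a hypothetical helper, and the command is only printed here, not run.

```shell
#!/bin/sh
# sketch: turn a writeback delay in minutes into centiseconds for
# vm.dirty_expire_centisecs. mins_to_centisecs is a hypothetical helper name.
mins_to_centisecs() {
    echo $(( $1 * 60 * 100 ))
}

# print (not run) the command for the 30 minute setting
echo "sysctl -w vm.dirty_expire_centisecs=$(mins_to_centisecs 30)"
```

writing the same number into /proc/sys/vm/dirty_expire_centisecs as root is equivalent.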
- same as 2nd top (xe) but to raid5
- same as top, but to raid5 (dd write @ 162 MB/s)
| test | time |
|---|---|
| e | 4:53:54.43, 4:54:18.03 |
| 3 | 1:25:10.82, 1:25:04.53 |
| 4 | 3:36:27.48, 3:35:36.38 |
| 6 | 5:55:48.45, 5:55:22.82 |
- same as above (ie. fc x1 raid5), but to loopback ext2 fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:42:43.06 |
| 3 | 1:24:06.09 |
| 4 | 4:10:00.02 |
| 6 | 5:51:46.69 |
- same as top, but to raid5 over GigE
| test | time |
|---|---|
| e | 5:28:37.98 |
| 3 | 1:25:33.21 |
| 4 | 4:08:19.37 |
| 6 | 5:57:55.91 |
- same as top, but to 50g ext2 filesystem on loopback to raid5 over GigE
| test | time |
|---|---|
| e | 4:54:47.09 |
| 3 | 1:24:07.01 |
| 4 | 4:22:03.44 |
| 6 | 5:51:30.15 |
- same as top, but kernel 2.6.9-55.EL_lustre-1.6.1smp and to raid5 over gigE
| test | time |
|---|---|
| e | 5:01:50.45 |
| 3 | - |
| 4 | 3:45:56.26 |
| 6 | - |
- same as above, but to loopback ext2 fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:38:21.30 |
| 3 | - |
| 4 | 3:56:46.48 |
| 6 | - |
- same as top, but kernel 2.6.9-55.EL_lustre-1.6.1smp and to raid5
| test | time |
|---|---|
| e | 4:29:52.33 |
| 3 | - |
| 4 | 3:11:00.40 |
| 6 | - |
- same as above, but to loopback ext2 fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:26:22.23 |
| 3 | - |
| 4 | 3:31:43.03 |
| 6 | - |
- same as top, but to raid5 and patchless 2.6.22.4 lustre 1.6.1 client kernel. 2.6.9-55.EL_lustre-1.6.1smp on MDS/OSS
| test | time |
|---|---|
| e | 4:30:39.39 |
| 3 | 1:23:16.40 |
| 4 | 3:11:38.30 |
| 6 | 5:48:39.95 |
- same as above, but to loopback ext2 or xfs fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:25:22.50 (ext2), 4:26:26.62 (xfs) |
| 3 | 1:23:29.13 |
| 4 | 3:26:22.63 (ext2), 3:19:09.60 (xfs) |
| 6 | 5:49:26.41 |
- same as top, but over gigE and to raid5 and patchless 2.6.22.4 lustre 1.6.1 client kernel. 2.6.9-55.EL_lustre-1.6.1smp on MDS/OSS
| test | time |
|---|---|
| e | 4:59:14.24 |
| 3 | - |
| 4 | 3:41:55.50 |
| 6 | - |
- same as above, but to loopback ext2 or xfs fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:45:59.82 (xfs) |
| 3 | - |
| 4 | 3:50:12.86 (xfs) |
| 6 | - |
- same as above, (gigE, raid5, 2.6.22.4 client, 2.6.9-55.EL_lustre-1.6.1smp on MDS/OSS) except with a cpu or io intensive job running on the OSS. to native lustre on the client, or to loopback XFS.
| test | io | io, loopback XFS | cpu | cpu, loopback XFS |
|---|---|---|---|---|
| e | 5:08:25.90 | 4:55:59.05 | 5:03:00.11 | 4:48:51.24 |
| 4 | 3:48:16.44 | 3:59:07.66 | 3:45:29.66 | 3:53:53.29 |
- runs to lustre filesystem (via the lo network interface) on the OSS whilst the above were running on the client
| test | time |
|---|---|
| e | 4:49:36.42, 4:54:34.17, 4:39:47.22 |
| 3 | 1:33:14.57, 1:32:03.50, 1:28:15.36 |
| 4 | 3:36:49.01, 3:30:16.65 |
| 6 | 6:06:00.62, 6:18:43.35 |
- same as top, but patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel
| test | time |
|---|---|
| e | 4:14:10.67 |
| 3 | 1:23:39.51 |
| 4 | 3:07:49.81 |
| 6 | 5:50:56.24 |
- same as above, but MaxReadReq=4096 (ie mthca tune_pci=1)
| test | time |
|---|---|
| e | 4:15:22.96 |
| 3 | - |
| 4 | 3:06:39.27 |
| 6 | - |
- same as top, but patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel and to loopback ext2, ext3 and XFS filesystems
| test | time |
|---|---|
| e | 4:12:53.76 (ext2), 4:19:37.79 (ext3), 4:11:29.38 (xfs) |
| 3 | - |
| 4 | 3:16:35.76 (ext2), 3:20:52.34 (ext3), 3:10:11.53 (xfs) |
| 6 | - |
- same as above, but MaxReadReq=4096 (ie mthca tune_pci=1) and just the usual ext2
| test | time |
|---|---|
| e | 4:11:44.77 |
| 3 | - |
| 4 | 3:16:26.09 |
| 6 | - |
- same as top, but 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 modern kernel over gigE
- same as top, but patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel over gigE and to loopback ext2 filesystem
- same as top, but to raid5 and 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 modern kernel
- kaboom
- same as top, but to raid5 and 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 modern kernel and to loopback ext2, ext3, xfs
- same as 2nd (ie. xe raid0 ib), but client has a patched 2.6.22.1-lustre-1.6.0.1-rc1-ql6-bug12470 kernel. MDS and OSS still have 2.6.9 lustre kernel
| test | time |
|---|---|
| e | 4:20:16.52 |
| 3 | - |
| 4 | 3:07:38.24 |
| 6 | - |
- same as above but to ext2, xfs loopback
| test | time |
|---|---|
| e | 4:13:21.20 (ext2), 4:12:36.92 (xfs) |
| 3 | - |
| 4 | 3:18:30.49 (ext2), 3:12:53.61 (xfs) |
| 6 | - |
- same as 2nd top (xe sas raid0 ib), but patchless 2.6.22-ql6-rc1 client kernel. OSS/MDS are still 2.6.9-42.0.10.EL_lustre-1.6.0.1smp
| test | time |
|---|---|
| e | 4:18:24.25 |
| 3 | - |
| 4 | 3:07:32.42 |
| 6 | - |
- same as above, but to loopback ext2 fs backed by 50g file on Lustre
| test | time |
|---|---|
| e | 4:13:09.91 |
| 3 | - |
| 4 | 3:18:48.82 |
| 6 | - |
- same as top, but goto1.16 instead of mkl's blas
- same as top, but gfortran and goto1.16
- same as top, except 2OSS, 4OST, x1 with 2*8-disk raid0, x2 with 2*7-disks raid0
| test | time |
|---|---|
| e | 4:30:08.17 |
| 3 | 1:25:10.58 |
| 4 | 3:32:46.33 |
| 6 | 5:57:10.02 |
- 3oss 15ost centos5.1, r5, exe rebuilt for centos5.1 with intel 10.1 compilers, mkl9, running on 'x1' 2.6.18-53.1.13.el5-lustre1.6.4.2rjh to Lustre /short
| test | time |
|---|---|
| e | 4:13:42.93 |
| 3 | 1:24:10.57 |
| 4 | 3:06:05.56 |
| 6 | 5:49:03.54 |
- as above, but to jobfs (jobfs files on lustre are 4-way striped)
| test | time |
|---|---|
| e | 4:26:16.21 |
| 3 | 1:24:15.35 |
| 4 | 3:28:09.87 |
| 6 | 5:50:28.39 |
- when doing 11 of the above at once on 11 different nodes, we see
| test | time to jobfs | time to native lustre |
|---|---|---|
| 4 | 3:51.9 +/- 5.2mins | 3:26.8 +/- 4.4mins |
so jobfs is still slower even when many are being run at once. so this looks like a completely non-metadata dominated test.
in another run over the full 4 tests (e,3,4,6), jobfs uses 1036510 MB (15:14:03 walltime, 18.9MB/s ave) and native uses 1132872 MB (14:34:13 walltime, 21.6MB/s ave). so jobfs uses ~10% less bandwidth due to caching.
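the averages quoted above are just MB moved divided by walltime in seconds. a small sketch to redo the arithmetic, where `avg_mbs` is a hypothetical helper name:

```shell
#!/bin/sh
# sketch: average bandwidth = MB moved / walltime in seconds.
# avg_mbs is a hypothetical helper; walltime is given as H:MM:SS.
avg_mbs() {   # avg_mbs MB H:MM:SS
    echo "$2" | awk -F: -v mb="$1" '{ printf "%.1f\n", mb / ($1*3600 + $2*60 + $3) }'
}

avg_mbs 1036510 15:14:03   # jobfs
avg_mbs 1132872 14:34:13   # native lustre

# data moved by jobfs as a percentage of the native run
awk 'BEGIN { printf "%.1f\n", 1036510 / 1132872 * 100 }'
```

the last line shows how far below the native total the jobfs run comes out, which is where the roughly 10% figure comes from.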
- 3oss 15ost centos5.1, r5, intel compilers (intel 10 runtime, but same exe as one of those above), mkl, running on 'xe' which is 2.8ghz supermicro node, 2.6.18-53.1.4.el5-lustre1.6.4.1rjh, to Lustre
| test | time |
|---|---|
| e | 3:48:21.98 |
| 3 | 1:07:57.60 |
| 4 | 2:28:56.08 |
| 6 | 3:25:00.61 |
which is a HUGE speedup for all of them...
- as above except MDS is now 8-core 2.8GHz node
| test | time |
|---|---|
| e | 3:48:35.04, 3:49:11.58 |
| 3 | 1:07:58.65, 1:11:13.23 |
| 4 | 2:29:29.68, 2:30:46.26 |
| 6 | 3:25:00.68, 3:28:51.40 |
- as above, but to jobfs ext2
| test | time |
|---|---|
| e | 3:57:20.69 |
| 3 | 1:07:55.16 |
| 4 | seek error, seek error |
| 6 | 3:25:19.06 |
- as above (jobfs ext2) but with a 2.6.23.14 kernel and patchless 1.6.4.1 lustre
- so a newer kernel's loop doesn't seem to help... must be a loop problem, or a loop/lustre interaction problem?
| test | time |
|---|---|
| e | 3:55:08.79 |
| 3 | 1:07:53.62 |
| 4 | seek error |
| 6 | 3:26:44.92 |
- as above (jobfs ext2) but with ext2 rebuilt so it fits into loop0(!). 2.6.18-53.1.4.el5-lustre1.6.4.1rjh kernel
- so loop is slower than native lustre. I wonder if the application is actually metadata intensive at all?
| test | time |
|---|---|
| e | 3:57:30.98 |
| 3 | 1:07:47.47 |
| 4 | 2:50:55.49 |
| 6 | 3:25:20.21 |
- as above, but ext3 loop. 2.6.18-53.1.4.el5-lustre1.6.4.2rjh kernel
| test | time |
|---|---|
| e | 4:06:16.49 |
| 3 | 1:08:08.58 |
| 4 | 2:54:30.53 |
| 6 | 3:25:49.27 |
- 3oss 15ost centos5.1, 4coreMDS, exe rebuilt for centos5.1 with PGI 7.1-2 compilers, mkl9, running on a compute node 2.6.18-53.1.13.el5-lustre1.6.4.2rjh to 4-way striped Lustre /short
| test | time |
|---|---|
| e | 4:11:02.76 |
| 3 | 1:37:59.21 |
| 4 | 3:06:15.12 |
| 6 | 6:29:33.26 |
so e, 4 (disk bound) are the same as intel, 3 and 6 (compute bound) are 16.6% and 11.2% slower than the intel versions.
- as above, but with ATLAS libs instead of mkl9
| test | time |
|---|---|
| e | |
| 3 | 1:50:13.02 |
| 4 | |
| 6 | 6:55:01.55 |
which on 3,6 is 12% and 7% slower than pgi+mkl, and they're 31% and 19% slower than intel+mkl.
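the slowdown percentages here come from converting the h:mm:ss.ss timings to seconds and taking the ratio. a sketch of that arithmetic, using the ATLAS, pgi+mkl and intel 10.1 + mkl timings from the tables above (`to_secs` and `pct_slower` are hypothetical helper names):

```shell
#!/bin/sh
# sketch: percentage slowdown between two h:mm:ss.ss timings.
# to_secs and pct_slower are hypothetical helper names.
to_secs() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1*3600 + $2*60 + $3 }'
}

pct_slower() {   # pct_slower SLOWER_TIME FASTER_TIME
    awk -v a="$(to_secs "$1")" -v b="$(to_secs "$2")" \
        'BEGIN { printf "%.0f\n", (a / b - 1) * 100 }'
}

pct_slower 1:50:13.02 1:37:59.21   # test 3, ATLAS vs pgi+mkl
pct_slower 6:55:01.55 6:29:33.26   # test 6, ATLAS vs pgi+mkl
pct_slower 1:50:13.02 1:24:10.57   # test 3, ATLAS vs intel+mkl
pct_slower 6:55:01.55 5:49:03.54   # test 6, ATLAS vs intel+mkl
```

which intel run gets used as the baseline shifts the answer by a few tenths of a percent, so treat the rounded numbers as approximate.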
- on ac to /fast - best of 2 runs. jobfs is usually slower. default install - intel8, scsl
| test | time |
|---|---|
| e | 6:48:51.09 |
| 3 | 1:48:56.07 |
| 4 | 4:15:13.91 |
| 6 | 8:44:14.43 |
- on ac to jobfs, rebuilt with intel8 scsl
| test | time |
|---|---|
| e | 7:13:59.74 |
| 3 | 1:50:18.46 |
| 4 | 4:59:19.80 |
| 6 | 9:11:11.56 |
- on ac to jobfs, intel10 scsl
| test | time |
|---|---|
| e | 6:56:56.47 |
| 3 | 1:50:35.27 |
| 4 | 4:16:07.25 |
| 6 | crash & >11hours |
- on ac sles10 node to /fast with intel10 mkl9
| test | time |
|---|---|
| e | crash |
| 3 | crash |
| 4 | crash |
| 6 | 12:00:19.77 |
Raid1 Boot
transfer from a 1-disk system to a raid1 system is pretty easy. partition up the spare disk as you want.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
mkfs -t ext3 /dev/md0
mount /dev/md0 /mnt/newroot
rsync -avxHP / /mnt/newroot/
edit the old fstab to point to the new partitions (eg. LABEL=/ ==> /dev/md0) and re-make the initrd image so that the raid1 module is included. edit the old grub.conf so that it has root=/dev/md0. reboot and you should be in the new raid1 root disk now. clone the new disk's partition table back to the previous root disk with
sfdisk -d /dev/sdb | sfdisk /dev/sda
then
mdadm --add /dev/md0 /dev/sda1
and wait for rebuilds. fix up grub so that there's a copy installed on each disk's boot sector
# grub
grub> root (hd0,0)
grub> setup (hd0)
grub> root (hd1,0)
grub> setup (hd1)
grub> quit
and have a default=0/fallback=1 entry in grub.conf with the two boot entries pointing to (hd0,0) and (hd1,0) respectively
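for the "wait for rebuilds" step, /proc/mdstat shows a recovery (or resync) progress line while the mirror is rebuilding. a minimal sketch of a check for that follows; `md_rebuilding` is a hypothetical helper name, and it takes the file as an optional argument so it can be tried on a saved copy of mdstat.

```shell
#!/bin/sh
# sketch: succeed if any md array still shows a recovery/resync line.
# md_rebuilding is a hypothetical helper; the default file is /proc/mdstat.
md_rebuilding() {
    grep -Eq 'recovery|resync' "${1:-/proc/mdstat}"
}

# wait until the rebuild has finished before touching grub:
#   while md_rebuilding; do sleep 60; done
```

note that a delayed or pending resync also matches, so the loop keeps waiting in that case too.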
Other
not strictly Xe related, but LD_PRELOAD hacks to try to replace st_blksize with a sane/larger value all fail when an fopen() or open() and then fread() is done. an fstat() is called from within the fread() (glibc? kernel?) which gets the filesystem default st_blksize value, and no amount of __xstat/__fxstat/... trickery seems to be able to override this internal(?) fstat. sigh
Lustre's default st_blksize on a 1m striped fs is 2m. cxfs has something huge (bug via many many raid st_blksize's accumulating) or 16k.
Errors and Hardware problems
See Xe Errors
