Fake Fast Local Disk

Global filesystems typically have low metadata rates, which makes them non-ideal for applications with high IOPS requirements.

Local direct-attached scratch disks (RAID0 or RAID1/RAID10) on nodes are the typical solution to this problem. However, on general-purpose machines it can be expensive to put these in every node, and when you do, the disks are often underutilised because not all jobs need local disk.

Instead, we suggest using a loopback filesystem backed by a file on a global filesystem (e.g. Lustre) to handle metadata-intensive and cacheable workloads. This gives the appearance, and most of the benefits, of a local disk while still being centrally managed. It might even work out to be cheaper too.


HowTo

Mount a Lustre filesystem on a node, create a large file on Lustre (striped or not), make a loopback filesystem on that file, mount it on the node, and off you go, i.e. create with:

mount -t lustre x17ib@o2ib:/testfs /mnt/testfs
dd if=/dev/zero of=/mnt/testfs/big40 bs=1M count=40000
losetup /dev/loop0 /mnt/testfs/big40
mkfs -t ext3 /dev/loop0
mkdir -p /mnt/yo0
mount /dev/loop0 /mnt/yo0/
chmod 1777 /mnt/yo0/

and then tear it down with:

umount /mnt/yo0
losetup -d /dev/loop0
umount /mnt/testfs/

Once the block device is set up, it can be mounted and unmounted with:

mount -o loop -t ext3 /mnt/testfs/big40 /mnt/yo0
# ... use the filesystem ...
umount /mnt/yo0
losetup -d /dev/loop0

where the losetup -d is required, otherwise loopback devices are never freed and the loop device numbers just keep incrementing upwards.
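
Since mount -o loop grabs whichever loop device is free, the attached device is not necessarily /dev/loop0. A minimal sketch of a safer teardown, using losetup -j (util-linux) to look up the loop device actually associated with the backing file (paths are just the examples from above):

LOOPDEV=$(losetup -j /mnt/testfs/big40 | cut -d: -f1)
umount /mnt/yo0
losetup -d $LOOPDEV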

Bow before the massive metadata rates on a global (well, kinda) filesystem.

Swings and Roundabouts

Plusses

  • all file I/O uses the local page cache, just like a local disk, so caching and metadata rates are the same as for a local disk
  • Lustre sees just one open file, so there is no lock contention and no load on the metadata server
  • big-file I/O eventually gets flushed to Lustre by the VM subsystem, and performance is similar to native Lustre
  • local scratch disks can now be centrally managed and provisioned on demand, hopefully making better use of available resources
  • adding more or larger fake local disks is just a matter of creating a file on Lustre and mounting it (see the resize sketch after this list for the "larger" case)
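
As a rough sketch of the "larger" case, assuming the fake disk has already been unmounted and detached as above: extend the backing file in place, then grow the ext3 filesystem offline with resize2fs. The sizes below just continue the 40000 x 1MB example.

# append another ~10GB of zeros after the existing 40000 x 1MB blocks
dd if=/dev/zero of=/mnt/testfs/big40 bs=1M seek=40000 count=10000 conv=notrunc
losetup /dev/loop0 /mnt/testfs/big40
# resize2fs wants a clean fsck before an offline resize
e2fsck -f /dev/loop0
resize2fs /dev/loop0
mount /dev/loop0 /mnt/yo0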

Minuses

  • the loopback kernel thread uses some CPU time, which might impact other jobs on the same node if the thread runs on a different core to the job using the loopback fs (see the pinning sketch after this list)
  • depending on kernel version(?), data can sit in the page cache twice (once from user I/O, once from the loopback device), which reduces the available caching compared to a true local disk. the duplication doesn't seem to impact performance at all.
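
On the CPU-time point, a sketch of pinning the loopback thread next to the job, assuming a kernel where loop I/O is serviced by a kernel thread named loop0 (newer kernels handle loop I/O differently, so the thread name and behaviour may vary); core 3 below is just a placeholder for whichever core the job is on:

# see which core the loop0 kernel thread last ran on (psr column)
ps -eo pid,psr,comm | grep -w loop0
# pin it to the same core as the job using the loopback fs
taskset -pc 3 $(pgrep -x loop0)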

Performance

I'm going to re-run these. For now, look in the clutter of Xe#iSCSI.

Alternatives

Using SRP hardware, iSCSI/iSER hardware, or software iSCSI targets is an alternative network-accessible block device design. These all seem more complex and less flexible than the loopback-on-Lustre approach.

  • with a hardware solution, the LUNs on the box would typically be fixed in size and number, making the storage pool inflexible compared to the loopback-on-Lustre solution. Performance and reliability would likely be good, but cost would likely be high. This option is pretty much the same as a pool of FC-attached RAIDs.
  • with a software implementation, the iSCSI/iSER tgt server (or nbd) machine could fail. As this machine might be handling local disks for several or many client nodes, it would be desirable to remove it as a single point of failure.
    • if the backing store is unshared, then a client could use md RAID1 over 2 tgt server machines, which halves the write bandwidth (see the sketch after this list)
    • if the backing store is shared (e.g. a file on Lustre, or 2 LUNs of the same FC-attached device), then clients using dm-multipath would suffice to handle a tgt server machine failure
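
As a sketch of the md RAID1 variant, assuming each tgt server exports an unshared LUN that shows up on the client as /dev/sdb and /dev/sdc (placeholder device names):

# mirror the two network block devices so that either tgt server can fail
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mkfs -t ext3 /dev/md0
mount /dev/md0 /mnt/yo0

Every write now goes over the network to both tgt servers, which is where the halved write bandwidth comes from.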