Sunnyvale

The Sunnyvale HPC Cluster

About

[Image: Sunnyvale nodes]

Sunnyvale is the computer cluster for the Canadian Institute for Theoretical Astrophysics. It is composed of a heterogeneous collection of ~120 nodes of various speeds and sizes acquired over time.

Sunnyvale currently runs CentOS 6.9 x86_64 GNU/Linux. Our server/compute node image runs from a diskless, read-only NFS root partition using the oneSIS system imager v2.0rc10 release plus patches (see their mail archive). This entire software stack is free and open source, with no licensing restrictions.

Naming of the Sunnyvale cluster was inspired by Trailer Park Boys, a popular Canadian comedy TV and film series.

User Documentation

Accounts

CITA and Sunnyvale accounts are the same: if you have a general CITA account, it works on the Sunnyvale cluster and vice versa, and the two always share the same password, so there is no need to request a second account if you already have one. If you do not have a CITA account at all, email requests(at)cita.utoronto.ca to request an account for use on Sunnyvale.

Passwords

To change your password for your CITA/Sunnyvale account, use the /cita/local/bin/passwd command on the machine trinity. It takes about an hour for password changes to take full effect across all CITA machines and Sunnyvale.

Logging In

Once you are on the CITA network, log into Sunnyvale via the 2 login nodes, bubbles or ricky. These nodes also serve as the cluster development nodes and can be used for compiling codes, running interactive debugging jobs, submitting batch jobs to the queueing system, and transferring data to/from cluster storage. These are the only cluster nodes which you should directly access.

Shell Environment

Home disks on Sunnyvale are not the same as the CITA workstation network, so you will have to copy any required files to your Sunnyvale /home. For convenience, you can access this filesystem on your desktops at the mount point /cita/d/sunny-home.
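
For example, from a CITA desktop you could copy a script into your Sunnyvale home through this mount point (a sketch with a hypothetical file name; it assumes your Sunnyvale home directory appears under the mount as /cita/d/sunny-home/$USER):

 cp ~/my_startup_notes.txt /cita/d/sunny-home/$USER/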

The default shell is /bin/bash, but you may also use /bin/tcsh. If you would like your shell changed, please email requests. We recommend using /bin/bash.

Users also need to write their own shell configuration file. Do not copy the shell configuration from your /cita/home directory, since it may contain incompatible configuration statements. For a bare-bones configuration that raises the shell limits and uses the default Intel compiler, put the following into your ~/.bashrc file:

 # raise the stack size limit (often needed for large codes)
 ulimit -s unlimited
 # load the default Intel compiler
 module load intel

Modules and System Software

All system-installed software is accessible through the module command. A brief summary of how to use modules on Sunnyvale is given on the Modules page. Sunnyvale loads two default modules -- torque and maui -- but all other modules must be loaded by the user, either manually, within their batch scripts, or by default via their shell start-up definitions in ~/.tcshrc or ~/.bashrc.
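
For example, a typical session might look like the following (the intel module name is taken from the compiler examples elsewhere on this page):

 module avail              # list all available modules
 module list               # show what is currently loaded (torque and maui by default)
 module load intel         # load the default Intel compiler
 module purge              # unload everything -- remember to reload maui and torque afterwards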

Any requests for further software installation should be sent to requests@cita.

Storage

Disk

The Lustre network filesystems scratch-lustre, raid-cita, and raid-project are all accessible on Sunnyvale nodes. The mount prefix for these systems is /mnt, e.g., /mnt/scratch-lustre. Batch jobs should run exclusively on the Lustre filesystems.

/home

Home directories are currently capped with a 10GB quota. This is a good place to keep your code, essential scripts and documentation, since the space is backed up daily.

Users are asked not to use /home for job output. Since it is served from the cluster head node, having large parallel jobs write to it simultaneously will cause performance issues for the entire cluster. Short and infrequent accesses are fine, for example reading a small configuration file or submitting your job from it; heavy use, whether reading/writing one large file or very many small ones rapidly, is problematic. If we see that you are overloading the home filesystem and causing slowdowns for the entire cluster and other users, we may have to kill your job, so please see below for the appropriate heavy-duty filesystems.

/mnt/raid-cita, raid-project

These are fast parallel filesystems that are available to Sunnyvale nodes. All users have at least 100 GB available on /mnt/raid-cita, except for SRAs, post-docs and staff, who have 4 TB by default. If you have exceptional data requirements, please contact requests@cita and your quota can be increased. The raid-project space is overseen by CITA faculty, and permission from individual researchers is necessary to get a directory on these filesystems.

/mnt/scratch-lustre

This scratch space is a large, fast parallel filesystem meant for temporary intermediate and final output of cluster jobs. It is for short-term storage and postprocessing only; archival data should be migrated to raid-cita or raid-project, since scratch is not backed up in any capacity and data stored on it is not guaranteed against loss (see the Disk Space page for more information).

External storage

The desktop home spaces home-1, home-2, and home-3 are only visible on the head nodes.

/tmp

DO NOT use /tmp to store any files! This filesystem is part of the scratch disk on each node and has to have enough space free for various system daemons to function properly. Any files found in /tmp will be deleted without warning.

/mnt/node_scratch

Local scratch disks are available on all nodes at

/mnt/node_scratch/<username>

There is generally about 100GB available on each node.

If you are I/O limited by the cluster file systems then you may be better off using these disks. Use sparingly, as this can severely degrade network performance.
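
A minimal batch-script sketch of this pattern, with hypothetical file names, which stages input to the node-local disk, runs there, and copies the results back to the submission directory (remember to clean up afterwards, as described under File Deletion below):

 SCRATCH=/mnt/node_scratch/$USER
 mkdir -p $SCRATCH
 cp $PBS_O_WORKDIR/input.dat $SCRATCH/          # stage input onto the local disk
 cd $SCRATCH
 $PBS_O_WORKDIR/your_code input.dat             # run against the local copy
 cp output.dat $PBS_O_WORKDIR/                  # copy results back to a cluster filesystem
 rm -f $SCRATCH/input.dat $SCRATCH/output.dat   # tidy up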

Accessing

They can be accessed on other nodes as /mnt/scratch/<node>, where <node> is the host name of the node, e.g. tpb78. NOTE: autofs access to nodes is currently turned OFF. It can be reactivated if necessary - jjd - 2014-03-18

File Deletion

You MUST tidy up after your jobs: rm all of the node_scratch files that you create at the end of your job.

These scratch disks are purged daily of all files older than one week.

WARNING: if the disks are chronically filled, a more severe method will be used: the system will automatically remove all of the user's files at the end of the job.

PBS

Sunnyvale uses the TORQUE batch system and the Maui job scheduler to process user jobs. This system allows users to submit a batch script (a shell script with a special header) that describes their computing work, which is then held in a queue and run non-interactively once sufficient resources are available.

This software is also used at SciNet. Users should be able to port their batch scripts from SciNet to Sunnyvale and vice versa with little modification.

PBS Limits

Jobs are limited to 48 hours in duration.

If you have special job requirements, please contact requests@cita.utoronto.ca to apply for a reservation on the cluster; reservations are granted sparingly and usually only in exceptional circumstances. We prefer that users make do with the cluster as-is. If you have a job that needs to run for more than 48 hours, add 'checkpoints' to its execution: run a 48-hour job, let it conclude, then submit a new job that continues where the last one left off, using intermediate files to carry over progress. This policy keeps things fair and lets every user get a chance to run their jobs.
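
One common way to implement this, sketched below under the assumption that your code can restart from its own checkpoint files (the script name run_segment.pbs is hypothetical), is to chain jobs with a TORQUE dependency so the next 48-hour segment only starts after the previous one finishes successfully:

 FIRST=$(qsub run_segment.pbs)                     # job id of the first 48-hour segment
 qsub -W depend=afterok:$FIRST run_segment.pbs     # second segment waits for the first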

Submitting Jobs

Jobs are submitted to the queue by executing qsub $BATCH_SCRIPT_NAME.

We recommend that you use the /bin/bash shell in your scripts. Here is an example of a batch script for single core jobs:

Single-core/single-node batch job script

----
#!/bin/bash -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=48:00:00
#PBS -r n
#PBS -j oe
#PBS -q workq

# NOTE: workq only allows nodes=1 and ppn<=8
# Load your modules
# if you use module purge, make sure to load the maui and torque modules
# e.g.,
# module purge
# module load maui torque

module load gcc/7.3.0 python/2.7.14

# go to your working directory containing the batch script, code and data
cd $PBS_O_WORKDIR
./your_code
----

NOTE: We highly recommend that users use the workq queue for serial jobs.

Here is an example of a multi-node batch job script for a parallel MPI code:

----
#!/bin/bash -l
#PBS -l nodes=8:ppn=16
#PBS -l walltime=48:00:00
#PBS -r n
#PBS -j oe
#PBS -q hpq

# NOTE: for hpq and sandyq, ppn<=16
#       for greenq ppn<=32
#       multiples of 8 are preferred for ppn, e.g., ppn=8,16, or 32
# If you are using nodes=1 and ppn<=8, please use the workq instead

# go to your working directory
cd $PBS_O_WORKDIR

# Load your modules
# If you use module purge, make sure to load the maui and torque modules
# e.g.,
# module purge
# module load maui torque

# Intel compiler and MPI 
module load intel intelmpi

# or for OpenMPI
# module load gcc/7.3.0 openmpi/3.0.0-gcc-7.3.0

mpirun ./your_mpi_code your_arguments

# NOTE: mpirun without arguments will use all available cores, e.g., 8x16=128 in this example.
# To limit the number of processes per host (say, for hybrid OpenMP/MPI codes), use one of the
# mpirun load-balancing options, e.g., for OpenMPI:
#
# mpirun -np 8 -map-by node:SPAN ./your_mpi_code

# Infiniband options
# The infiniband network provides special network protocols to speed up your MPI code.
# These are turned on by default for the intelmpi module and are the only option.
# For OpenMPI, use these flags to speed up MPI communication over infiniband:
#
# mpirun --mca btl self,sm,openib ./your_mpi_code
----

NOTE: the first line will initialize the script with your ~/.bash_profile and ~/.bashrc login scripts. If your login shell is /bin/tcsh (~/.tcshrc login script), then start your scripts with this line instead:

#!/bin/csh
 

The first 5 lines are mandatory and must appear in the script. Set the number of nodes in nodes and processors per node in ppn. These are the available queues: workq, hpq, sandyq and greenq. Note: if you are running single core serial jobs all you need to do is set ppn=1 above and the batch system will find a node for your code to run on.


There is a 48 hour limit for jobs running on Sunnyvale, but the walltime option can (and should) be set to less than this as the scheduler will advance your job to the top of the queue faster if it doesn't require as much wallclock time. Efficient job scheduling improves job throughput.

Sunnyvale currently has a heterogeneous mix of nodes with 8, 12, 16 and 32 cores per node. Users will generally want to specify the appropriate value of ppn for a given queue if they would like nodes fully allocated to their own job. Submitting with ppn<8 allows multiple jobs to be assigned to the same node, which could result in the memory being oversubscribed and both jobs failing and/or crashing the node. If each process in your run requires more than the available memory per core, the trick is to use either the -loadbalance flag (openmpi v1.6 or less) or -map-by node:SPAN (openmpi v1.8 or later). For example, say your code can only use 3 cores per node because of memory requirements and you want to submit an 8-node/24-core job. You would then specify ppn=8 in the batch script but submit the MPI job with:

mpirun -np 24 -loadbalance your_prog ...

or

mpirun -np 24 -map-by node:SPAN your_prog ...

The following PBS environment variables are useful for tailoring batch scripts:

Variables that contain information about job submission:

 PBS_O_HOST            The host machine on which the qsub command was run.
 PBS_O_LOGNAME         The login name on the machine on which the qsub was run.
 PBS_O_HOME            The home directory from which the qsub was run.
 PBS_O_WORKDIR         The working directory from which the qsub was run.

Variables that relate to the environment where the job is executing:

 PBS_ENVIRONMENT       This is set to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs.
 PBS_O_QUEUE           The original queue to which the job was submitted.
 PBS_JOBID             The identifier that PBS assigns to the job.
 PBS_JOBNAME           The name of the job.
 PBS_NODEFILE          The file containing the list of nodes assigned to a parallel job.
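
For example, a batch script might use these to record where and on how many cores it is running (a minimal sketch):

 cd $PBS_O_WORKDIR                        # start in the directory the job was submitted from
 NCORES=$(wc -l < $PBS_NODEFILE)          # the node file has one line per assigned core
 echo "Job $PBS_JOBID ($PBS_JOBNAME) is running on $NCORES cores"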

Queues

Sunnyvale has 4 batch queues.

workq: General purpose queue which can use any available node. This queue is explicitly intended for single-core (serial) or single-node general purpose jobs. The queue allows a maximum of 1 node, and the upper limit on the number of processes per node is ppn=8. Jobs are guaranteed a minimum of 2 GB/core.

hpq: 49x 12-core/32GB nodes. Even though these nodes have 12 cores, the upper limit on the number of processes per node is slightly higher, ppn=16. The nodes are hyperthreaded, allowing some slight efficiency gain when oversubscribing the 12 cores with 16 processes.

sandyq: 35x 16-core/64GB nodes. The upper limit on the number of processes per node is ppn=16.

greenq: 10x 32-core/128GB nodes. These are the fastest nodes. The upper limit on the number of processes per node is ppn=32.

Remember to add a line such as #PBS -q hpq to your batch script so your job is submitted to the right queue.

The properties of the new batch queues are summarized in this table, rank ordered by the relative processor speed (this is a rough approximation).

Sunnyvale Queues
 Purpose   Queue   Processor       Nodes  Cores/Node  Memory (GB)  PBS "ppn"  PBS "nodes" max  Node priority (descending)  Rel. speed
 Parallel  greenq  Xeon Gold 6130     10          32          128         32               10  green only                        4.0X
 Parallel  sandyq  Xeon E5-2650       35          16           64         16               45  sandy, green                      2.0X
 Parallel  hpq     Xeon E5-2620       49          12           32         16               94  hp, sandy, green                  1.5X
 Serial    workq   All               118        8-32          >24          8                1  hp, old, sandy, green        1.0X-4.0X

Hyperthreading

All of the new nodes have hyperthreading turned on, so the number of apparent logical processors is twice the number of physical cores; e.g., the greenq nodes have 32 real cores and 64 logical processors. Feel free to overload your MPI jobs with twice as many processes as cores. You may realize an overall speed gain of 10-20% depending on your application.

For example, your mpi job invocation for a script with nodes=8:ppn=16 (nominally 128 cores) with the intelmpi module might be:

mpirun -np 256 ./your_mpi_code your_arglist...

For OpenMPI, you might have to specify that you are using more processes than cores (oversubscribing) with this command:

mpirun -np 256 --map-by node:OVERSUBSCRIBE ./your_mpi_code ...

Submitting Interactive Jobs

To submit an interactive job for 2 nodes, execute the following on a login node, e.g.:

 qsub -q QUEUE_NAME -I -X -l nodes=2:ppn=8

where QUEUE_NAME is the name of one of our batch queues, e.g., workq, hpq, sandyq or greenq

This will open a shell on the first node of your job. To see which nodes you have been assigned:

 cat $PBS_NODEFILE

One can load modules, start a process, debug, etc. interactively. The job will terminate once you log out of the node that you logged in on, or once the walltime limit has been exceeded.

NOTE: Please use this method sparingly! Most jobs should be submitted non-interactively. Sunnyvale is a shared resource and tying up nodes is unfair to the other users; if users leave nodes idle while holding them interactively, this policy will change.


Submitting Jobs on Reserved Nodes

If you want a particular job to run on a reservation made for you then make sure to include the following flag:

-W x=FLAGS:ADVRES:{$RESERVATION NAME}
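
For example, assuming a reservation named myreservation (hypothetical), the flag can be added to the qsub command line or as a #PBS -W directive in the batch script:

 qsub -W x=FLAGS:ADVRES:myreservation my_batch_script.pbs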

Monitoring Jobs

bobMon System Monitor

The web-based graphical System_Monitor is now available for Sunnyvale. If you are local, point your browser to bobMon.

If you are logging in remotely you will have to create an ssh tunnel to barb to forward the http port. To do so, when you ssh into gw.cita.utoronto.ca add the following port-forwarding option to your ssh command:

 ssh gw.cita.utoronto.ca -L 10101:sunnyvale:80

This connection has to stay active, so don't close it. You should then be able to view bobMon in your local browser through the tunnel (i.e., at http://localhost:10101).

Command Line Monitoring

Upon submission users can monitor their jobs in the queue using qstat -a. The showbf command will show resource availability, and the showstart command will suggest when a particular job will begin. Additional queue and total cluster node usage can be displayed with showq.
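
For example (the job id is hypothetical):

 qstat -a            # list all jobs and their states
 showq               # overall queue and cluster usage
 showbf              # resources currently available (backfill windows)
 showstart 123456    # estimated start time for job 123456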

Job Identifier and Deletion

Each job submitted to the queue is assigned a $PBS_JOBID. This is the number on the left-hand side of the qstat output. If you would like to stop a running job or delete it from the queue, run qdel $PBS_JOBID.
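
For example, to remove a job with a (hypothetical) identifier of 123456:

 qdel 123456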

PBS Output Files

After a job has finished PBS will place an output file and an error file in the directory where the job was submitted. These files are named job_name.o$PBS_JOBID and job_name.e$PBS_JOBID, where the "job_name" is the value supplied with #PBS -N job_name in the batch script.

The *.e* file will contain any output produced on standard error by the job (that was not redirected inside the batch script), and the *.o* file will contain the standard output. In addition, the PBS output file will list the amount of resources requested for the job, the amount of resources used, and a copy of the batch script that was submitted to run the job.

What nodes are my job using?

If your job is actively running, you can get a list of the nodes it is using with jobNodes PBS_JOBID.

Alternatively, you can put cat $PBS_NODEFILE in your batch script, which will print the list of nodes the job is using into the PBS standard output file, where it can be looked up later.

How do I check the status of my nodes?

You can get a brief summary for each node at the command line by issuing gstat -a | head -10; gstat -a | grep -A1 tpb<tpb#> for each of the nodes (<tpb#> is just the numerical portion of the node's hostname, i.e. tpb123.sunnyvale -> <tpb#>=123).

It is suggested that users monitor their jobs via bobMon, but if you'd like more detailed information you can look at the ganglia data for each node. If you are not on the CITA network you will have to create an ssh tunnel to port 80 on julian, as described for bobMon above, and then access ganglia through the tunnel.

Programming Issues

64 bit-isms

Sunnyvale is an em64t/amd64/x86_64 (64 bit) platform. Be aware of the following:

  • FFTW plan variables must be 8 bytes
  • "ld: skipping incompatible ..." messages during linking
    • the linker is finding 32-bit rather than the required 64-bit libraries. Most or all of the 64-bit libraries should be found in /usr/lib64/
  • relocation truncated to fit: R_X86_64_PC32
    • this only seems to be a problem with the Intel compiler.
    • can occur when trying to compile code that contains >2G of static allocations with the Intel compilers. The suggested fix is to compile with -shared-intel
    • if this doesn't work, you may have to change the mcmodel as well (man ifort, search for mcmodel). -mcmodel=medium should suffice for all cases; see the example command below.
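
A hypothetical build command illustrating these flags with the Intel Fortran compiler (the source file name is made up):

 ifort -O3 -shared-intel -mcmodel=medium -o big_static_code big_static_code.f90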

Compilers

Intel (ifort/icc/icpc) and gcc compilers of various versions are available using the modules system.

The following compiler flags may be of use:

 -O3 for full optimizations, -O0 for none
 -CB (ifort) -fbounds-check (gfortran) for runtime array bound checking
 -openmp (intel), -fopenmp (gcc) to enable OpenMP

NOTE: To get the most speed, use processor-specific optimizations appropriate for the node architecture of the queue you are submitting to.

For sandyq and hpq use: -mavx for the fastest code

For greenq use: -mavx2
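
For example, hypothetical build lines for each case with gcc:

 gcc -O3 -mavx  -o my_code my_code.c     # sandyq / hpq nodes
 gcc -O3 -mavx2 -o my_code my_code.c     # greenq nodes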

Run man $COMPILER_NAME for further details.

Debugging

Start by compiling your code with the -g flag to produce a symbol table. It is also a good idea to remove any optimizations (e.g., compile with -O0). Then you can launch an xterm and a debugger (idbe/gdb) for N processes on a development node with:

 mpirun -np N xterm -e gdb a.out

Parallel Programming

Please see the CITA introductory tutorial, which contains a brief presentation on parallel computing, a simple tutorial, and links / suggestions on where to find further information.

A simple MPI util Module (f90) is available at [1]. Also see [2] to understand how to use this Module.

Using MPI

The cluster currently supports OpenMPI and IntelMPI for MPI codes.

We have built openmpi modules specific to your choice of compiler. Check for the available openmpi modules with the command:

 module avail openmpi

And load the appropriate compiler/openmpi combination to activate it e.g.,:

 module load gcc/7.3.0 openmpi/3.0.0-gcc-7.3.0

or

 module load intel/intel-18 openmpi/3.0.0-intel-18

This will load the current default version; run module avail to see what other versions may exist. Newer versions might work better for you.

Intel also provides an MPI library that is optimized for the infiniband network architecture of our fastest queues: sandyq, hpq and greenq. You will obtain the lowest communication overhead using the Intel MPI library with an Intel compiler.

The most recent versions are the defaults and loaded with:

 module load intel intelmpi

Older versions are also available for backwards compatibility with older code or scripts.
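
For example, a hypothetical compile of an MPI Fortran code using the Intel compiler and the Intel MPI wrapper:

 module load intel intelmpi
 mpiifort -O3 -o your_mpi_code your_mpi_code.f90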


Networking

NOTE: There have been a lot of network changes since the time this part of the wiki was written - most of this info is obsolete so please disregard it. We keep it here as a reference for admins.

[Image: SunnyvaleNetwork.png]

[Image: Switch located on each rack]

thin tree

  • all 40 nodes on a given rack connect to a 48-port gigabit switch
  • racks are linked together by connecting the 5 rack switches to a top-level switch via 4-port trunked links
  • all disks, except for the ones named /mnt/scratch/local or /mnt/scratch/rack?, are effectively connected to the top-level switch via a single gigabit link

This has the following implications for network performance:

  • you get full, non-blocking gigabit networking if all the nodes in your job are on the same rack
  • there is a significant bandwidth hit if the nodes in a job span 2 or more racks;
    • e.g. the 40 nodes on rack 1 effectively share a single 4Gb link to all other nodes in the cluster.
  • any and all nodes which read or write to a non-local disk (like /cita/d/scratch-3month) are sharing a single gigabit connection
  • read/write performance is almost certainly going to be maximized if you can make use of the scratch disks that are installed in each rack (e.g. /mnt/scratch/rack?).

fat mesh

[Image: Cables converging to the Fat Mesh switches.]

There is also a fat_mesh network that can be accessed by using the hallowed beer names. This network is composed of switches attached to groups of nodes (5 groups of 8) on each rack (i.e., perpendicular to the thin tree). Each of these switches is fully interconnected with the other 4 fat_mesh switches via 10GigE connections. In theory this should improve off-switch performance by increasing the bandwidth between racks.

The beer names resolve to the thin tree addresses for nodes on the same rack, but use the fat mesh for nodes that exist on other racks.


Which node is in which rack?

  • rack 1: tpb 1-40
  • rack 2: tpb 41-80
  • rack 3: tpb 81-120
  • rack 4: tpb 121-160
  • rack 5: tpb 161-200

A Note On Latency

As you might guess from the network configuration above, MPI latency between two nodes depends on where they sit relative to each other.

For nodes on the same rack or the same fat-mesh stripe, point-to-point latency is around 0.0001 s. Between nodes on different stripes and different racks, the latency increases slightly to around 0.000107 s.

When two nodes are under full load and maxing out their network with MPI communication, the latency jumps to about 0.000160 s.

These figures were obtained by averaging the latencies of millions of MPI sends.

Codes

Please feel free to make your own wiki page for any codes that you have run on Sunnyvale -- these hints and performance results are useful to other users!