Sunnyvale


[Image: The Sunnyvale HPC Cluster]

About

[Image: Sunnyvale Nodes]

Sunnyvale is the Beowulf cluster at the Canadian Institute for Theoretical Astrophysics. It is composed of ~190 Dell PE1950 compute nodes and 18 Dell R410 nodes. Each PE1950 node contains two quad-core Intel(R) Xeon(R) E5310 processors @ 1.60GHz, 4GB of RAM, two GigE network interfaces, and an 80GB disk for swap and local scratch storage. Each R410 node has 8 Intel Xeon E5560 cores @ 2.13GHz, 24GB of RAM, two GigE network interfaces, and ~100GB of local scratch.

Sunnyvale currently runs CentOS (http://www.centos.org/) 6.3 x86_64 GNU/Linux. Our server/compute node image runs from a diskless, read-only NFS root partition using the oneSIS (http://onesis.org/index.php) system imager (v2.0rc10 release + patches; see their mail archive). This entire software stack is free and open source, with no licensing restrictions.

The Sunnyvale cluster's name was inspired by Trailer Park Boys (http://en.wikipedia.org/wiki/Trailer_Park_Boys), a popular Canadian comedy TV/film series.

User Documentation

Accounts

CITA and Sunnyvale accounts are the same: if you have a general CITA account, it also works on the Sunnyvale cluster (and vice versa), and both always share the same password, so there is no need to request a second account if you already have one. If you do not have a CITA account at all, email requests(at)cita.utoronto.ca to request one for use on Sunnyvale.

Passwords

To change your password for your CITA/Sunnyvale account, use the /cita/local/bin/passwd command on the machine 'trinity'. It takes about an hour for password changes to take full effect across all CITA machines and Sunnyvale.
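
For example (a minimal sketch, assuming you can ssh to trinity from a CITA machine):

 ssh trinity
 /cita/local/bin/passwd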

Logging In

Once you are on the CITA network, log into Sunnyvale via one of the two login nodes, bubbles or ricky. These nodes also serve as the cluster development nodes and can be used for compiling codes, running interactive debugging jobs, submitting batch jobs to the queueing system, and transferring data to/from cluster storage. These are the only cluster nodes which you should access directly.
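
For example, from a machine on the CITA network (assuming the short hostnames resolve there):

 ssh bubbles
 # or
 ssh ricky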

Shell Environment

Home disks on Sunnyvale are not the same as those on the CITA workstation network, so you will have to copy any required files to your Sunnyvale /home. For convenience, this filesystem is also accessible on your desktop at the mount point /cita/d/sunny-home.
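
For example, to copy a directory from your CITA workstation home into your Sunnyvale /home via a login node (a sketch; "myproject" is a hypothetical directory name):

 rsync -av ~/myproject bubbles:~/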

The default shell is /bin/bash, but you may also use /bin/tcsh. If you would like your login shell changed, please email requests.

Users also need to write their own shell configuration file. Do not copy the configuration from your /cita/home directory, since it may contain incompatible statements. For a bare-bones configuration that raises the shell limits and uses our default MPI (OpenMPI 1.6.1) and the Intel compilers, put the following in your ~/.tcshrc:

 unlimit                 # remove the shell's resource limits
 limit coredumpsize 0    # ...but keep core dumps disabled
 module load intel       # Intel compilers
 module load openmpi     # default Open MPI (1.6.1)
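
Since the default shell is /bin/bash, a rough ~/.bashrc equivalent would be (a sketch; bash has no single "unlimit" command, so the ulimit lines below are an approximation):

 ulimit -s unlimited     # raise the stack size limit
 ulimit -c 0             # disable core dumps, as with "limit coredumpsize 0"
 module load intel       # Intel compilers
 module load openmpi     # default Open MPI (1.6.1)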

Modules and System Software

All system-installed software is accessible through the module command. A brief summary of how to use modules on Sunnyvale is given on the Modules page. Sunnyvale loads two modules by default -- torque and maui -- but all other modules must be loaded by the user, either manually, within batch scripts, or automatically via the shell start-up definitions in ~/.tcshrc or ~/.bashrc.
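
For example (standard module commands; the intel module is the one used in the shell example above):

 module avail             # list all modules installed on the system
 module load intel        # load the Intel compiler module
 module list              # show the modules currently loaded
 module unload intel      # unload it again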

Any requests for further software installation should be sent to requests@cita.

Storage

RAM

There is 4G RAM/8G swap on each of the PE1950 compute nodes tpb[1-200] and 24G RAM/48G swap on the R410 nodes tpb[201-218].

Disk

The Lustre network filesystems scratch-lustre, raid-cita, and raid-project are all accessible on Sunnyvale nodes under the /mnt prefix, e.g. /mnt/scratch-lustre. Batch jobs should do essentially all of their I/O on the Lustre filesystems.

/home

Home directories are currently capped with a 10GB quota.

Users are asked not to use /home for job output. Since it is served from the cluster head node, having large parallel jobs write to it simultaneously will cause performance problems for the entire cluster. Short and infrequent accesses, such as reading a small configuration file or submitting your job from it, are fine, but heavy use (reading/writing a large file, or very many small files rapidly) is problematic. If we see that you are overloading the home filesystem and slowing down the cluster for other users, we may have to kill your job; please see below for the appropriate heavy-duty filesystems.

/mnt/raid-cita, raid-project

Main page (http://squirrel.cita.utoronto.ca/mediawiki/index.php/Disk_Space#raid-cita)

These are fast parallel filesystems available on the Sunnyvale nodes. All users have at least 10 GB on /mnt/raid-cita; SRAs, post-docs, and staff have 2TB by default. If you have exceptional data requirements, please contact requests@cita and your quota can be increased. The raid-project space is overseen by CITA faculty, and permission from the individual researchers is necessary to get a directory on those filesystems.

You no longer need an other=raid-cita directive to mount disks.

/mnt/scratch-lustre

Main page (http://squirrel.cita.utoronto.ca/mediawiki/index.php/Disk_Space#scratch-3week)

This is a large, fast parallel scratch filesystem, meant for temporary intermediate and final output of cluster jobs. It is for short-term storage and post-processing only; archival data should be migrated to raid-cita or raid-project, since scratch is not backed up in any capacity and data stored on it is not guaranteed against loss (see the Disk Space page for more information).

External storage

The desktop home spaces home-1, home-2, and home-3 are only visible on the head nodes.

/tmp

DO NOT use /tmp to store any files! This filesystem is part of the scratch disk on each node and has to have enough space free for various system daemons to function properly. Any files found in /tmp will be deleted without warning.

/mnt/node_scratch

Local scratch disks are available on all nodes at

/mnt/node_scratch/<username>

They are about 67 GB in size, but check how much space is actually free before relying on them.
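
For example (standard df usage):

 df -h /mnt/node_scratch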

If you are I/O limited by the cluster file systems then you may be better off using these disks. Use sparingly, as this can severely degrade network performance.

Accessing


A node's scratch disk can be accessed from other nodes as /mnt/scratch/<node>, where <node> is the host name of the node, e.g. tpb78. NOTE: autofs access to the node scratch disks is currently turned OFF. It can be reactivated if necessary. - jjd, 2014-03-18

File Deletion

You MUST tidy up after your jobs: remove (rm) all the node_scratch files that you created, at the end of your job.
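
For example, a cleanup line at the end of a batch script might look like this (a sketch, assuming your job wrote its files under /mnt/node_scratch/$USER; note it only cleans the scratch of the node the script itself runs on):

 rm -rf /mnt/node_scratch/$USER/*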

These scratch disks are purged daily of all files older than one week.

WARNING: if the disks are chronically full, a more severe policy will be used: the system will automatically remove all of a user's files as soon as their job ends.

/mnt/scratch/tpbNNN

Each node exports its scratch disk /mnt/node_scratch to all other nodes, where it can be accessed at /mnt/scratch/tpbNNN. NOTE: autofs access to the node scratch disks is currently turned OFF. It can be reactivated if necessary. - jjd, 2014-03-18

PBS

Sunnyvale uses the TORQUE batch system and the Maui job scheduler to process user jobs. This system allows users to submit a batch script (a shell script with a special header) that describes their computing work, which is then held in a queue and run non-interactively once sufficient resources are available.

This software is also used at SciNet, so users should be able to port their batch scripts to Sunnyvale with little modification.

PBS Limits

The batch system limits jobs to 96 nodes (768 cores) when the cluster is busy and 128 nodes (1024 cores) when it is idle, though these limits can change without notice.

Jobs are limited to 48 hours in duration.

If you have special job requirements please contact requests@cita.utoronto.ca to apply for a reservation on the cluster, but reservations are granted sparingly and usually only in exceptional circumstances. We prefer that users make do with the cluster as-is: if you have a job that needs to run for more than 48 hours, add 'checkpoints' to its execution so that you can run a 48-hour job, have it conclude, and then submit a new job that continues where the last one left off, using intermediate files to carry over progress. This policy is in place to keep things fair and give every user a chance to run their jobs.
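
One way to implement such a chain (a sketch, run from a bash shell on a login node; the script names are hypothetical and this assumes your code can restart from its own checkpoint files) is to use TORQUE job dependencies, so the second job only starts after the first completes successfully:

 FIRST=$(qsub run_part1.pbs)
 qsub -W depend=afterok:$FIRST run_part2.pbs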

Submitting Jobs

Jobs are submitted to the queue by executing qsub $BATCH_SCRIPT_NAME.

The following is an example of a batch script for a parallel job that would use 40 cpus (5 nodes) for 6 hours using OpenMPI:

 #!/bin/bash -l
 #PBS -l nodes=5:ppn=8
 #PBS -q workq
 #PBS -r n
 #PBS -l walltime=06:00:00
 #PBS -N my_job_name
 # EVERYTHING ABOVE THIS COMMENT IS NECESSARY, SHOULD ONLY CHANGE nodes,ppn,walltime and my_job_name VALUES
 cd $PBS_O_WORKDIR
 mpirun a.out

NOTE: the first line will initialize the script with your ~/.bash_profile and ~/.bashrc login scripts. If your login shell is /bin/tcsh, then start your scripts with this line instead:

#!/bin/csh
 

The header lines shown above are mandatory and must appear in the script. Set the number of nodes in nodes and the processors per node in ppn. There are four queues: workq, fastq, c6100q, greenq. Note: if you are running single-core serial jobs, all you need to do is set ppn=1 and the batch system will find a node for your code to run on.

For users who just want to use the nodes for serial jobs, an example batch script asking for 1-core on 1-node might be as follows:

 #!/bin/bash -l
 #PBS -l nodes=1:ppn=1
 #PBS -q workq
 #PBS -r n
 #PBS -l walltime=06:00:00
 #PBS -N my_job_name
 # EVERYTHING ABOVE THIS COMMENT IS NECESSARY, SHOULD ONLY CHANGE nodes,ppn,walltime and my_job_name VALUES
 cd $PBS_O_WORKDIR
 ./my_serial_programme my_arguments

In principle, you might submit dozens of jobs like this to complete a specific task.

There is a 48-hour limit for jobs running on Sunnyvale, but the walltime option can (and should) be set to less than this: the scheduler will advance your job up the queue faster if it requests less wallclock time, and accurate requests improve overall job throughput.

Since there are 8 cores per node, users should specify ppn=8 if they want nodes fully allocated to their own job. Submitting with ppn<8 allows multiple jobs to be assigned to the same node, which can oversubscribe the memory and cause both jobs to fail and/or crash the node. If your serial jobs are memory-limited, a good rule of thumb is to specify ppn=(total job memory / 485MB) to prevent this. If each MPI process requires more than 1/8th of a node's memory, the trick is to use the -loadbalance flag with Open MPI jobs. For example, say your code can only use 3 cores per node because of memory requirements and you want to submit an 8-node/24-process job. You would then specify ppn=8 in the batch script but launch the MPI job with:

mpirun -np 24 -loadbalance your_prog ...
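
Putting this together, a hedged sketch of a batch script for the 8-node/24-process case above (the executable name is hypothetical):

 #!/bin/bash -l
 #PBS -l nodes=8:ppn=8
 #PBS -q workq
 #PBS -r n
 #PBS -l walltime=06:00:00
 #PBS -N my_big_memory_job
 cd $PBS_O_WORKDIR
 # request whole nodes (ppn=8) but only start 3 processes per node
 mpirun -np 24 -loadbalance ./my_big_memory_prog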

The following PBS environment variables are useful for tailoring batch scripts:

Variables that contain information about job submission:

 PBS_O_HOST            The host machine on which the qsub command was run.
 PBS_O_LOGNAME         The login name on the machine on which the qsub was run.
 PBS_O_HOME            The home directory from which the qsub was run.
 PBS_O_WORKDIR         The working directory from which the qsub was run.

Variables that relate to the environment where the job is executing:

 PBS_ENVIRONMENT       This is set to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs.
 PBS_O_QUEUE           The original queue to which the job was submitted.
 PBS_JOBID             The identifier that PBS assigns to the job.
 PBS_JOBNAME           The name of the job.
 PBS_NODEFILE          The file containing the list of nodes assigned to a parallel job.
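
For example, a few lines that these variables make possible inside a batch script (a sketch; a.out and the output file name are hypothetical):

 cd $PBS_O_WORKDIR                   # run from the directory the job was submitted from
 NPROCS=$(wc -l < $PBS_NODEFILE)     # total number of cores assigned to the job
 echo "Job $PBS_JOBID ($PBS_JOBNAME) running on $NPROCS cores"
 mpirun -np $NPROCS ./a.out > output.$PBS_JOBID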

Queues

Sunnyvale has four batch queues.

workq: the general-purpose queue, with approximately 190 Dell PowerEdge 1950 nodes, each with 8 cores and 4GB of memory. Most jobs are submitted here. The upper limit on processes per node is ppn=8.

fastq: newer and faster Dell R410 8-core nodes with 24GB of memory each, suitable for jobs requiring more memory per node and slightly greater speed. The upper limit on processes per node is ppn=8.

c6100q: 4x 12-core/48GB nodes. These have a 7-day limit and are suitable for interactive batch sessions requiring nodes with lots of memory. Even though each node has 12 cores, the upper limit on processes per node is set slightly higher, at ppn=16.

greenq: 4x 16-core/128GB nodes, the fastest in the cluster. The upper limit on processes per node is ppn=16.

Remember to add a line such as #PBS -q fastq to your batch script so your job is submitted to the desired queue.

Submitting Interactive Jobs

To submit an interactive job for 2 nodes, execute the following on a login node e.g.,:

 qsub -q QUEUE_NAME -I -X -l nodes=2:ppn=8

where QUEUE_NAME is the name of one of our batch queues, e.g. workq, fastq, c6100q, or greenq.

This will open a shell on the first node of your job. To see which nodes you have been assigned:

 cat $PBS_NODEFILE

One can load modules, start a process, debug, etc. interactively. The job will terminate once you log out of the node that you logged in on, or once the walltime limit has been exceeded.

NOTE: Please use this method sparingly! Most jobs should be submitted non-interactively. If users leave nodes idle while holding them interactively, this policy will change; Sunnyvale is a shared resource and tying up nodes is unfair to other users.

Submitting Jobs on Reserved Nodes

If you want a particular job to run on a reservation made for you then make sure to include the following flag:

-W x=FLAGS:ADVRES:{$RESERVATION NAME}
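
For example (a sketch; the reservation and script names are hypothetical placeholders for whatever the admins give you):

 qsub -W x=FLAGS:ADVRES:my_reservation my_batch_script.pbs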

Monitoring Jobs

bobMon System Monitor

The web-based graphical System_Monitor is now available for Sunnyvale. If you are local, point your browser to bobMon (http://doug.cita.utoronto.ca/bobMon).

If you are logging in remotely you will have to create an ssh tunnel through gw.cita.utoronto.ca to forward the http port. To do so, add the following to your ssh command when connecting: ssh gw.cita.utoronto.ca -L 10101:sunnyvale:80 (this connection has to stay active, so don't close it). You should then be able to view bobMon in your local browser through the tunnel (http://localhost:10101/bobMon/).

Command Line Monitoring

Upon submission, users can monitor their jobs in the queue using qstat -a. The showbf command shows resource availability, and the showstart command suggests when a particular job will begin. Additional queue information and total cluster node usage can be displayed with showq. NOTE: there may be a discrepancy between the number of CPUs shown as used and the real cluster usage. This is a bug in showq triggered by the particular way some individuals request jobs; the scheduler itself works as advertised.

A brief synopsis of all the above is available by running freeNodes.

Job Identifier and Deletion

Each job submitted to the queue is assigned a $PBS_JOBID. This is the number on the left-hand side of the qstat output. If a user would like to stop a running job or delete it from the queue, they should run qdel $PBS_JOBID.
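
For example (the job ID below is hypothetical):

 qstat -a       # find the job ID in the left-hand column
 qdel 123456    # stop/remove job 123456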

PBS Output Files

After a job has finished PBS will place an output file and an error file in the directory where the job was submitted. These files are named job_name.o$PBS_JOBID and job_name.e$PBS_JOBID, where the "job_name" is the value supplied with #PBS -N job_name in the batch script.

The *.e* file will contain any output produced on standard error by the job (that was not redirected inside the batch script), and the *.o* file will contain the standard output. In addition, the PBS output file will list the amount of resources requested for the job, the amount of resources used, and a copy of the batch script that was submitted to run the job.

If you don't receive a PBS job report, it is likely that the disk where your script was launched was unavailable when the job terminated (/cita/d/scratch-3month in particular). Undelivered reports can be found in /var/spool/pbs/undelivered/ on either of the login nodes. It is suggested that users submit all batch jobs from their /home directories to avoid this. If you'd still like your job to run from a different disk, put a change-directory command at the start of your script (after the #PBS header!).

What nodes are my job using?

If your job is actively running, you can get a list of the nodes it is using with jobNodes <PBS_JOBID> (the job ID number from qstat).

Alternatively, put cat $PBS_NODEFILE in your batch script; the list of nodes the job is using will then appear in the PBS standard output file.

How do I check the status of my nodes?

You can get a brief summary for each node at the command line by issuing gstat -a | head -10; gstat -a | grep -A1 tpb<tpb#> for each of your nodes (<tpb#> is just the numerical portion of the node's hostname, i.e. tpb123.sunnyvale -> <tpb#>=123).

It is suggested that users monitor their jobs via bobMon, but if you'd like more detailed information you can look at the ganglia (http://julian/ganglia) data for each node. If you are not on the CITA network you will have to create an ssh tunnel to port 80 on julian, as described above for bobMon, and then view ganglia through the tunnel (http://localhost:10101/ganglia).

Programming Issues

64 bit-isms

Sunnyvale is an em64t/amd64/x86_64 (64 bit) platform. Be aware of the following:

  • FFTW plan variables must be 8 bytes (e.g. integer*8 in Fortran)
  • ld: skipping incompatible message during linking
    • the linker is finding 32-bit rather than the required 64-bit libraries. Most or all of the 64-bit libraries can be found in /usr/lib64/
  • relocation truncated to fit: R_X86_64_PC32
    • this only seems to be a problem with the Intel compilers.
    • it can occur when compiling code that contains >2GB of static allocations. The suggested fix is to compile with -shared-intel.
    • if this doesn't work, you may have to change the memory model as well (man ifort, search for mcmodel); -mcmodel=medium should suffice in most cases. See the example below.
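
For example, a hedged compile line for the static-allocation case above (the source and binary names are hypothetical):

 # Fortran code with >2GB of static arrays
 ifort -O3 -shared-intel -mcmodel=medium -o bigcode bigcode.f90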

Compilers

Intel (ifort/icc/icpc), GCC3 (g77/gcc/g++) and GCC4 (gfortran/gcc4/g++4) compilers are installed. Each version of the Intel compiler gets its own module.

The following compiler flags may be of use:

 -O3 for full optimizations, -O0 for none
 -[a]xT (intel only) for processor specific optimizations
 -CB (ifort) -fbounds-check (gfortran) for runtime array bound checking
 -openmp (intel) -fopenmp (GCC4) to use OpenMP 2.0

Run man $COMPILER_NAME for further details.
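
For instance, hedged examples of an optimized build and a debug build with the Intel Fortran compiler (file names are hypothetical):

 ifort -O3 -xT -o mycode mycode.f90          # optimized, with processor-specific tuning
 ifort -O0 -g -CB -o mycode_dbg mycode.f90   # no optimization, debug symbols, runtime bounds checking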

The ifort version 10.0 beta documentation can be found here (http://www.cita.utoronto.ca/~merz/intel_f10b/Doc_Index.htm).

The icc/icpc version 10.0 beta documentation can be found here (http://www.cita.utoronto.ca/~merz/intel_c10b/Doc_Index.htm).

Debugging

Start by compiling your code with the -g flag to produce a symbol table. It's also a good idea to remove any optimizations. Then you can launch an xterm and a debugger (idbe/gdb) for N processes on a devel node with:

 mpirun -np N xterm -e gdb a.out

Parallel Programming

Please see the CITA introductory tutorial (http://www.cita.utoronto.ca/~merz/pi/), which contains a brief presentation on parallel computing, a simple tutorial, and links / suggestions on where to find further information.

A simple MPI util Module (f90) is available at [1] (http://www.astro.utoronto.ca/~zqhuang/academic/papers/MPIutils.f90). Also see [2] (http://www.astro.utoronto.ca/~zqhuang/academic/papers/hello-mpi.f90) to understand how to use this Module.

Using Open MPI

Open MPI (http://www.open-mpi.org) is the successor to LAM/MPI (http://www.lam-mpi.org). While we've had a good history with LAM/MPI, it is now in a maintenance state.

Once you are familiar with LAM, it's easy to start using Open MPI. Note the following:

  • only requires the following module:
 module load openmpi

This will load the current default version; run module avail to see what other versions exist. Newer versions might work better for you.

  • there are no lamboot/lamhalt commands; mpirun starts and shuts down everything automatically. See man mpirun for more details.
  • make sure to use the openmpi version of fftw v2.1.5

If you'd like to use the Intel compilers, set the Open MPI wrapper compilers to use them in your ~/.tcshrc:

 setenv OMPI_MPICC icc
 setenv OMPI_MPICXX icpc
 setenv OMPI_MPIFC ifort
 setenv OMPI_MPIF77 ifort
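
If your login shell is bash, the equivalent lines for ~/.bashrc are:

 export OMPI_MPICC=icc
 export OMPI_MPICXX=icpc
 export OMPI_MPIFC=ifort
 export OMPI_MPIF77=ifort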

Using Mpich2

NOTE: In the latest upgrade, we have not built the mpich2 module. If you need this flavour of MPI in particular, please send a note to requests@cita.

mpich2 (http://www-unix.mcs.anl.gov/mpi/mpich2/) is another MPI implementation installed on the cluster. If you experience problems with LAM/Open MPI, you can try mpich2 to see if it works for your code.

  • requires the following module (make sure to unload mpi/lam/openmpi first):
 module load mpich
  • requires an "mpd ring" which is a set of daemons that runs on each node and facilitates MPI.
    • this is invoked using the mpdboot command, and cleaned up with mpdallexit
  • One also needs to create the mpd configuration file, ~/.mpd.conf. This file should contain an entry like the following:
MPD_SECRETWORD=notthatsecret
    • Note that you should not use your login password for this - it can be any string and is basically irrelevant.
  • Also note that the mpich commands are broken (mpdboot and mpirun in particular).
    • One needs to specify the full path to any files (executables or hostfiles), as it won't find them in the present working directory.

To debug, log into a devel node and:

 mpif77 file.f90
 mpdboot
 # note the full path to a.out
 mpirun -n 8 /home/merz/mpich/a.out
 mpdallexit

A sample batch script for using mpich2 with PBS is:

#!/bin/csh 
#PBS -l nodes=8:ppn=8
#PBS -q workq 
#PBS -r n
#PBS -l walltime=00:35:00
set NNODES=8
set NCPUS=64
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > nodes
foreach NN (`cat $PBS_NODEFILE | uniq`)
 echo `echo $NN | cut -f1 -d.` >> machinefile.$PBS_JOBID
end
mpdboot -n $NNODES -f $PBS_O_WORKDIR/machinefile.$PBS_JOBID -v
mpiexec -n $NCPUS $PBS_O_WORKDIR/a.out 
mpdallexit
unset NNODES
unset NCPUS

Using ClusterOpenMP

NOTE: This is not set up with the current Sunnyvale configuration. (Dec-2012)

Potentially a quick way to write parallel code, as long as care is taken not to access memory randomly.

Here is Neal Dalal's experience:


On Wed, Mar 11, 2009 at 01:56:17PM -0400, Neal Dalal wrote:
 Hi Mike,

 Intel describes cluster-openmp at the web page:
 http://software.intel.com/en-us/articles/cluster-openmp-for-intel-compilers/

 That page has a detailed user manual on it.

 On sunnyvale, here's what I did to use cluster-openmp.

 0. First you need a license file from Intel.  I have a license in
 ~neal/intel/licenses/CLOMP_20100630.lic , until you guys get a
 system-wide license.

 1.  Compiling your openmp code with cluster-openmp then just involves
 replacing -openmp with -cluster-openmp.  For example, I compiled the
 following hello_world code with:

 icc -cluster-openmp hello.c -o hello

 ______________

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <omp.h>

 int main()
 {
   int i, n=100;
   char *string;

 #pragma omp parallel default(none) private(i,string) shared(n)
   {    // begin parallel region
     string = malloc(n);
     gethostname(string,n);
     i = omp_get_thread_num();
     printf("hello from thread %d on host %s\n", i, string);
     free(string);
   }    // end parallel region
   return 0;
 }

 ______________

 2.  To run the program, you need a file kmp_cluster.ini in the same
 directory which describes how many threads to use, which hosts to run on,
 etc.  Here are the contents of the file that I used:

 [neal@tpb5 test]$ cat kmp_cluster.ini
 --process_threads=3 --hostfile=hosts --launch=ssh --sharable_heap=100M

 This says that the list of hosts to use is in the file "hosts" in the
 current directory, and that 3 threads per process should be spawned.  It
 also specifies that 100MB on each node is reserved for "shared" memory.

 3.  Here's an example pbs script:

 [neal@tpb5 test]$ cat test.pbs
 #!/bin/bash
 #PBS -l nodes=2:ppn=8
 #PBS -q workq
 #PBS -r n
 #PBS -l walltime=1:00

 cd $PBS_O_WORKDIR
 uniq < $PBS_NODEFILE > hosts
 ./hello > hello.out

 ____________

 This first generates the "hosts" file which lists the hosts to run on,
 and then runs the program.

 Here is the output:

 hello from thread 0 on host tpb5
 hello from thread 2 on host tpb5
 hello from thread 1 on host tpb5
 hello from thread 3 on host tpb4
 hello from thread 5 on host tpb4
 hello from thread 4 on host tpb4

 _____________

 I hope this helps.  There's a lot more information on the web page that I
 mentioned above.  For example, you can start everything up using mpirun,
 instead of making the "hosts" file.

 best,
 Neal



Networking

NOTE: There have been a lot of network changes since the time this part of the wiki was written - most of this info is obsolete so please disregard it. We keep it here as a reference for admins.

[Image: Switch located on each rack]

thin tree

  • all 40 nodes on a given rack connect to a 48-port gigabit switch
  • racks are linked together by connecting the 5 rack switches to a top-level switch via 4-port trunked links
  • all disks, except for the ones named /mnt/scratch/local or /mnt/scratch/rack?, are effectively connected to the top-level switch via a single gigabit link

This has the following implications for network performance:

  • you get full, non-blocking gigabit networking if all the nodes in your job are on the same rack
  • there is a significant bandwidth hit if the nodes in a job span 2 or more racks;
    • e.g. the 40 nodes on rack 1 effectively share a single 4Gb link to all other nodes in the cluster.
  • any and all nodes which read or write to a non-local disk (like /cita/d/scratch-3month) are sharing a single gigabit connection
  • read/write performance is almost certainly going to be maximized if you can make use of the scratch disks that are installed in each rack (e.g. /mnt/scratch/rack?).

fat mesh

[Image: Cables converging to the Fat Mesh switches.]

There is also a fat_mesh network that can be accessed by using the hallowed beer names. This network is composed of switches attached to groups of nodes (5 groups of 8) on each rack (i.e., perpendicular to the thin tree). Each of these switches is fully interconnected with the other 4 fat_mesh switches via 10GigE connections. In theory this should improve off-switch performance by increasing the bandwidth between racks.

The beer names resolve to the thin tree addresses for nodes on the same rack, but use the fat mesh for nodes that exist on other racks.


Which node is in which rack?

  • rack 1: tpb1-40
  • rack 2: tpb41-80
  • rack 3: tpb81-120
  • rack 4: tpb121-160
  • rack 5: tpb161-200

A Note On Latency

As you might guess from the network configuration above, the MPI latency between two nodes depends on where they are relative to each other.

Between nodes on the same rack or the same fat-mesh stripe, point-to-point latency is around 0.0001 s (100 microseconds). Between nodes on different stripes and different racks it increases slightly, to around 0.000107 s.

When two nodes are under full load and maxing out their network links with MPI communication, the latency jumps to about 0.000160 s.

These figures were obtained by averaging the latencies of millions of MPI sends.

Codes

Please feel free to make your own wiki page for any codes that you have run on Sunnyvale -- these hints and performance results are useful to other users!