CubePM

About

CubeP3M (http://arxiv.org/abs/1208.5098) is an N-Body simulation code that can be used to study large-scale cosmological structure formation. CubeP3M is the successor to PMFAST (http://www.cita.utoronto.ca/~merz/pmfast/). In addition to the features provided by PMFAST it contains support for:

  • gas evolution through use of a TVD MHD module
  • particle-particle interactions at sub grid cell distances and beyond
  • scaling up to (and hopefully beyond) tens of thousands of cores, as well as shared-memory parallelization via OpenMP to optimize memory usage on SMP nodes
  • particle identification tags

For further info, contact Joachim Harnois-Deraps (http://www.cita.utoronto.ca/~jharno).

General Requirements

CubeP3M is written in Fortran.

  • In order to compile CubeP3M (and all the assorted utilities) without source modification you will need binary file support in the compiler (Intel/Compaq/etc.). One can compile with unformatted files by omitting the -DBINARY flag in the makefile, but one needs to check whether unformatted I/O is implemented in the post-processing routines. Full gfortran support is on the to-do list... :-(

The code has been successfully ported to the IBM machines on SciNet, with the use of the IBM xlf compilers.

  • In order to use threading, the compiler will have to support OpenMP v1.0 or above.

It requires an MPI-1 implementation (LAM, MPICH, etc)

It uses the FFTW (http://www.fftw.org) 2.1.5 library.

  • Make sure it is compiled with single precision and with MPI and threads enabled.
  • Since the binaries are often rather large in order to exploit the available resources, you might need the -mcmodel=medium flag at compile time as well.

Transfer functions to produce particular cosmological initial conditions can be generated using the online CAMB (http://lambda.gsfc.nasa.gov/toolbox/tb_camb_form.cfm) toolbox.

Overview of the code

CubeP3M is composed of multiple "utilities" that run in a pipeline in order to conduct the simulation.

The first step is to generate initial conditions. This is done using the dist_init utility. dist_init is a program that will read in a CAMB (http://lambda.gsfc.nasa.gov/toolbox/tb_camb_form.cfm) transfer function to model the particular cosmology of the simulation. It uses the Zel'dovich approximation to calculate density perturbations which are then applied to the initial particle state. This state is written to disk as a decomposed particle listing for use in CubeP3M.
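
As a rough illustration of the Zel'dovich step (not the actual dist_init source: the array names, the precomputed displacement field psi and the normalization are all assumptions), the particle setup amounts to displacing each particle from its grid point along psi, with velocities set by the growth rate:

  ! Minimal Fortran sketch of the Zel'dovich step; psi would be obtained from
  ! the CAMB transfer function via the linear density field, omitted here.
  program zeldovich_sketch
    implicit none
    integer, parameter :: np_dim = 32                  ! particles per dimension (toy value)
    real :: psi(3, np_dim, np_dim, np_dim)             ! displacement field at the grid points
    real :: xp(3), vp(3)
    real :: D, dDdt                                    ! growth factor and its time derivative
    integer :: i, j, k

    psi = 0.0                                          ! stand-in for the real displacement field
    D = 1.0
    dDdt = 1.0

    do k = 1, np_dim
      do j = 1, np_dim
        do i = 1, np_dim
          xp = (/ real(i) - 0.5, real(j) - 0.5, real(k) - 0.5 /)   ! unperturbed grid position
          xp = xp + D * psi(:, i, j, k)                ! Zel'dovich displacement
          vp = dDdt * psi(:, i, j, k)                  ! velocity follows the growth rate
          ! ... write xp, vp to the distributed particle list here ...
        end do
      end do
    end do
  end program zeldovich_sketch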

The main CubeP3M code reads in these initial conditions, which have been prepared for some initial redshift, and then proceeds to evolve them in an expanding universe until the desired final redshift has been obtained. CubeP3M runs in parallel and is based on a cubical decomposition of the simulation volume. During the evolution one may produce distributed particle list checkpoints of the simulation, halo catalogs, and density projections, all of which can be further refined using post-processing utilities:

  • cic_power generates gravitational power spectra using distributed particle checkpoints in parallel
  • pgm_proj generates pgm images of the density projections
  • halo_merge reduces the distributed halo catalogs to single files
  • PSvsSim produces a comparison of the simulation halo mass function against the Press & Schechter and Sheth & Tormen theoretical predictions
  • recompose reduces the distributed particle checkpoints to single files
  • dm_slice_sample produces a particle listing corresponding to a thin slice through the simulation volume (not maintained).

In addition, CubeP3M can be used with an optional TVD MHD module to simulate the baryonic component of matter. dist_init supports the creation of this initial gas field, which can be read into CubeP3M and evolved alongside the dark matter particles. The gas field is checkpointed along with the particles, and post-processing routines are included to calculate the power spectrum and density projections in 1 and 2 dimensions.

CubeP3M can also be run using particle-particle interactions on scales larger than 1 particle mesh cell in length. This takes more time to run, but the gain in accuracy is significant.

Basic CubeP^3M algorithm

The code performs the following:

initialization

  • initialize MPI
  • initialize global variables
  • initialize gravitational kernels
    • both fine and coarse mesh
  • initialize particle listing
    • read in from distributed checkpoint or IC files, or computed using an arbitrary method

main PM loop

  • calculate timestep to use for current iteration
    • calculate expansion using Friedmann equation
    • limited by the maximum acceleration from the previous timestep and by the expansion constraint
  • update positions using leapfrog method in half-timestep increments
    • merge the update from the final half of the previous step with the first half of the current step, as the velocities of the particles do not change (just dt)
  • construct a linked list for the particles based on their location within the local volume. This dramatically speeds up searches through the particle list to find particles in a given coarse cell (see the sketch after this list)
    • a separate linked list is created for each coarse mesh grid cell (4^3 fine mesh cells). A "head-of-chain" array (hoc) contains the index of the first particle in that cell, or 0 if no particles exist in the cell. Each particle in the linked list (ll array) then points to the next particle, and so on, until the final particle, which points to 0
  • pass particles amongst nodes
    • particles that move outside of the local physical volume need to be passed to adjacent nodes, as well as copies of the local particles that are required for buffers
    • each dimension is treated independently and done synchronously, reducing the number of communication messages to 6 from a possible 26 neighbors. As each pass is done the particles are added to the linked list.
  • calculate fine mesh accelerations and apply to particles
    • each fine mesh tile is computed independently by solving the Poisson equation in Fourier space
      • first compute the density of the tile using either the NGP or CIC allocation scheme, and Fourier transform it
      • take the convolution of the Fourier-space density and the Fourier-space gravitational potential/force kernel
      • transform back to real space to obtain either the gravitational potential or force field.
      • apply the accelerations to the particles using the same allocation scheme as above (CIC or NGP)
      • calculate particle-particle forces between particles within a fine-mesh grid cell at this point and update their velocities with the resulting acceleration
        • optionally extend the range of this pp force outside of a single fine mesh cell.
  • calculate the coarse mesh (long-range) accelerations and apply to particles
    • the entire coarse mesh corresponding to the local node's volume is done at once
    • calculate the coarse mesh density, convolve with the coarse mesh gravitational kernels to get the potential/force field, and then apply to particles.
    • FFTs in this step are done in parallel by all nodes at once.
  • optionally update particle positions to the end of the timestep and perform a checkpoint/projection/halo catalog
    • if we have reached the final redshift, end simulation
  • delete particles outside of physical volume
  • repeat until final redshift is reached
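
A minimal Fortran sketch of the hoc/ll linked list described above (the array names follow the text; the sizes and the cell lookup are illustrative assumptions, not the actual cubep3m source):

  program linked_list_sketch
    implicit none
    integer, parameter :: nc_node_dim = 16             ! coarse cells per dimension on this node
    integer, parameter :: np = 1000                    ! particles on this node
    real    :: xv(3, np)                               ! positions in coarse-cell units
    integer :: hoc(nc_node_dim, nc_node_dim, nc_node_dim)
    integer :: ll(np)
    integer :: ip, i, j, k

    call random_number(xv)
    xv = xv * nc_node_dim                              ! scatter particles over the local volume

    hoc = 0                                            ! 0 marks an empty cell / end of chain
    ll  = 0
    do ip = 1, np
      i = floor(xv(1, ip)) + 1                         ! coarse cell containing the particle
      j = floor(xv(2, ip)) + 1
      k = floor(xv(3, ip)) + 1
      ll(ip) = hoc(i, j, k)                            ! new particle points to the previous head
      hoc(i, j, k) = ip                                ! and becomes the new head of chain
    end do

    ! traversal: visit every particle in coarse cell (1,1,1)
    ip = hoc(1, 1, 1)
    do while (ip /= 0)
      ! ... operate on particle ip ...
      ip = ll(ip)
    end do
  end program linked_list_sketch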

finalization

  • clean up MPI/FFT library

Decomposition

[Figure: particle list and mesh (volume) decomposition amongst nodes]
[Figure: fine mesh decomposition within a node]

The decomposition of the simulation volume dictates how various parameters are set for the simulation and limits permissible configurations. CubePM contains 2 major levels of decomposition that correspond to a coarse and a fine mesh. The superposition of the forces produced by each of these mesh levels, as well as an optional particle-particle force, produces the total force on a particle in a given iteration of the code.

The simulation volume is evenly decomposed into cubic sub-volumes, so CubeP3M must be run on an integer-cubed number of nodes. Each sub-volume corresponds to the local section of the mesh, which is calculated in parallel by all nodes during each timestep. Since the coarse mesh has 4x fewer cells per dimension than the fine mesh (which can be considered the equivalent single-level particle-mesh resolution), the distributed coarse-mesh work is 64x smaller than the fine mesh simulation volume and does not suffer the performance bottleneck of calculating massive distributed FFTs.
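
To make these relations concrete, here is a small sketch for the 4^3-node, 2048^3-fine-mesh example used later on this page (the variable names are descriptive, not the actual cubep3m parameter names):

  program decomposition_sketch
    implicit none
    integer, parameter :: nodes_dim       = 4                      ! nodes = nodes_dim**3 = 64
    integer, parameter :: nf_physical_dim = 2048                   ! global fine cells per dimension
    integer, parameter :: nc_physical_dim = nf_physical_dim / 4    ! coarse mesh is 4x coarser
    integer, parameter :: nf_node_dim     = nf_physical_dim / nodes_dim
    integer, parameter :: nc_node_dim     = nc_physical_dim / nodes_dim

    print *, 'fine   cells / node / dimension:', nf_node_dim       ! 512
    print *, 'coarse cells / node / dimension:', nc_node_dim       ! 128
    print *, 'coarse-to-fine cell count ratio:', &
             real(nc_physical_dim)**3 / real(nf_physical_dim)**3   ! 1/64
  end program decomposition_sketch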


Since the decomposition is static in volume and the particle distribution becomes unbalanced in late-epoch evolution, nodes might a) run out of memory to store their local particles and b) slow down as every node will have to wait to synchronize with the node that has the most particles.

Each nodal cuboid is then decomposed into a number of fine mesh cuboid tiles for calculating the fine mesh forces. This is done to provide distinct work units for each of the threads on a node to process in parallel, as well as to decrease the amount of memory used to store the arrays associated with the fine mesh gravitational calculation.

Permissible simulation sizes (particles/mesh sizes, not cosmological volumes) are dependent on the gravitational force/potential kernels that are used to match the fine and coarse mesh forces. These kernels are numerical tables and can be constructed to allow for either cloud-in-cell or nearest-grid-point mass/force assignment and the ability to use either a force or potential calculation to determine the applied accelerations. In general we use a fine-mesh buffer that is 24 cells in extent to accurately calculate forces within the local volume. There are also restrictions on the simulation size due to the decomposition of the coarse mesh volume, which is necessary to do the coarse mesh FFTs in parallel. In most cases one must use the FFTW 2.1.5 library, which only supports slab decomposition (i.e. decomposition along one dimension), although if a cuboid-decomposition FFT is available (the IBM BlueGene library, for example) it will allow for more flexibility in permissible values.

Halo finding algorithm

The halo finder in CubeP3M is basically a spherical overdensity halo finder. It runs in-line with the code, and each node independently finds halos within its local volume. Each fine mesh tile is considered one at a time.

The code first builds the fine-mesh density (NGP allocation in the cubep3m version, CIC for PM-only) for the tile. It then catalogs all local density maxima within the physical volume (excluding buffers) for the tile. For each maximum it uses parabolic interpolation to determine the location of the peak within the cell, and records the value of the density for the cell and this peak position.

Once the list of maxima is generated they are sorted from highest to lowest. Then each of these candidates is inspected independently, starting with the highest peak. We accumulate the grid mass in shells surrounding the maximum until the average density of the halo drops below the overdensity cutoff (halo_odc). As we accumulate mass we remove it from the mesh, so that no mass is accumulated twice. This means that sub-halos are effectively consumed by the larger halos. Analytical corrections are applied to the mass of the smallest halos to account for the discretization of the grid.
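
A simplified Fortran sketch of this shell accumulation around a single peak (a toy stand-in for the actual halofind routine; the array names, the shell stepping and the toy density are assumptions):

  program halo_growth_sketch
    implicit none
    integer, parameter :: nc = 32                      ! fine cells per tile side (toy value)
    real    :: rho(nc, nc, nc), halo_mass, halo_r
    integer :: ipeak(3)

    rho = 0.2                                          ! smooth toy background density
    ipeak = (/ nc/2, nc/2, nc/2 /)
    rho(ipeak(1), ipeak(2), ipeak(3)) = 500.0          ! a single toy density peak
    call grow_halo(rho, nc, ipeak, 178.0, halo_mass, halo_r)   ! halo_odc = 178 as an example
    print *, 'halo mass =', halo_mass, ' radius =', halo_r

  contains

    subroutine grow_halo(rho, nc, ipeak, halo_odc, halo_mass, halo_r)
      integer, intent(in)    :: nc, ipeak(3)
      real,    intent(inout) :: rho(nc, nc, nc)        ! fine mesh density of the tile
      real,    intent(in)    :: halo_odc               ! overdensity cutoff
      real,    intent(out)   :: halo_mass, halo_r
      real, parameter :: pi = 3.141592654
      real    :: r, r2
      integer :: i, j, k, irange

      halo_mass = 0.0
      do irange = 0, nc                                ! grow the sphere one cell at a time
        r = real(irange) + 0.5
        do k = max(1, ipeak(3)-irange), min(nc, ipeak(3)+irange)
          do j = max(1, ipeak(2)-irange), min(nc, ipeak(2)+irange)
            do i = max(1, ipeak(1)-irange), min(nc, ipeak(1)+irange)
              r2 = real(i-ipeak(1))**2 + real(j-ipeak(2))**2 + real(k-ipeak(3))**2
              if (r2 <= r*r .and. rho(i,j,k) > 0.0) then
                halo_mass  = halo_mass + rho(i,j,k)    ! accumulate the grid mass ...
                rho(i,j,k) = 0.0                       ! ... and remove it from the mesh,
              end if                                   !     so nothing is counted twice
            end do
          end do
        end do
        if (halo_mass / (4.0/3.0*pi*r**3) < halo_odc) exit   ! mean density fell below the cutoff
      end do
      halo_r = r
    end subroutine grow_halo

  end program halo_growth_sketch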

One of the problems with this approach is that if a halo is larger than the buffer region surrounding a tile it will be artificially truncated. In practice this should not be a problem unless one uses a small box.

After this stage we do another pass through the halo list, this time using the mass measured above to determine the radius of the halo, considering it as a uniform sphere with the given overdensity. All particles within the radius are used to calculate more elaborate halo characteristics.

The halo catalog starts with an integer(4) that indicates the number of halos for the given node, followed by an entry for each halo containing:

halo_pos(1:3) -- position of halo in global coordinates determined via parabolic interpolation from the mesh density
x_mean(1:3)   -- center of mass of the halo determined from particles within radius of halo
v_mean(1:3)   -- average velocity of halo 
l(1:3)        -- angular momentum of halo
v_disp        -- velocity dispersion
radius_calc   -- radius calculated by using the halo mass from the mesh
halo_mass     -- halo mass calculated from the initial mesh calculation
imass*mass_p  -- halo mass calculated by accumulating all particles within radius_calc
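
A hypothetical reader for a single-node catalog, assuming the file was written with the -DBINARY option (raw binary, no Fortran record markers) and contains exactly the 16 real(4) values listed above per halo; the file name is a placeholder, and you should check the halofind/halo_merge sources for the actual layout and column count of your version:

  program read_halo_catalog_sketch
    implicit none
    integer(4) :: nhalo, ih
    real(4)    :: halo_data(16)      ! halo_pos, x_mean, v_mean, l, v_disp,
                                     ! radius_calc, halo_mass, imass*mass_p
    open (unit=11, file='halo_catalog.dat', form='unformatted', &
          access='stream', status='old')
    read (11) nhalo
    do ih = 1, nhalo
      read (11) halo_data
      print *, 'halo', ih, ' pos:', halo_data(1:3), ' grid mass:', halo_data(15)
    end do
    close (11)
  end program read_halo_catalog_sketch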

More kinematical variables are available, such as the energy of the 10 most bound particles, an inertia matrix, etc. The parallel halo_merge utility can be used to merge the catalogs from all of the nodes into a single catalog for each redshift. One has to make sure that the number of columns matches that of the run-time halo finder.


Ilian Iliev (http://www.cita.utoronto.ca/~iliev) also wrote a code (utils/PSvsSim/PS_ST_sim.f90) that can be used to plot the halo mass function versus the Press-Schechter and Sheth-Tormen analytical predictions for each of the merged catalogs (it's a serial program, so one can run it at any point after the simulation is completed).

Threading

The growing number of clusters composed of multi-core nodes presents a problem for weak-scaling applications like CubeP^3M. One can either try to maximize the memory usage per node and run only 1 MPI process per node, or else maximize the speed of calculation and run an MPI process on each core of the node. Unfortunately, the former is slow and doesn't fully utilize the machine, and the latter reduces the amount of memory that can be used for the actual simulation volume, since each MPI process requires buffer memory for ghost particles, etc. In addition, running 1 MPI process / core further unbalances the particle load: density variations increase as one decomposes the simulation volume into finer sub-volumes. This causes the simulation to slow down unnecessarily and may force it to end prematurely once the memory available to the most densely populated core has been exhausted.

In order to maximize memory and cpu usage, we have implemented a hybrid OpenMP/MPI version of CubePM that uses threads within a multi-core node to process the time-consuming fine mesh force update. As the fine mesh tile size can be made relatively small, the memory overhead of having a separate set of fine mesh arrays for each thread is much less than that required to use multiple MPI processes / node (primarily due to the large buffer region required for the particle list).

The basic idea is to use each core to process a different fine mesh tile. Additionally, other time-consuming loops in the rest of the "serial" code have been threaded, which contributes to an additional speed-up. One must keep in mind, though, that not all pre/post-processing utilities are threaded.
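
A minimal sketch of this tile-level threading (the routine name and the tile count are illustrative; in the real code each thread works on its own private copy of the fine mesh arrays). Compile with OpenMP enabled, e.g. -openmp or -fopenmp:

  program tile_threading_sketch
    use omp_lib
    implicit none
    integer, parameter :: tiles_node = 64              ! fine mesh tiles on this node (4 per dimension)
    integer :: tile

    !$omp parallel do default(shared) private(tile) schedule(dynamic)
    do tile = 1, tiles_node
      call process_fine_tile(tile, omp_get_thread_num())
    end do
    !$omp end parallel do

  contains

    subroutine process_fine_tile(tile, thread)
      integer, intent(in) :: tile, thread
      ! in the real code: build the tile density, convolve with the fine mesh
      ! kernel in Fourier space, then apply the mesh and pp forces, all on this
      ! thread's private copy of the fine mesh arrays
      print *, 'tile', tile, 'processed by thread', thread
    end subroutine process_fine_tile

  end program tile_threading_sketch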

Particle-particle interactions

In order to resolve forces between particles at sub-grid cell distances, the particle-particle interactions need to act on all sub-grid particles. In order to avoid particles feeling the force of their neighbors multiple times, an NGP force kernel is used to calculate the fine mesh force. This kernel has the property that forces amongst particles within a grid cell are 0, and it provides the simplest method by which to match the particle-particle force with the fine mesh force.

We calculate the additional PP force during the fine mesh velocity update, which allows us to avoid loading the particle list twice and to thread the operation without any significant additional work. In this routine, all of the particles within a fine mesh tile are read in via the linked list and the appropriate force is applied depending on where they exist within the fine mesh tile. As the linked list chains particles within a coarse mesh cell, for the particle-particle interaction we construct a set of fine-cell linked-list chains within each coarse cell. We then calculate the particle-particle force between all particles within a fine mesh cell, limiting pairs to those whose separation exceeds a softening length to prevent scattering. As this proceeds, we also accumulate the PP force applied to each particle and then determine the maximum force applied to any particle, which is then used to constrain the length of the global timestep.
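
A minimal sketch of the softened pairwise force between the particles of one fine mesh cell (a simplified stand-in for the threaded pp loop; the variable names and units are illustrative, and G and the particle masses are set to 1):

  program pp_force_sketch
    implicit none
    integer, parameter :: np_cell = 8                  ! particles in this fine cell
    real :: xv(3, np_cell), force(3, np_cell)

    call random_number(xv)                             ! scatter a few particles in the cell
    force = 0.0
    call pp_cell_force(np_cell, xv, force, 0.1)        ! rsoft = 0.1 fine cells (the default)
    print *, 'pp force on particle 1:', force(:, 1)

  contains

    subroutine pp_cell_force(np, x, f_acc, rsoft)
      integer, intent(in)    :: np
      real,    intent(in)    :: x(3, np), rsoft
      real,    intent(inout) :: f_acc(3, np)
      real    :: sep(3), r2, f(3)
      integer :: i, j
      do i = 1, np - 1
        do j = i + 1, np
          sep = x(:, j) - x(:, i)                      ! separation vector from i to j
          r2  = sep(1)**2 + sep(2)**2 + sep(3)**2
          if (r2 > rsoft**2) then                      ! pairs inside the softening length feel no pp force
            f = sep / (r2 * sqrt(r2))                  ! attractive 1/r^2 force on particle i
            f_acc(:, i) = f_acc(:, i) + f
            f_acc(:, j) = f_acc(:, j) - f              ! equal and opposite force on particle j
          end if
        end do
      end do
    end subroutine pp_cell_force

  end program pp_force_sketch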

The biggest problem with this method is that it is anisotropic and depends on the location of the fine mesh with respect to the particles. An example is two particles on either side of a grid cell boundary: these particles experience the 1-grid-cell-separation NGP mesh force, but if the mesh were shifted such that they were within the same cell, they would experience a larger force. On average we assume that this balances out, and we have even tuned the NGP force kernel to provide as unbiased a force as possible (over a statistical ensemble of particles) at grid cell distances by running large numbers of pairwise force comparisons at different separations and positions relative to the mesh.

This effect is even more pronounced at the initial stages of the simulation where the density is more homogeneous, and leads to mesh artifacts appearing in the density field. In order to minimize this effect we have taken to shifting the particle distribution relative to the mesh by a random amount, up to 2 fine grid cells in magnitude, in each dimension, at each timestep. This adds negligible computational overhead as it can be applied during the particle position update and suppresses this behavior over multiple timesteps.
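
A tiny sketch of that random offset (in the real code the equivalent shift is folded into the position update; the names here are illustrative):

  program mesh_shift_sketch
    implicit none
    integer, parameter :: np = 1000
    real :: xv(3, np), shift(3)

    call random_number(xv)                             ! stand-in particle positions (fine cell units)

    ! redrawn at every timestep: a uniform offset of up to 2 fine grid
    ! cells in each dimension, applied to the whole particle distribution
    call random_number(shift)
    shift = 2.0 * shift
    xv(1, :) = xv(1, :) + shift(1)
    xv(2, :) = xv(2, :) + shift(2)
    xv(3, :) = xv(3, :) + shift(3)

    print *, 'applied mesh offset (fine cells):', shift
  end program mesh_shift_sketch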

One of the major new developments for the code is the extension of this PP force to distances greater than 1 grid cell. This slows down the calculation, but due to the 1/r^2 nature of the force, the matching between the PP and fine mesh forces is greatly improved.

Obtain, Compile and Run!

Modules

On Sunnyvale, make sure you load the following modules for LAM and the Intel compilers:

   module load intel/intel-10.0b.017                                     
   module load lam/lam-7.1.3-intel
   module load fftw/2.1.5-intel10

On the SciNet GPC cluster, you will need:

   module load intel/12.1.3
   module load intelmpi/4.0.3.008
   module load fftw/2.1.5-intel-intelmpi4

Get the Source

The latest source code is on github (https://github.com/jharno/cubep3m).

One can also download a tar ball at

  www.cita.utoronto.ca/~jharno/cubep3m

but you should contact the code developers to ensure it is up to date.

Calculating Memory Usage and Simulation Size

Enter the batch directory:

 cd ~/cubep3m_threads/batch/

If you're compiling from source change the root cubepm directory accordingly. You will also have to build the mem_usage binary - just running ./init_pp_threads.csh in the batch directory should suffice.

Run mem_usage to figure out how large of a simulation you want to run:

 >./mem_usage
  enter nodes_dim
  4

This will set the number of nodes of the simulation to be 64 (nodes_dim^3 nodes total)

  enter cores / node
  8

This is the number of cores / node you'd like to use (sets number of threads)

  enter cells / coarse slab 
  8

This number and nodes_dim determine the mesh size of the simulation. It should be an integer, but not all values will decompose properly. In this case, each coarse slab is 8 coarse cells thick and we have 4**3 = 64 slabs (one per node), for a total of 512 coarse cells, or 2048 fine cells, along each dimension.


  enter number of fine mesh tiles / node
  4

This will determine the granularity of the work-chunk size assigned to each thread. More tiles = smaller work chunk (and less memory usage) but it increases the amount of time spent re-calculating buffer regions. Set to taste.

  enter fraction of density to allow for buffer (1.0 == no buffer, 2.0 == 2x avg 
  density, etc)
  1.5

This sets the size of the particle list and associated buffers. It can be any real. In general, smaller boxsizes (larger fluctuations) and late evolution require more buffer space.

After entering your desired values, the code will return various parameters relating to the simulation. The first group contains information about the decomposition and total simulation size. The number of fine cells / tile should be used for the value of nf_tile, which we will get to shortly.

number fine cells / tile =         176
total simulation volume =        2048
coarse mesh volume =         512
total number of particles =  1.0737418E+09
maximum particles in particle list =    32928000
maximum buffer particles =    10976000

The second group contains the memory usage for large arrays in the code. The sum of these should give a rough idea as to how much memory you are going to use:

size of fine mesh density=   336.5312    
size of fine mesh kernel=   10.51660    
size of fine mesh force=   201.1414    
size of buffer particle list=   502.4414    
size of particle list=   879.2725    
size of density projections=   48.00000    
size of gas and magnetic field=   4241.694    

The gas and magnetic fields are not allocated unless you are going to do a simulation using the TVD MHD module (currently not threaded), so you can ignore the last value in this case. This simulation would use approx. 1876MB of memory per node, so you could try to increase the number of cells / coarse slab to increase the resolution.

Configuring and Building the Code

In the batch directory, edit the 'parameters' file. Hopefully it is well enough documented. In particular, set nf_tile to the value of "number fine cells / tile" reported by mem_usage.

Now edit batch_parameters.csh, followed by cubep3m.pbs. In particular NUM_NODES_COMPILED should be set to nodes_dim^3, as well as the value of nodes in cubep3m.pbs:

 #PBS -l nodes=?:ppn=8

Once that is done, run init_pp_threads.csh to compile all of the utilities and cubep3m itself. It's a good idea to check the memory footprint of each of the codes that will run in cubep3m.pbs with the size command to make sure they are not too large for the amount of memory on the nodes.

Running the Job

 qsub cubep3m.pbs

Then have a tasty beverage and watch it churn out logfiles in the batch directory (if all goes well).

Using the github Repository

A github (https://github.com/jharno/cubep3m) repository has been created in order to centralize the modifications made to CubeP3M by its many contributors. Since CubeP3M is a very promising code, we believe that centralizing the effort in that repository makes life easier for everyone and grants access to the latest (or any previous) version of the code. Ideally, we would like the main code developers to commit their modifications to the repository regularly, so that it is always up to date. There is good documentation on the github website covering the (free) creation of a github account, plus the basic steps to set up access and obtain the code.

Updates

If you want to make sure you are working with an up-to-date version of the code, from your own working branch, enter

 git pull origin

Only a few files will be transferred to you.

Commit changes

Once you have modified the code, you are encouraged to submit your changes to the repository. To find out what has been modified since your last update, type:

 git status

You should see which files have been modified. You can choose which files you want to commit to the repository by entering, for each of these:

 git add /path/newfile

Once you are ready, commit your changes:

 git commit -m "Descriptive comments of your modifications"

and send them to the world:

 git push origin

From now on, every user who does an update or a checkout will receive your revisions.

Conflict Resolution

The whole point of github is to ensure that every user has a recent copy of the code and that everyone's developments are compatible with each other. For example, when you try to push your code you might receive a warning telling you that your copy is out of date. That means that somebody else committed changes to the repository between your last update and now. The warning should tell you what is out of date; enter 'git pull origin' from your code working directory, make sure that your modifications are compatible with the newer version, and then commit and push your code again.

If it so happens that the other person was working on the same files as you were, you will need to compare the two versions of each file and decide what the final file should look like. You might want to contact the person who submitted these modifications beforehand.

Hybrid N-body + MHD

To run the code with a baryon fluid, one needs to include the mpi_tvd_mhd.f90 module at a few places in the code. This has been incorporated in a new makefile, makefile_mhd, and the cosmology can be modified within the 'parameters' file via 'omega_b'. In addition to incorporating the MHD module, this makefile has extra flags that make sure the N-body particles do not drift with respect to the baryons at each time step: the -DDISP_MESH flag is complemented with a -DMOVE_GRID_BACK flag. The MHD force also needs to be calculated in a way consistent with the dark matter, hence the coarse force interpolation is CIC and the fine force is NGP.

Initial conditions

The initial condition generator for hybrid simulations reads a single transfer function from CAMB, but assigns the dark matter and baryon parts from the corresponding columns. One has to replace dist_init_dm.f90 with dist_init_hybrid.f90 when compiling the initial conditions. Also be aware that, in order for the hybrid code to require no extra memory, it caches more data to disk, so the I/O is about 50% larger.


Coming soon

See the to-do list on the developers page

CMB Coupling

Baryons can also be coupled to the CMB by enforcing a temperature over-ride. This needs to be specified in both the initial conditions and in the fine_velocity_mhd calculations. This is still under development, and the next stage is to add the density dependence of the coupling between CMB and baryons.

CUBEP^3M + GPU

The pp interactions, which are the main bottleneck in the calculations as soon as massive clusters begin to form, can be sped up by computing the pairwise force on a GPU machine.

B-field generations

The origin of large, coherent magnetic fields in the inter-galactic medium is still uncertain. We are in the process of incorporating B-field options into the hybrid N-body + MHD code, sourced at run time at the centers of sufficiently massive halos. Primordial B-field generation is trivial and currently available.

Testing the Code

Over the years, many tests have been performed by many groups to assess the systematic uncertainties related to CubeP3M.

Force Tests

The most useful are probably measuring the force of gravity between two particles randomly placed on the mesh, and between a particle and a full grid. In order to activate these tests, one first has to set 'cosmo = .false.' in the cubepm.par file, which shuts down the cosmological time evolution. Then one has to set either 'pairwise_test = .true.' or 'superposition_test = .true.' in the same file, as sketched below.
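
A hypothetical excerpt of the relevant cubepm.par settings for a pairwise force test (check your copy of the file for the exact declarations and defaults):

  logical, parameter :: cosmo              = .false.  ! switch off the cosmological evolution
  logical, parameter :: pairwise_test      = .true.   ! two-particle force test
  logical, parameter :: superposition_test = .false.  ! particle-vs-grid ("hole") force test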

Pairwise Force Test

The code places two particles on the grid and computes the force between them, as well as deviations from theory in the radial and transverse direction to the pair.

Density (hole) Force Test

The code reads in a particle list and, at the next checkpoint, stores the force on all particles, removes a particle, and computes the force again. By the superposition principle, the force difference is equal to the interaction between all objects and the 'hole'. For this test to run, one has to switch off the random offset between the particles and the mesh, otherwise distances are also randomized (this is done by removing the -DDISP_MESH flag). Therefore, the best way to enable this test is to restart a simulation that already has a particle dump and evolve it until a later checkpoint. This test produces a massive number of pairs, so one should carefully select the size of the simulation in which it is performed.

General Comments when Testing

Generally speaking, one can get a good sense of the simulation's accuracy by looking at the power spectrum and halo mass functions. For developers: good advice is to test your code modifications against a template, using the same initial conditions and random seeds. The latter can be done with the -DREAD_SEED flag in the makefile.

More Options

Here is a miscellaneous set of options available when running the code. These are generally set in the cubepm.par file that resides in the 'source_threads' directory. A hypothetical summary of them as parameter declarations is given at the end of this section.

max_nts

Maximum number of time steps allowed. The default is 4000, but it can be set higher or lower.

ra_max

Constrains the maximum time step from the expansion. The default is 0.05. Smaller values place tighter constraints, and produce longer but more accurate runs.

pp_range

Specifies the range of the pp force outside the sub-grid level. pp_range = 1 means the pp force is active up to a layer of one fine mesh cell about each particle.

wde

Dark energy equation of state. Default is -1.0.

restart_ic

Activates a restart if set to .true. One then needs to specify the line in the checkpoint file at which the restart is desired.

rsoft

Softening length, below which no force is active. Default is 0.1 fine mesh cell.
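
As a summary, the options above might appear in cubepm.par roughly as follows (types and defaults are taken from the descriptions on this page; check your copy of the file for the authoritative declarations):

  integer, parameter :: max_nts    = 4000     ! maximum number of time steps
  real,    parameter :: ra_max     = 0.05     ! expansion constraint on the time step
  integer, parameter :: pp_range   = 1        ! extended pp force range, in fine mesh cells
  real,    parameter :: wde        = -1.0     ! dark energy equation of state
  logical, parameter :: restart_ic = .false.  ! set to .true. to restart from a checkpoint
  real,    parameter :: rsoft      = 0.1      ! softening length, in fine mesh cells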