MPI Hints

This page gathers together various hints for using MPI on COSMA. See also the Code-specific information page for code-specific hints.

Standard module combinations are:

intel_comp/2018 and intel_mpi/2018

intel_comp/2025.3.0 (or other latest version) and then (following the instructions printed out) compiler-rt tbb umf compiler mpi.

Note, when using the newer intel_comp modules, you first load these, which put some more modules on the search path, allowing you to then “module load compiler mpi” to load the compiler and mpi (compiler-rt and tbb may also be needed, depending on version).

oneAPI/* (which is an alias for intel_comp)

gnu_comp and openmpi (various versions)

OneAPI / Intel 2024 Compiler

There are modules to run Swift with the Intel 2024.2.0 compiler and Intel MPI or OpenMPI 5.0.3. To use them, load the compiler:

module load intel_comp/2024.2.0 compiler-rt tbb compiler

Pick Intel MPI or OpenMPI:

module load mpi

OR

module load openmpi/5.0.3

Pick an optimized FFTW:

  • (Cosma5): module load fftw/3.3.10

  • OR (Cosma7): module load fftw/3.3.10cosma7

  • OR (Cosma8): module load fftw/3.3.10cosma8

and load the other dependencies

module load parallel_hdf5 parmetis gsl

(note, there may be specific version of these that improve performance, check using the module avail command)

The Intel modules seem to have a bug which prevents them from unloading properly. That’s a problem because it means you can’t use ‘module purge’ in scripts to get to a known state.

mvapich

This is not regularly tested.

    module purge
    module load intel_comp/2018 mvapich2 fftw/3.3.7
    module load parallel_hdf5/1.10.3 parmetis/4.0.3
    module load gsl/2.4

Note, the fftw build is optimised for avx512 / cosma7.

Compiling openMPI with ROCm

Some success has been achieved using the following instructions on a node with ROCm installed (one of the AMD GPU nodes):

  • Compiling UCX

../configure --prefix=/cosma/apps/PROJECT/USER/ucx-1.19.0 --with-rocm=/opt/rocm --without-knem --without-cuda --enable-mt

  • Compiling UCC

../configure --prefix=/cosma/apps/PROJECT/USER/ucc-1.6.0 --with-rocm=/opt/rocm --with-ucx=/cosma/apps/PROJECT/USER/ucx-1.19.0

  • Compiling OpenMPI

../configure --prefix=/cosma/apps/PROJECT/USER/openmpi-5.0.9 --with-rocm=/opt/rocm --with-ucx=/cosma/apps/PROJECT/USER/ucx-1.19.0 --with-ucc=/cosma/apps/PROJECT/USER/ucc-1.6.0 --without-cuda

Pinning when running hybrid MPI/OpenMP jobs

Performance improvements can often be achieved by specifying the mapping of MPI tasks to cores.

For both OpenMPI and Intel MPI, it is helpful to set:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=true

OpenMPI

For example, performance improvements have been achieved using:

mpiexec --map-by ppr:${SLURM_NTASKS_PER_NODE}:node:pe=${SLURM_CPUS_PER_TASK} --bind-to-core <executable> <options>

with 32 MPI ranks and 4 threads per rank.

The binding can be reported using –report-bindings.

Intel MPI

For Intel MPI, similarly:

export I_MPI_PIN_DOMAIN=omp:compact
export I_MPI_PIN_CELL=core # ignore SMT threads

The bindings can be reported by setting export I_MPI_DEBUG=4