COSMA8

Using Cosma8

COSMA8 has 2 login nodes, accessed via login8.cosma.dur.ac.uk

COSMA8 has 528 compute nodes, each with 1TB RAM and 128 cores (360 nodes have 2x AMD 7H12 processors and 168 have 2x AMD 7763 processors)

There are 2 high RAM (4TB) fat nodes, which should be accessed via the cosma8-shm queue.

There are a number of GPU-enabled servers (see below), a 1TB AMD Milan test node and a 1TB Milan-X test node.

There are six relevant SLURM queues:

cosma8: provides exclusive access to nodes, shared with cosma8-serial

cosma8-serial: provides non-exclusive access to nodes. Use this if you want fewer than 128 cores (and remember to specify your memory requirement too; see the example after this list)

cosma8-rome: A subset of cosma8, 360 nodes with Rome processors

cosma8-milan: A subset of cosma8, 168 nodes with Milan processors

cosma8-shm: access to the mad04 and mad05 servers, each with 4TB RAM. This queue is also non-exclusive, so nodes may be shared with other users if you don’t require all 128 cores or all 4TB RAM.

cosma8-shm2: access to the ga004 server with a Milan 7703 processor and MI100 GPU.
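As an illustration of using the non-exclusive queues, the following batch directives request part of a node; the core and memory figures below are placeholder values, not recommendations:

#SBATCH -p cosma8-serial          # non-exclusive queue
#SBATCH --ntasks=32               # fewer than 128 cores
#SBATCH --cpus-per-task=1
#SBATCH --mem=256G                # state your memory requirement explicitly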

Useful information

Numerical libraries

OpenBLAS

OpenBLAS is available via the openblas modules.

GSL

The GNU Scientific Library can be accessed via the gsl modules.
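As a sketch (exact module versions differ; check module avail), these libraries can be loaded and linked along the following lines:

module load openblas              # pick a version listed by 'module avail openblas'
module load gsl
gcc myprog.c -o myprog -lopenblas -lgsl -lgslcblas -lm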

Using MKL with COSMA8

MKL: The Intel Math Kernel Library is known to be hobbled on AMD systems.

MKL is available via the intel_comp and oneAPI modules.

There is a fix that must be applied:

By default, MKL (the Intel Math Kernel Library) does not select the best code paths when used on COSMA8, delivering significantly lower performance. For versions of MKL prior to 2020, setting MKL_DEBUG_CPU_TYPE=5 would force it to use the Zen 2 code path. However, for newer versions this no longer works. Instead, the following workaround should be used:

# Create a one-line shim library that makes MKL believe it is running on an Intel CPU
cat <<EOF > amdmkl.c
int mkl_serv_intel_cpu_true() { return 1; }
EOF
gcc -shared -fPIC -o libamdmkl.so amdmkl.c

# Preload the shim so MKL picks it up at run time
export LD_PRELOAD=libamdmkl.so

Permanent fix

Rather than remembering to set LD_PRELOAD every time you run your application, you can embed the dependency in your binary using:

patchelf --add-needed libamdmkl.so yourbinary

This workaround is based on externally published advice on using MKL with AMD processors.

Compilers

The recommended compiler depends on how well your application fares with each one. See the PDFs below (under Tuning information) for recommended compiler options on AMD systems; wisdom about the best compilers for particular codes is also collected under Known code issues. The available compilers are:

icc

intel_comp/2018 - generally stable

intel_comp/latest - possibly better optimisations

oneAPI - The newest versions of the Intel compiler, aliased to intel_comp

gcc

gnu_comp/ - versions 10.2 and 11.1 know about the Zen 2 architecture, so will generate better-optimised code (see the sketch after this list)

aocc (AMD optimised compiler collection)

aocc/ - the AMD Optimised Compiler Collection - based on LLVM, use the latest version

llvm

Available via the llvm modules

pgi

Available via the pgi modules. This may no longer work - if you have problems, please contact us.
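As a minimal sketch of targeting the AMD hardware explicitly (the module version and flags below are illustrative assumptions, not recommendations):

module load gnu_comp/11.1.0                          # illustrative version; check 'module avail gnu_comp'
gcc -O3 -march=znver2 -fopenmp myprog.c -o myprog    # Zen 2 (Rome) nodes
# use -march=znver3 for the Milan (Zen 3) nodes with a sufficiently new compiler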

MPI Modules

OpenMPI

It is usually best to use the newest openmpi module. A version with a .no-ucx suffix (e.g. openmpi/4.1.1.no-ucx) may offer more stable performance in some cases, although this is usually not necessary with the newest modules.

Large jobs may suffer from performance issues. This can sometimes be resolved by selecting the UD transport over the newer DC (dynamically connected) transport by setting:

export UCX_TLS=self,sm,ud

or

export UCX_TLS=self,sm,ud,rc,dc

in the job script.

If OpenMPI complains about running out of resources (memory pools being empty), the following may help:

export UCX_MM_RX_MAX_BUFS=65536
export UCX_IB_RX_MAX_BUFS=65536
export UCX_IB_TX_MAX_BUFS=65536

(or some larger value).

UCX settings can be seen with: /cosma/local/ucx/1.10.1/bin/ucx_info -f

For Gadget-4, setting export UCX_UD_MLX5_RX_QUEUE_LEN=16384 has also been shown to help.

Intel MPI

The 2018 module is the fallback option for SWIFT.

Later versions use UCX underneath, and initially suffered from stability issues. However, the newest versions are much improved. These are loaded using:

module load intel_comp/2024.2.0
module load compiler-rt tbb compiler mpi
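Once these modules are loaded, a typical compile-and-run sequence looks something like the following sketch (wrapper names vary between Intel MPI releases; mpiicc/mpiifort are the classic wrappers, with newer oneAPI releases also providing mpiicx/mpiifx):

mpiicc -O3 myprog.c -o myprog            # or mpiifort for Fortran
mpirun -np $SLURM_NTASKS ./myprog        # inside a SLURM allocation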

Mvapich

The mvapich module can sometimes offer improved performance, although in some cases RAM usage is higher.

Tuning information

Options for compiling on AMD systems

Tuning for AMD systems

GPUs

A number of GPU servers are accessible - please ask if you are unsure how to use these:

gn001: 10x NVIDIA V100 GPUs

ga003: 6x AMD MI50 GPUs (now retired)

ga004: 1x AMD MI100 GPU, 2x 64 core AMD Milan processors.

ga005, ga006: 2x AMD MI200 GPUs, 2x 32 core 7513 EPYC processors

login8b, mad04, mad05, mad06: between 0 and 3 NVIDIA A100 GPUs each (reconfigurable/moveable as required; please ask if you have a particular setup you wish for)

SWIFT

The current recommended setup is given on the swift pages.

Arm Forge

Allinea Arm Forge and MAP (used for code profiling) are available using the allinea/ddt/23.1.0 module.
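As a sketch of collecting a MAP profile non-interactively (the process count and binary name are placeholders):

module load allinea/ddt/23.1.0
map --profile mpirun -np 128 ./my_app    # writes a .map file that can be opened in the Forge GUI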

Profiles collected during the commissioning period are available in the commissioning report.

SLURM batch scripts

Example batch scripts are available in the COSMA documentation.
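As an illustrative sketch only (the project account, module names, sizes and binary name are placeholders to be replaced with your own), a COSMA8 batch script might look like:

#!/bin/bash -l
#SBATCH -p cosma8                 # exclusive-node COSMA8 queue
#SBATCH -A <your_project>         # placeholder: your DiRAC project account
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128     # all 128 cores of each node
#SBATCH -t 04:00:00
#SBATCH -J my_job

module purge
module load gnu_comp openmpi      # placeholders: load the modules you compiled with

mpirun -np $SLURM_NTASKS ./my_app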

COSMA8 FAQ

The COSMA8 FAQ details some of the known issues or peculiarities related to COSMA8.

How do I submit to COSMA8?

You must be part of a DiRAC project that has an allocation on COSMA8, and be part of the cosma8 group (use the id command to see this). If you are not part of the cosma8 group, but should be, please contact cosma-support. To submit a job script, you need to use -p cosma8.
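For example (myjob.sh is a placeholder for your own batch script):

id | grep cosma8                  # check that cosma8 appears among your groups
sbatch -p cosma8 myjob.sh         # submit to the COSMA8 partition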

Jobs failing with UCX ERROR: ivb_reg_mr

Try setting the following option in your SLURM batch script (note, this may impact performance):

export UCX_IB_REG_METHODS=direct

Jobs failing with Bus Error while writing files

This is potentially due to a bug in Lustre 2.12.6. A possible solution is to reduce memory usage. We cannot yet upgrade to newer versions of Lustre, as these have been tested and found to be unstable on COSMA (including 2.12.8 and 2.12.9).

Please let us know if there is something you would like added.

Known code issues

Collective wisdom is available on running particular codes on COSMA8.