COSMA Facilities

COSMA is comprised of:

  • COSMA5,

  • COSMA7,

  • COSMA8,

  • Storage and support hardware (including tape drives for archival).

COSMA also includes a number of other facilities:

  • the DINE clusters

  • the Cosmological Database

  • GPU systems

  • Other facilities

Storage includes:

  • /cosma5 - 2.4PB

  • /cosma6 - 2.5PB (to be retired in Autumn 2023)

  • /cosma7 - 3.1PB

  • /cosma8 - 16PB

  • /snap7 - 440TB for fast temporary storage (e.g. checkpointing)

  • /snap8 - 1.2PB for fast temporary storage (e.g. checkpointing)

  • /cosma/home - 37TB for user homespace (10GB quota)

  • /cosma/local - 37TB for program/module storage

Users have access to /cosma/home/PROJECT/USERNAME, and one or more of /cosmaN/data/PROJECT/USERNAME.
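As an illustrative sketch (the project name dp004 below is just a placeholder, and this assumes the standard Lustre lfs tool is available on the login nodes), you can inspect these spaces and check your quota with:

ls /cosma/home/dp004/$USER           # home space (10GB quota)
ls /cosma7/data/dp004/$USER          # data space on one of the Lustre file systems
lfs quota -h -u $USER /cosma7        # report your usage and quota on the /cosma7 Lustre file system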

COSMA5

COSMA5 is comprised of two generations of server:

  • The original COSMA5 system dating from 2012 has approximately 160 compute nodes remaining, each with 128GB RAM and 16 cores (2x Intel Xeon CPU E5-2670 0 @ 2.60GHz). These have a Sandy Bridge architecture.

    • Please submit to the cosma queue to use these nodes.

  • The new COSMA5 system dating from 2024 has 3 compute nodes, each with 1.5TB RAM and 256 cores (2x AMD Bergamo processors). These have a Zen-4 (Genoa) architecture.

    • Please submit to the cosma5 queue to use these nodes.

The nodes are connected via Mellanox FDR10 InfiniBand switches in a 2:1 blocking configuration. Access is enabled through three login/development servers: two for the old system, each with 512GB of memory (login5a, login5b), and one for the new system with 768GB RAM (login5c). Previously, there was a 2.5PB GPFS file system (DDN ExaScaler). However, in early 2020 this was replaced by a 650TB Lustre system (Dell), later updated to 1.6PB, replacing 2.5 racks of equipment drawing 26kW with 1/6th of a rack drawing 1.5kW.

The new COSMA5 nodes were funded by a Durham University carbon reduction fund, and a donation from Dell and AMD, replacing 48 old nodes in 2024.

The new COSMA5 nodes are not exclusive: jobs share these nodes with other jobs running simultaneously, unless a job explicitly requests all of the cores in a node (256).
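As a minimal sketch of a job script for the new nodes (the account name dp004, run time and binary name are placeholders), requesting all 256 cores ensures the node is not shared:

#!/bin/bash -l
#SBATCH -p cosma5                 # the new (2024) nodes; use -p cosma for the original 2012 nodes
#SBATCH -A dp004                  # placeholder account/project name
#SBATCH -t 01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=256     # asking for all 256 cores avoids sharing the node with other jobs

mpirun -np $SLURM_NTASKS ./my_code   # my_code is a placeholder for your own executable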

COSMA6

COSMA6 has now reached end of life and was retired in April 2023, after 11 years of operation (first at Daresbury, then as part of COSMA). It had about 600 compute nodes, each with 128GB RAM and 16 cores (2x Intel Xeon CPU E5-2670 0 @ 2.60GHz). COSMA6 therefore included about 10,000 compute cores, based on the Sandy Bridge architecture.

The nodes were connected via Mellanox FDR10 Infiniband switches in a 2:1 blocking configuration. Access was enabled through a login/development server with 512GB of memory. A Lustre storage system configured with 2.6 PB of data space was available.

COSMA7

COSMA7 has 448 compute nodes, each with 512GB RAM and 28 cores (2x Intel Xeon Gold 5120 CPU @ 2.20GHz), giving it 12544 cores in total. The CPUs are based on the Skylake architecture.

A single job can only span half of COSMA7. This is because one half uses a Mellanox EDR InfiniBand fabric, while the other half (224 nodes) uses a Rockport switchless Ethernet fabric.

You should use the cosma7 partition for InfiniBand, and cosma7-rp for Rockport.
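For example, in a batch script (a sketch; the account name is a placeholder):

#SBATCH -p cosma7      # InfiniBand half
#SBATCH -A dp004       # placeholder account/project name

or, for the Rockport half:

#SBATCH -p cosma7-rp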

The Rockport fabric forms a 6D torus. The InfiniBand fabric has a fat tree topology with a 2:1 blocking configuration.

Access is enabled through three login/development servers, each with 1.5TB of memory. Currently connected is a Dell storage system configured with 3.1PB of data space (/cosma7), and a fast I/O storage system with 459TB (/snap7) for temporary checkpointing. Both file systems are based on Lustre.

COSMA8

The COSMA8 prototype system entered service in October 2020. It was then extended in early 2021 to become the initial COSMA8 installation. A phase-2 extension was then installed in 2023, entering service in October 2023 with 528 nodes. It is the first COSMA system to reach petascale, with ~2PF on the HPL benchmark (Rmax) and a ~3PF peak performance. Each node delivers at least 4.2 TFLOPS of HPL performance.

It is currently comprised of:

  • 528 compute nodes with 1 TB RAM and dual 64-core AMD EPYC water-cooled processors

    • Of which, 360 compute nodes are Rome 7H12 processors at 2.6GHz

    • and 168 nodes are Milan 7763 processors at 2.45GHz

  • 2 login nodes with 2 TB RAM and dual 32-core AMD EPYC 7542 processors at 2.9GHz

  • 2 fat nodes with 4 TB RAM and dual 64-core AMD EPYC 7702 processors at 2.2GHz

  • 1 AMD GPU node with 6 MI50 GPUs (32GB), 1TB RAM, dual 16-core AMD EPYC 7282 processors at 2.8GHz

  • 1 AMD Milan node with a MI100 GPU, 1TB RAM, dual 64-core AMD EPYC Milan 7713 processors at 2GHz

  • 2 AMD Milan nodes with 2x MI200 GPUs each, 1TB RAM, dual 64-core processors

  • 1 NVIDIA GPU node with 10 V100 GPUs (32GB), 768GB RAM, dual Intel Xeon Gold 5218 processors at 2.3GHz

  • 2 console nodes with a single 16-core AMD EPYC 7302 processor at 3GHz and 256GB RAM

The interconnect is Mellanox HDR, 200GBit/s, with a non-blocking fat tree topology.

Information about using COSMA8 can be found here.

The list of people who helped assemble COSMA8 is given below.

COSMA8 builders

The following people helped build COSMA8:

Yannick Bahe

Alastair Basden

Elijah Basden

Miriam Basden

Peter Draper

Aqeeb Hussain

Yuankang Liu

Fawada Qaiser

Richard Regan

Paul Walker

(if you are not on this list but should be, please let us know, apologies!)

DINE

The Durham Interconnected Novel Environment (DINE) supercomputer is a small 24-node development cluster equipped with NVIDIA BlueField-2 Data Processing Units (DPUs) using a non-blocking HDR200 fabric. These DPUs enable direct access to remote memory to improve the performance of massively parallel codes, in preparation for future exascale systems, and will provide researchers with a test-bed facility for the development of new and novel computing paradigms.

The cost of data movement - both runtime and energy - is predicted to be one major showstopper on our road to exascale. As the computers driving data centres, supercomputers and machine learning farms become faster, their interconnects, i.e. communication devices, grow into a limiting factor; even worse, they also face increasing unreliability at scale. One way to improve them is to make them smart: to make them learn how to route data flows, how to meet security constraints, or even to deploy computations into the network. Smart network devices can take ownership of the data movement, bring data into the right format before it is delivered, take care of security and resiliency, and so forth.

DINE has been funded by DiRAC, ExCALIBUR, the Department of Computer Science and the Institute for Computational Cosmology as part of a strategic research equipment purchase.

Using DINE

Please see DINE notes for information about using DINE.

DINE notes

Using BlueField

A DPU hackathon was held in February 2023, which included many useful tips on how to use DINE. A presentation providing usage examples (in particular the lab exercises on slides 45 and 84) is available from cosma-support; please request UK_DPU_Hackathon.pptx.

Network information

DINE has several networks:

Command and control network

Used for login, SLURM submission, etc. Both the hosts and BlueField cards are connected to this network. You can specify these as:

  • b[101-124] for the hosts, and

  • bluefield[101-124] for the cards

This network is accessible from the login nodes.

InfiniBand network

The high performance (200GBit/s HDR) fabric used for inter-node communication and some file system access. Both the hosts and BlueField cards are connected to this network. You can specify these as:

  • bfd[101-124].ib for the hosts

  • bfh[101-124].ib for the cards

This fabric is not accessible from the login nodes: it is only accessible to other DINE nodes (hosts and BlueField cards).

Local BlueField card access

From a host, bfl will provide access to the attached card over a slow internal network.

SLURM submission

To submit to DINE, you need to belong to the “durham”, “do008” or “do009” group, and submit with directives such as:

#SBATCH -p bluefield1

#SBATCH -A durham

or

#SBATCH -A do008

etc
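Putting this together, a minimal job script might look like the following sketch (the time limit and application name are placeholders):

#!/bin/bash -l
#SBATCH -p bluefield1
#SBATCH -A do009            # or durham / do008, depending on your project
#SBATCH -t 00:10:00
#SBATCH --nodes=1

./my_application            # placeholder for your own code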

SLURM submission to both host (x86) and BlueField device (ARM) cores

To submit a job that will run across both host (x86) and BlueField (ARM) cores, the following procedure can be used (courtesy of Mark Turner).

A SLURM script such as:

#!/bin/bash -l
#SBATCH -o smartmpi.out
#SBATCH -e smartmpi.err
#SBATCH -p bluefield1
#SBATCH -A durham
#SBATCH -t 00:30:00
#SBATCH --nodes=2

module purge
module load python/3.6.5

# Get a comma separated list of IPs for the host and
# Smart NICs that SLURM has assigned us
IPs=$( python3 smartmpi/scripts/dine_config.py )
echo "IPs in use: " $IPs

# Assumes alternating topology with 2 ranks per node
# (one on x86; one on arm64)
np=$(( $SLURM_JOB_NUM_NODES * 2 ))
echo "Num processes: " $np

# Prevent SLURM from blocking the use of Smart NICs
unset SLURM_JOBID

mpirun --mca btl_tcp_if_exclude tmfifo_net0,lo,em1 -host $IPs -np $np launcher_script.sh

Where the dine_config.py file is defined as:

import os

def extract_nodes(nodes):
    # Expand a node list such as "101,103-105" into individual node numbers.
    for node_entry in nodes.split(','):
        elem = node_entry.split('-')
        if len(elem) == 1:
            yield int(elem[0])
        elif len(elem) == 2:
            node_range = list(map(int, elem))
            for i in range(node_range[0], node_range[1]+1):
                yield i
        else:
            raise ValueError('format error in %s' % node_entry)

def print_ips(node_list):
    # Print a comma-separated list of host and BlueField IPs for the allocated nodes.
    node_ips = []
    for node in extract_nodes(node_list):
        basenumber = 200
        ip_elem = node - basenumber
        node_ips.append(f"172.18.178.{2*ip_elem-1}")
        node_ips.append(f"172.18.101.{2*ip_elem}")
    print(",".join(node_ips))


"""
This script is intended for use on the DINE cluster. It should be used within SLURM
jobs before the mpirun command.

It prints the comma separated IPs for the x86 hosts and arm64 Smart NICs allocated
to us by SLURM. In SLURM I capture this stdout within a variable and pass it to
the `-host` argument to mpirun when not using a rankfile.

For full documentation on smarTeaMPI on DINE, see docs/.
"""
if __name__ == "__main__":
    # node_list[2:-1] strips the leading "b[" and trailing "]" from e.g. "b[101-104]".
    node_list = os.environ["SLURM_JOB_NODELIST"]
    print_ips(node_list[2:-1])

And the launcher script is (assuming Peano is the code to be run):

#!/bin/bash -l

# Select the binary and thread count according to whether this rank is running
# on a BlueField card (ARM) or on a host node (x86).
case "$HOSTNAME" in
"bluefield"* )
    export OMP_NUM_THREADS=16
    ./Peano_bfd/examples/exahype2/euler/peano4 --timeout 300
    ;;
"b1"* )
    export OMP_NUM_THREADS=32
    ./Peano/examples/exahype2/euler/peano4 --timeout 300
    ;;
esac

These scripts can be found in /cosma/home/sample-user/dine/

Intel MKL

DINE is an AMD system, and Intel MKL is known to perform poorly on non-Intel processors. There are some workarounds to improve performance.

Intel Compiler

If you compile with -xHost, you may get:

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT and AVX instructions.
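One common approach (a sketch only, not official guidance for this system) is to avoid -xHost on AMD hardware and target the instruction set explicitly, and, for affected MKL releases, to force MKL onto its optimised AVX2 code path:

icc -O3 -march=core-avx2 mycode.c -o mycode   # mycode.c is a placeholder; targets AVX2 rather than dispatching on CPU vendor

# For MKL 2019 and earlier, this (undocumented) variable forces the AVX2 path on AMD CPUs;
# it was removed in later MKL releases, which need an LD_PRELOAD shim instead.
export MKL_DEBUG_CPU_TYPE=5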

The system

The Durham Interconnected Novel Environment (DINE) supercomputing facility is hosted alongside COSMA, and is used by Computer Science researchers, DiRAC researchers and international collaborators.

A key feature of DINE is the NVIDIA BlueField smart NIC cards which provide a programmable network offload capability, allowing network functions to be accelerated, and freeing up compute cores for other tasks.

DINE is comprised of 24 nodes each containing:

  • Dual 16-core AMD EPYC 7302 ROME processors (3GHz)

  • 512GB RAM

  • BlueField-2 Smart NIC (200 GBit/s HDR200)

    • These contain 16GB RAM and 8 high-clock ARM cores, and run Ubuntu 20.04

  • NVIDIA HDR200 InfiniBand switch

Students will also benefit from working with cutting-edge technologies, designing algorithms and investigating ideas which will be carried forward into future UK and international facilities.

BlueField

Access

Access is available free of charge to the UK research community. High priority will be given to developmental and fundamental Exascale research (no production runs).

Students will also be given access and hence benefit from working with cutting-edge technologies. This will help them to design algorithms and investigate ideas which will be carried forward into future UK and international facilities.

We are willing to give collaborators and external scientists access to the system as well to allow them to prototype novel algorithms and write new software using smart network devices.

To get access, please follow these instructions to apply for an account, signing up to project do009, and then send a message to cosma-support@durham.ac.uk mentioning your interest in BlueField.

COSMA8 login nodes should be used for compiling code (these also have AMD Rome processors). Where native ARM access is required, please create a Slurm job to run on the bluefield1 partition, and then ssh directly to the local BlueField card.

The SLURM workload manager should then be used to submit jobs to the other nodes, using the bluefield1 queue.
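As a sketch of this workflow (the node number and run time are illustrative):

salloc -p bluefield1 -A do009 -t 01:00:00 --nodes=1   # interactive allocation on DINE
ssh b101                                              # the host node that was allocated (illustrative)
ssh bfl                                               # from the host, log in to its attached BlueField card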

It should be noted that DINE has automatic powersaving features - unused nodes will be powered off after 1 hour. When a Slurm job is submitted, these nodes will be powered on if necessary, which can take a few minutes.

Hints and tips for usage

The Ethernet network (control, ssh, slurm) has:

  • Host nodes, b[101-124]: 172.17.178.[201-224]

  • BlueField cards, bluefield[101-124]: 172.17.179.[201-224]

InfiniBand is reached via:

  • bfh[101-124]: 172.18.178.[201-224] for the hosts

  • bfd[101-124]: 172.18.179.[201-224] for the BlueField cards

The BlueField cards (devices) operate in “Host Separated” mode, meaning that they can be treated as servers in their own right (running Ubuntu), and MPI jobs can run on both host and device.

Currently, manual mpirun calls are necessary to specify the hosts and devices to use.
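For example (a sketch with illustrative IP addresses; see the fuller example in the DINE notes above):

mpirun --mca btl_tcp_if_exclude tmfifo_net0,lo,em1 -host 172.18.178.1,172.18.101.2 -np 2 ./my_application   # my_application is a placeholder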

If you have any hints that you would like to appear here, please let us know!

Associated Projects

ExaClaw - Clawpack-enabled ExaHyPE for heterogeneous hardware

http://www.peano-framework.org/index.php/projects/exaclaw-clawpack-enabled-exahype-for-heterogeneous-hardware/

Durham project funded by EPSRC under the ExCALIBUR programme.

ExaHyPE - an Exascale Hyperbolic PDE Engine

www.exahype.org

EU H2020 FET HPC project with partners from Munich (Technische Universität München and Ludwig-Maximilians-Universität), Trento and Frankfurt.

Publications

Acknowledgement

This work has used Durham University’s DINE cluster. DINE has been purchased through Durham University’s Research Capital Equipment Fund 19_20 Allocation, led by the Department of Computer Science. It is installed in collaboration and as addendum to DiRAC@Durham facility managed by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). DiRAC equipment was funded by BEIS capital funding via STFC capital grants ST/P002293/1, ST/R002371/1 and ST/S002502/1, Durham University and STFC operations grant ST/R000832/1. DiRAC is part of the National e-Infrastructure.

DINE-2

Installed in 2024, DINE-2 is a composable cluster with 8 nodes and 8 GPUs that can be composed onto them, based on a Cerio composable fabric.

To submit to the DINE-2 system, use the dine2 SLURM partition.
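For example, in a batch script (a sketch; the account is a placeholder):

#SBATCH -p dine2
#SBATCH -A do009     # placeholder; use the project you have been granted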

The Cosmological Database

The Cosmological Database is a collection of database-stored cosmological information, accessible through different media.

This includes:

  • the Virgo Database

There will also shortly be a SciServer instance available for testing.

Rockport cluster

The Rockport cluster uses half of COSMA7, replacing the InfiniBand network with a switchless Rockport Ethernet network, based on a 6D torus topology.

Each node has 28 cores and 512GB RAM, and has the /cosma7 storage (3.1PB) attached.

For usage information please see here.

Other facilities

Other facilities available to COSMA users (depending on membership of certain projects, please ask if you’re uncertain!) include:

  • mad01 - a 3TB system with 56 cores

  • cosma7-shm queue: mad02, 1.5TB with 112 cores

  • cosma7-shm2 queue: mad03, 6TB with 48 cores

  • cosma8-shm queue: mad04 and mad05, each with 4TB RAM and 128 cores

  • cosma8-shm2 queue: ga004, 1TB with 128 cores and MI100 GPU
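As a sketch of a large shared-memory job on one of these machines (the account, run time and binary name are placeholders):

#!/bin/bash -l
#SBATCH -p cosma8-shm        # mad04/mad05: 4TB RAM, 128 cores
#SBATCH -A dp004             # placeholder account/project name
#SBATCH -t 02:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=128

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_code           # placeholder for your own executable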