GPUs
COSMA has a number of GPU systems, which are available for use. These are:
Direct login (from a login node)
gn001: 10x NVIDIA V100 GPUs
gn002: NVIDIA Grace-Hopper (ARM) system
gn004: NVIDIA H100 GPU on X86 platform
gn005: NVIDIA RTX PRO 6000 GPU on X86 platform
gi001: 2x Intel Ponte Vecchio GPUs (currently dead)
mad06: 0-3x NVIDIA A100 GPUs (1TB RAM)
ga008: AMD MI300A (4 GPUs, 500GB RAM)
cosma8-shm Slurm partition (non-DiRAC allocations)
mad04: 0-3x NVIDIA A100 GPUs (4TB RAM) - composed on request
mad05: 0-3x NVIDIA A100 GPUs (4TB RAM) - composed on request
cosma8-highram Slurm partition (DiRAC allocations)
mad04: 0-3x NVIDIA A100 GPUs (4TB RAM) - composed on request
mad05: 0-3x NVIDIA A100 GPUs (4TB RAM) - composed on request
cosma8-draper Slurm partition (DiRAC allocations for MI300X system)
ga007: 8x AMD MI300X GPUs
cosma8-dine2 Slurm partition (DiRAC allocations for the cosma8-dine2 cluster with A30 GPUs - composable)
gc[001-008]: 0-8x NVIDIA A30 GPUs, 0-4x NVIDIA V100 GPUs
cosma8-shm2 Slurm partition
ga004: 1x AMD MI100 GPU
ga005: 2x AMD MI210 GPUs
ga006: 2x AMD MI210 GPUs
dine2 Slurm partition
gc[001-008]: 0-8x NVIDIA A30 GPUs, 0-4x NVIDIA V100 GPUs
gracehopper Slurm partition
gn003: NVIDIA Grace-Hopper (ARM) system
mi300x, mi300x-prince partition: AMD MI300X system
ga007: 8x AMD MI300X GPUs
Retired
ga003: 6x AMD MI50 GPUs
Project codes
To use the GPUs which are in a Slurm partition, you must either be part of a DiRAC project with a current allocation on the system, or sign up to the following project codes in SAFE as part of the hardware lab:
do015: dine2 partition
do016: NVIDIA Grace Hopper GPUs, H100, cosma8-shm partition
do017: Intel GPUs (currently dead)
do018: AMD GPUs
DiRAC allocated projects will have higher priority.
To use a GPU available for direct login, just ssh to the node name given above.
GPU stats
| GPU | FP64 TFLOPS | RAM (GB) | Memory bandwidth (GB/s) |
|---|---|---|---|
| V100 | 7 | 32 | 900 |
| A100 | 9.7 | 40 | 1555 |
| A30 | 5.2 | 24 | 933 |
| H100 PCIe | 26 | 80 | 2000 |
| H100 SXM | 34 | 80 | 3350 |
| RTX PRO 6000 | 2 (126 for FP32) | 96 | 1790 |
| MI50 | 6.6 | 16 | 1000 |
| MI100 | 11.5 | 32 | 1200 |
| MI210 | 22.6 | 64 | 1600 |
| MI300X | 81.7 | 192 | 5300 |
| MI300A | 61.3 | 128 | 5300 |
| PVC | 52.4 | 128 | 3280 |
Using the composable A100 GPUs
We have 3 NVIDIA A100 (40GB) GPUs, which can be moved (by software, in seconds, in theory!) between mad04, mad05 and mad06, hence the variable number above. If you have a particular requirement, please contact cosma-support. The default configuration is one GPU each (mad04,05,06). These GPUs are part of a composable PCIe fabric using a Liqid infrastructure funded as part of ExCALIBUR. It is a good idea to add the nvidia-smi command to your batch script so that you can check that the GPUs are present.
You can use the --nodelist or --exclude Slurm parameters within your batch script to request or avoid particular nodes. Alternatively, to be given a node with a GPU (within the composable partition), you can use #SBATCH --constraint=gpu.
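As a minimal sketch of a batch script for the composable A100 nodes: the partition, project code and constraint are taken from this page, while the time limit is an illustrative assumption. Use cosma8-highram and your own project code for a DiRAC allocation. The GPU check is guarded so that the job reports a missing driver instead of failing silently.

```shell
#!/bin/bash
#SBATCH --partition=cosma8-shm   # cosma8-highram for DiRAC allocations
#SBATCH --account=do016          # hardware lab project code; replace with your own
#SBATCH --constraint=gpu         # ask for a node that currently has a GPU composed
#SBATCH --time=00:10:00          # illustrative time limit (assumption)

# Record which GPUs are actually present before starting the real work.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
    gpu_status="checked"
else
    echo "nvidia-smi not found: no NVIDIA driver on this node?" >&2
    gpu_status="missing"
fi

echo "GPU check: $gpu_status"
```

Because the number of GPUs on mad04 and mad05 can change, the nvidia-smi output in the job log is the easiest way to confirm afterwards what the job actually ran on.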
Using the composable A30 and V100 GPUs
The DINE2 cluster has 8 nodes, 8x A30 GPUs and 4x V100 GPUs. The GPUs can be allocated to the nodes as required, depending on user workloads.
You can use the --nodelist or --exclude Slurm parameters within your batch script to request or avoid particular nodes. Alternatively, to be given a node with a GPU (within the composable partition), you can use #SBATCH --constraint=gpu.
GPU notes
For nodes not assigned to queues (mad06, gn001, gn002, gn004, gn005, ga008, gi001), please be aware that these are shared resources and that other people may be using (or may wish to use) them.
To use some of these GPUs, you may need to be in the “video” or “render” groups (use the id command to check which groups you are in). If you are not in them, but need to be, please ask.
To check that you have the correct permissions to submit to a partition, you can use the scontrol show partition=PARTITION_NAME command to see which groups are allowed to submit to that partition.
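The two checks above can be combined into one short script. This is a sketch only: the helper name has_group and the choice of the dine2 partition are illustrative, and it assumes the scontrol output contains an AllowGroups= field, as it normally does.

```shell
#!/bin/bash
# has_group NAME "LIST": succeed if NAME appears in the space-separated LIST.
# (Hypothetical helper name, used only for this example.)
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

my_groups=$(id -nG)   # names of all groups the current user belongs to

for g in video render; do
    if has_group "$g" "$my_groups"; then
        echo "member of $g"
    else
        echo "NOT a member of $g"
    fi
done

# Show which groups may submit to a partition (dine2 is only an example).
partition="dine2"
if command -v scontrol >/dev/null 2>&1; then
    scontrol show partition="$partition" | grep -o 'AllowGroups=[^ ]*'
fi
```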
The Intel PVC GPUs on gi001 are currently dead (they failed sometime in Autumn 2025). Watch this space - we may be able to resurrect them.
DINE2
The DINE2 GPU system is a composable system with up to 8 A30 GPUs per node. Currently the composition is static (i.e. if you require a specific GPU configuration, please ask), but eventually we hope to make it dynamic (i.e. to be able to ask Slurm to compose a system). In total, there are 8 A30 GPUs, and these can be composed in any configuration across the 8 servers.
Each server has dual 32-core Sapphire Rapids CPUs (64 cores per node) and 2TB RAM.
To use these nodes, submit jobs to the dine2 partition.
Grace-Hopper
One Grace-Hopper node (gn002) is currently available for direct ssh from a login node. A second node, gn003, is available in a Slurm queue, gracehopper.
Note, the Grace CPU has an ARM architecture, and therefore X86 binaries will not run. The NVIDIA and GCC compilers are available.
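Since the same home directory is visible from both X86 and ARM nodes, it can help to check the architecture before launching a binary. The following sketch does this with uname -m. The function name arch_label and the labels it prints are illustrative assumptions, not part of the COSMA setup.

```shell
#!/bin/bash
# arch_label MACHINE: map a `uname -m` string to a short label.
# (Hypothetical helper for this example.)
arch_label() {
    case "$1" in
        aarch64|arm64) echo "arm" ;;
        x86_64)        echo "x86" ;;
        *)             echo "unknown" ;;
    esac
}

label=$(arch_label "$(uname -m)")
echo "Node $(uname -n) has architecture: $label"

# A build or run script could then select a matching build directory,
# e.g. build-arm on the Grace-Hopper nodes and build-x86 elsewhere.
build_dir="build-$label"
echo "Would use build directory: $build_dir"
```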
Ponte Vecchio
This node is currently dead.
The Intel Ponte Vecchio node (donated by Intel) is available for direct ssh from a login node (gi001).
OneAPI is available using the intel_comp modules, e.g. module load intel_comp (or module load oneAPI).
MI300X
The MI300X node has 8x GPUs, and is available in the mi300x Slurm partition, or the cosma8-draper partition for DiRAC allocated resources (higher priority).
To use the GPUs, you will need to specify the number of GPUs required in your Slurm parameters, e.g. #SBATCH --gpus=8. By default you will get zero.
The AMD ROCm software stack is installed.
Any codes currently using CUDA will need to be HIP-ified by running the hipify script provided as part of ROCm. Fine tuning may be necessary to optimise performance.
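As a rough illustration of the conversion step, the sketch below assumes that the hipify-perl tool shipped with ROCm is the hipify script available on the node, and uses a hypothetical source file vector_add.cu. It translates the CUDA source to HIP and compiles the result with hipcc, and skips both steps if the tool or the file is missing.

```shell
#!/bin/bash
src="vector_add.cu"          # hypothetical CUDA source file
out="${src%.cu}.hip.cpp"     # translated HIP source: vector_add.hip.cpp

if command -v hipify-perl >/dev/null 2>&1 && [ -f "$src" ]; then
    # hipify-perl writes the translated source to standard output.
    hipify-perl "$src" > "$out"
    # Compile the HIP source with the ROCm compiler driver.
    hipcc "$out" -o "${src%.cu}"
else
    echo "skipping: hipify-perl or $src not found" >&2
fi
```

The translated file is ordinary source code, so it is worth reviewing and committing it rather than regenerating it on every build.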
To get interactive access, you could use srun -p mi300x -A do018 -t 10 --gpus=8 --pty /bin/bash or, for DiRAC allocated resources, srun -p cosma8-draper -A DIRACPROJECT -t 10 --gpus=8 --pty bash. If you want exclusive access to the GPUs (e.g. for benchmarking), add the --exclusive flag.
There have been multiple DiRAC hackathons focused on AMD GPUs, which are very relevant to any users of this system. There will be future hackathons - watch out for them!
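For non-interactive work, the same request can be written as a batch script. This is a sketch under assumptions: the partition and project code come from this page, while the GPU count and time limit are illustrative. It relies on Slurm setting SLURM_GPUS when --gpus is given, and on the rocm-smi tool from the ROCm stack being on the path.

```shell
#!/bin/bash
#SBATCH --partition=mi300x     # or cosma8-draper with a DiRAC project code
#SBATCH --account=do018        # hardware lab AMD project; replace with your own
#SBATCH --gpus=2               # must be set explicitly: the default is zero
#SBATCH --time=00:10:00        # illustrative time limit (assumption)

# SLURM_GPUS is only set inside a job that requested GPUs with --gpus.
requested_gpus="${SLURM_GPUS:-unset}"
echo "GPUs requested from Slurm: $requested_gpus"

# List the GPUs visible to the job, if the ROCm tools are available.
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi
else
    echo "rocm-smi not found on this node" >&2
fi
```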
MI300A
The MI300A system has 4x GPUs, and is available for direct ssh (to ga008) from a login node.
The AMD ROCm software stack is installed.
Any codes currently using CUDA will need to be HIP-ified by running the hipify script provided as part of ROCm. Fine tuning may be necessary to optimise performance.
The MI300A is an APU: The GPU and CPU are part of the same silicon, and share the same physical RAM. Therefore, memory copies between CPU and GPU are not necessary, which can significantly improve performance for many applications, and make it highly appropriate for some codes which cannot easily be ported to more traditional GPU architectures.
H100
There is a single PCIe-based NVIDIA H100 GPU with an X86 (Intel) host, which is available for direct ssh.
RTX PRO 6000
This is a single PCIe-based NVIDIA RTX PRO 6000 GPU with a 256-core AMD Turin host, available by direct ssh.