COSMA job queues

COSMA8 queues: cosma8, cosma8-prince, cosma8-pauper, cosma8-serial, cosma8-shm, cosma8-shm2

Access to the DiRAC COSMA8 system is only available to DiRAC projects with a time allocation on this system. Each COSMA8 compute node has 128 cores and 1TB RAM. If you request fewer than 128 cores for a job using the cosma8-serial queue, you will not be granted exclusive access to a node and may therefore share a node with other jobs. Unless you specify a memory limit in your SLURM script (e.g. #SBATCH --mem=1T), you will be given a fraction of the 1TB equal to the number of cores you request divided by 128. If you do not wish to share a node, use the #SBATCH --exclusive flag. The cosma8-shm queue provides access to two large 4TB nodes, each with 128 cores.
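A minimal sketch of a shared-node job script for the cosma8-serial queue (the project account, core count, memory, runtime and executable are placeholders, not site-mandated values):

```bash
#!/bin/bash
#SBATCH -p cosma8-serial
#SBATCH -A <project>        # placeholder: your DiRAC project account
#SBATCH --ntasks=16         # fewer than 128 cores, so the node may be shared
#SBATCH --mem=128G          # explicit memory request; omit it and you get (16/128) of 1TB
#SBATCH --time=12:00:00     # placeholder runtime
##SBATCH --exclusive        # uncomment to claim the whole node instead

./my_program                # placeholder executable
```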

The cosma8-prince queue requires special access permission; it gives jobs elevated priority and allows them to run for up to 30 days.

The cosma8-pauper queue is a low-priority queue with a runtime limit of 1 day; it is used automatically by projects that have exceeded their time allocation.

cosma8-shm, cosma8-shm2

The cosma8-shm and cosma8-shm2 queues are small queues with non-standard hardware.

cosma8-shm contains:

  • mad04: 128 cores, 4TB RAM, 0-3 NVIDIA A100 GPUs

  • mad05: 128 cores, 4TB RAM, 0-3 NVIDIA A100 GPUs

  • mad06: Milan-X architecture, 128 cores, 1TB RAM, large (768MB) L3 cache

The A100 GPUs are “composable”, that is, they can be allocated to different servers on demand using a Liqid fabric. By default, mad04 and mad05 each contain one GPU (as does login8b). However, should more than one GPU be required for a particular workload, this can be arranged (at the expense of moving GPUs from the other nodes).
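A hedged sketch of requesting one of these GPUs via SLURM, assuming they are exposed as generic resources (GRES) under the name gpu; the GRES name, project account and runtime are assumptions, so check the local configuration before relying on them:

```bash
#!/bin/bash
#SBATCH -p cosma8-shm
#SBATCH -A <project>        # placeholder project account
#SBATCH --nodelist=mad04    # target one of the GPU-capable nodes
#SBATCH --gres=gpu:1        # assumption: the A100s are configured as a GRES named "gpu"
#SBATCH --time=04:00:00     # placeholder runtime

./my_gpu_program            # placeholder executable
```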

cosma8-shm2 contains:

  • ga004: 128 cores, 1TB RAM, AMD MI100 GPU

  • ga005: 64 cores, 1TB RAM, 2x AMD MI200 GPU

  • ga006: 64 cores, 1TB RAM, 2x AMD MI200 GPU

Should you wish to submit to a particular node, or exclude a particular node (e.g. to ensure that you have access to a particular type of GPU), you can use the #SBATCH --exclude=ga004 directive within your batch script (which in this case would exclude ga004, thus ensuring you get a node with 2x AMD MI200 GPUs), or #SBATCH --nodelist=ga004 (which in this case would submit only to the ga004 node).
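For example, a minimal sketch of a cosma8-shm2 script using these directives (project account, runtime and executable are placeholders):

```bash
#!/bin/bash
#SBATCH -p cosma8-shm2
#SBATCH -A <project>        # placeholder project account
#SBATCH --exclude=ga004     # avoid ga004, leaving only the 2x MI200 nodes
##SBATCH --nodelist=ga004   # alternatively, uncomment to run only on ga004
#SBATCH --time=04:00:00     # placeholder runtime

./my_program                # placeholder executable
```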

COSMA7 queues: cosma7, cosma7-rp, -pauper and -prince

Access to the DiRAC COSMA7 machine is provided using the two queues cosma7 and cosma7-rp (and their increased or reduced priority versions, cosma7-pauper, cosma7-prince, cosma7-rp-pauper and cosma7-rp-prince). These two queues provide access to half of COSMA7, with 224 nodes in each. cosma7 uses a 100Gbit/s InfiniBand fabric, while cosma7-rp uses a 100Gbit/s Rockport Ethernet fabric (6D torus topology).

All of these queues are configured so that jobs have exclusive access to nodes: no two jobs share a compute node. Therefore, even if you only need a single core, your project allocation will still be charged for all 28 cores of the node.

The main use of these queues is for DiRAC projects that have been assigned time at Durham, and the mix of jobs expected to match the system's capabilities is MPI/OpenMP/hybrid jobs using up to 28 cores per node with a maximum memory of 512GB per node. If a job can run using fewer resources than a single node, it should be packaged into a batch job with appropriate internal process control to scale up to this level.
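A minimal sketch of an MPI job script for the cosma7 queue (the project account, node count, runtime and executable are placeholders to adapt to your own workflow):

```bash
#!/bin/bash
#SBATCH -p cosma7
#SBATCH -A <project>            # placeholder project account
#SBATCH --nodes=4               # whole nodes: access on cosma7 is exclusive
#SBATCH --ntasks-per-node=28    # 28 cores per COSMA7 node
#SBATCH --time=24:00:00         # placeholder runtime (queue limit is 72 hours)

mpirun -np $SLURM_NTASKS ./my_mpi_program    # placeholder executable
```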

In addition to the hardware limits the queues have the following limits and priorities:

Name                  Priority  Maximum run time  Maximum cores
cosma7 and cosma7-rp  Normal    72 hours          unlimited
-pauper               Low       24 hours          unlimited
-prince               Highest   30 days           4096

The pauper and prince variations of the queues share the same resources, so the order in which jobs run is decided by a number of factors. Higher-priority jobs run first, and jobs in higher-priority queues will always be scheduled before lower-priority ones; however, it may not superficially look that way, because jobs from lower-priority queues may run as back-fills (this is allowed when a lower-priority job will complete before the resources needed for a higher-priority one become available, so setting a run-time limit for your job may get it completed more quickly). See the FAQ for how to make use of back-filling.
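For example (the value itself is a placeholder), giving the scheduler a realistic, short run-time limit makes your job a candidate for back-filling:

```bash
#SBATCH --time=02:00:00    # a realistic upper bound on run time lets the scheduler back-fill this job
```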

COSMA6 queues: cosma6, cosma6-pauper and cosma6-prince

COSMA6 was retired in April 2023 after 11 years of service.

COSMA5 queues: cosma5, cosma, cosma-pauper and cosma-prince, cordelia, cosma-analyse, cosma-bench

Access to the Durham COSMA5 machine is provided using the five queues cosma5, cosma, cosma-pauper, cosma-prince and cordelia. The cosma5 queue comprises the new (2024) COSMA5 nodes: a total of 3 nodes, each with 256 cores and 1.5TB RAM. The next three queues share all of the approximately 160 older compute nodes, dispatching jobs according to the priorities assigned to the queues. All of the queues except cosma5 and cordelia are configured so that jobs have exclusive access to nodes, meaning that no two jobs share a compute node on those queues.

The cordelia queue should be used for single-core jobs, which will share computational resources on a single node with other jobs. This allows efficient use of the cluster. When using the cordelia queue, please specify the maximum memory your job will require, e.g. #SBATCH --mem=10G will reserve 10GB for you and allow jobs on the other cores to use the rest.
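A minimal sketch of such a single-core cordelia job (project account, memory, runtime and executable are placeholders):

```bash
#!/bin/bash
#SBATCH -p cordelia
#SBATCH -A <project>        # placeholder project account
#SBATCH --ntasks=1          # single core; the node is shared with other jobs
#SBATCH --mem=10G           # reserve 10GB, leaving the rest of the node's memory for others
#SBATCH --time=48:00:00     # placeholder runtime (queue limit is 30 days)

./my_serial_program         # placeholder executable
```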

Likewise, when using the cosma5 queue, it is a good idea to specify your memory requirement; otherwise you will be given a default allocation of 6GB per requested core.

The main use of these queues is for Durham projects that have been assigned time at Durham, and the mix of jobs expected to match the system's capabilities is MPI/OpenMP/hybrid jobs using up to either 256 cores per node (cosma5 queue) or 16 cores per node (older cosma queues), with a maximum memory of either 1.5TB per node (cosma5 queue) or 126GB per node (older cosma queues). Note that the old nodes in COSMA5 are diskless, so this represents a hard memory limit and exceeding it will cause your job to fail. If a job can run using fewer resources than a single node, it should either be submitted to the cordelia queue or packaged into a batch job with appropriate internal process control to scale up to this level, as sketched below.
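A hedged sketch of packaging sub-node work into one exclusive-node job on the older cosma queue (the program name, inputs, task count and runtime are placeholders):

```bash
#!/bin/bash
#SBATCH -p cosma
#SBATCH -A <project>        # placeholder project account
#SBATCH --nodes=1           # one exclusive 16-core node
#SBATCH --time=12:00:00     # placeholder runtime

# Run 16 independent serial tasks concurrently, one per core,
# and wait for all of them to finish before the job ends.
for i in $(seq 1 16); do
    ./my_serial_program input_${i}.dat > output_${i}.log &    # placeholder program and inputs
done
wait
```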

In addition to the hardware limits the queues have the following limits and priorities:

Name          Priority  Maximum run time  Maximum cores  Cores per node  Nodes
cosma5        Normal    72 hours          unlimited      256             3
cosma         Normal    72 hours          unlimited      16              ~160
cosma-pauper  Low       24 hours          unlimited      16              ~160
cosma-prince  Highest   30 days           4096           16              ~160
cordelia      Normal    30 days           16             16              ~14

The older queues share the same resources, so the order in which jobs run is decided by a number of factors. Higher-priority jobs run first, and jobs in higher-priority queues will always be scheduled before lower-priority ones; however, it may not superficially look that way, because jobs from lower-priority queues may run as back-fills (this is allowed when a lower-priority job will complete before the resources needed for a higher-priority one become available, so setting a run-time limit for your job may get it completed more quickly). See the FAQ descriptions for how to make use of back-filling.

The number of nodes is approximate because, due to the age of these servers (2012), nodes regularly die. Originally, COSMA5 had around 320 nodes.

The cosma queues are exclusive (except cosma5 and cordelia), meaning that no matter how many cores you request per node, you will have exclusive use of that node; your allocation will be charged as if you were using all of its cores.

The cosma5, cosma and cosma-pauper queues are available to all users; the -prince queue can only be accessed by special arrangement.

The quarterly allocation can be found on the COSMA usage pages; you will need your COSMA username and a password or token (generated using the cusagetoken command) to view these.

Jobs within the same queue are scheduled using a fairshare arrangement, so each user initially has the same priority; this is then weighted by a formula based on the resources already used.

Note that the order of running can again be affected by back-filling (which will only work if a job is given a run time) and by using fewer resources than other jobs.