The Lustre lab
The Hardware Lab includes multiple Lustre file systems, mostly in production in COSMA, but some experimental.
File system details
/cosma8
A 2.12.6 system comprised of:
4 metadata servers (MDSs)
Operating in HA pairs (manual failover)
Dell PowerEdge servers
Each with 4x 6.4TB NVMe drives (internal)
Two RAID1 pairs
Synced to a corresponding pair using DRBD
One metadata target (MDT) per MDS in normal operation
Up to two during failover
ldiskfs
20 object storage servers (OSSs)
Operating in HA pairs (manual failover)
Dell PowerEdge servers
Each with 168x 16TB drives attached
Two Dell ME484 JBODs, accessed by each server in the pair
7 object storage targets (OSTs) per OSS
Up to 14 during failover
ZFS
/cosma7
A 2.12.6 system comprised of:
2 MDS
Dell PowerEdge servers
Shared Dell Powerstore RAID controller
Providing one MDT per server
ldiskfs
4 OSS
Dell PowerEdge servers
Pair together for manual HA
Each pair accessing a pair of Dell ME5084 RAID controllers
84x 16TB drives in each controller
Providing 2 OSTs per server
ldiskfs
/cosma5
A 2.12.6 system comprised of:
1 MDS/OSS
1 OSS
Dell PowerEdge servers
168 drives shared between them, 12TB
ZFS for OSTs and MDTs
MDTs are RAID1 with no HA
/cosma6
A 2.16.1 system comprised of repurposed hardware from the old /cosma6 storage
1 MDS
Dell PowerEdge servers
And one cold spare
ldiskfs
3 OSSs
Dell PowerEdge servers
And one cold spare
ZFS
3 ME484 JBODs
1 SSD RAID controller for MDTs
A single point of failure
An interesting setup with each server connected to two JBODs, allowing failure of any server or any JBOD.
Care must be taken with multipath labelling when replacing disks in this system.
/snap7
An ultra-fast NVMe-based file system with 2.12.6 and:
1 MDS
Dell PowerEdge servers
1 MDT (ldiskfs)
20 OSS
Dell PowerEdge servers
8 OSTs each (single 3.2TB NVMe disks, ldiskfs)
This file system has no redundancy in the OSTs (if a disk fails, that OST will be lost. Achieving read/write speeds of around 200GByte/s, this is believed to have been the fastest file system in Europe at the time of installation.
/snap8
A similar design to /snap7, again 2.12.6, with:
1 MDS
Dell PowerEdge servers
4 MDTs (each a RAID1 pair of NVMe drives, ldiskfs)
24 OSSs
Dell PowerEdge servers
8 OSTs each ( single 6.4TB NVMe disks, ldiskfs)
Read and write speeds up to around 400GByte/s have been measured.
The /snap file systems are designed with a capacity equal to approximately twice the cluster RAM, to enable two memory snapshots (simulation checkpoints) to be stored simultaneously.
Monitoring
We use a Lustre node exporter and grafana to monitor the Lustre file systems.
The Lustre journey
All systems are self-installed and managed using vanilla opensource Lustre.
Exciting incidents!
Hardware and software sometimes fails. Here we document some of the interesting times we have had with Lustre.
The COSMA8 incident, autumn 2021
A disk failure, which should have been routine, caused a zpool to lock itself. At the time, pacemaker was configured for automatic failover. This then kicked into action since Lustre was unable to write to the frozen pool. When the pool started up on the HA pair, again it couldn’t be written to, and so HA failed over again. This destroyed the zpool, and data was lost.
Fortunately, this was just after commissioning, so the amount of data lost was small and could be compensated for.
We then disabled pacemaker on all our Lustre systems.
MDT Raid controller failuures, January 2026
Two raid failures in the space of two weeks.
The first one was fine: DRBD continued to work in diskless mode, replicating data onto the correctly working server. The fix was fairly simple:
umount the MDT
demote DRBD to secondary
promote DRBD to primary on the HA pair
mount the MDT on the HA pair
run an lfsck
The server was then rebooted, and the raid controller came back to life. This process was then reversed to fail back over.
A raid card in another server then failed less than two weeks over. Upon following the previous recipe, the MDT failed to mount on its HA pair, indicating that the underlying ldiskfs (ext4) filesystem had been corrupted.
An e2fsck was performed which identified and removed some file system inconsistencies.
Again, the MDT failed to mount, with the messages file showing a key message: can't open oi.16.6. This is an internal Lustre file, and must have been corrupt.
The instructions for a backend file system level backup was then partly followed, but primarily this involved, after unmounting all Lustre clients:
Mounting the device as an ldiskfs file system (
mount -t ldiskfs /dev/mdt3 /mnt/tmp).Removing files here:
rm -rf oi.16* lfsck_* LFSCK CATALOGSUnmounting the device
Mounting as a Lustre mdt mount (
mount -t lustre /dev/mdt3 /mnt/tmp).
An lctl lfsck was then performed to repair any inconsistencies. This took around four hours, repairing some layout and namespace elements. We think that it is unlikely that any user data was lost.