Pauper Queues
To complement the main scheduling system, COSMA also includes pauper queues. These are lower-priority queues with shorter maximum runtimes. Projects that have exceeded their quarterly allocation are automatically directed to use these queues.
The purpose of the pauper queues is to allow continued access to the system while ensuring fair resource distribution. Jobs submitted to these queues will only run when there are otherwise idle resources available, helping to maintain high overall system efficiency without disrupting higher-priority workloads.
Pauper access is also granted during the first quarter after a project has ended, that is, once it no longer has an active allocation. This grace period is intended to allow any remaining work to be completed. Without it, projects might be forced to abandon their work, or to pause it until new resources are awarded, which in practice could also end in abandonment. This access is not intended for starting new work. It also reflects the view that working to artificial deadlines can be difficult.
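The access rules described above can be summarised as a small decision function. This is only an illustrative sketch, not the scheduler's actual logic: the function name, the 91-day approximation of a "quarter", and the "none" result once the grace period has passed are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: a "quarter" is approximated as 91 days. The real system may
# use calendar or allocation-period boundaries instead.
GRACE_PERIOD = timedelta(days=91)


def queue_class(used: float, allocated: float,
                project_end: Optional[date], today: date) -> str:
    """Return 'standard', 'pauper' or 'none' for a hypothetical project.

    used / allocated are the hours consumed and awarded this quarter;
    project_end is None while the project is still active.
    """
    if project_end is not None and today > project_end:
        # Ended project: pauper access only during the grace period.
        # What happens afterwards is an assumption of this sketch.
        if today - project_end <= GRACE_PERIOD:
            return "pauper"
        return "none"
    # Active project: pauper only once the quarterly allocation is exceeded.
    if used > allocated:
        return "pauper"
    return "standard"
```

For example, an active project that has used 120 of its 100 allocated hours would be classed as "pauper", as would a project that ended one month ago.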
Backfilling
All COSMA queues support backfilling, a key feature for improving system efficiency. Backfilling allows smaller or shorter jobs to be run out of order if they can fit into gaps in the schedule without delaying higher-priority jobs.
This is possible because longer-running jobs often hold reservations on resources for a future start time, as a result of their priority and requirements, leaving those resources idle until the reservation begins.
For backfilling to work effectively, it is essential that users provide realistic runtime estimates when submitting jobs. If all jobs are submitted with identical or overly conservative (overestimated) runtimes, the scheduler cannot tell which jobs would fit into a gap, and the system behaves more like a simple priority queue, reducing overall efficiency.
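The effect of runtime estimates on backfilling can be seen in a simplified model. The sketch below is not the scheduler's implementation: the Job fields, the function name, and the single-gap model (a fixed number of idle nodes and a fixed time until the next reservation) are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    nodes: int          # nodes requested
    est_runtime: float  # user-supplied runtime estimate, in hours


def backfill_candidates(queue: List[Job], idle_nodes: int,
                        hours_until_reservation: float) -> List[str]:
    """Return names of queued jobs that can start now without delaying
    the reservation held by a higher-priority job.

    The queue is scanned in priority order; a job is started only if it
    fits in the idle nodes AND its estimated runtime ends before the
    reservation begins.
    """
    started = []
    free = idle_nodes
    for job in queue:
        fits_nodes = job.nodes <= free
        fits_time = job.est_runtime <= hours_until_reservation
        if fits_nodes and fits_time:
            started.append(job.name)
            free -= job.nodes
    return started
```

With 8 idle nodes and 4 hours until the next reservation, a queue of jobs requesting (4 nodes, 72 h), (4 nodes, 2 h), (8 nodes, 1 h) and (2 nodes, 3.5 h) would start the second and fourth jobs. If every job had instead been submitted with a 72-hour estimate, none would fit and the nodes would sit idle until the reservation began.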
Backfilling is also an excellent tool for developers, as it allows short jobs to run quickly, helping to speed up the development and testing cycle.
Assumptions for Good Outcomes in This System
While assigning equal shares to all users may seem at odds with the reality that projects have differing allocations, this approach has worked effectively for many years. Most projects have been able to use their allocations successfully. Occasional complaints typically stem from misunderstandings about how the system operates, rather than from systemic issues.
A key underlying assumption is that projects with large allocations also have a large number of active users. This ensures that resources are consistently in use, either through many users submitting jobs or through the regular submission of large jobs. Until recently, this assumption generally held true.
Another important factor is the presence of some idle capacity in the system. A certain level of unused resources is necessary to maintain responsiveness and flexibility, allowing the scheduler to accommodate jobs efficiently and avoid bottlenecks.