System Usage
Nightly batch and disk use.
Each night the batch queues and filesystems are processed to produce summary information about the usage of the various SLURM clusters and main storage file systems up to that time.
All this can be viewed in the usage web pages found at:
This is where you can find out if your DiRAC project has gone over budget for the current quarter (when this happens your submissions will be automatically demoted to the pauper queue) or how your usage is progressing, as well as who is using the project storage (you can only view your own quotas and those of groups you belong to using the command-line tools). To access this you will need your COSMA username and password.
For simplicity some of this information can be viewed from the command line once logged in using the commands:
# diracusage
and
# diskusage
Note that you will need the cosma module loaded (this happens by default, unless
you have changed this in your login scripts). Both commands have many options; use the
--help flag to see them.
Seeing the resources used by a running job.
To analyse usage of a node running a current job, you can use Performance Co-Pilot or the cjobload command.
cjobload
Running: cjobload -n 2 JOBID will show the CPU load and RAM for the job given, every 2 seconds, for each node used by the job.
There are also c8jobload, c7jobload, c7rpjobload and c5jobload variants that show values for all the currently running jobs
on these partitions. See the --help of cjobload for additional options.
Performance Co-Pilot
Performance Co-Pilot (PCP) is a system for monitoring the status of computers, and is installed on the compute nodes of COSMA. Some of the information that it monitors (so-called metrics) is available for remote access, so you can use it to check things like the load and memory use of any node, without any need for special privilege (logging into compute nodes is not allowed, unless you are a member of the developers group).
PCP offers a lot of commands, which can be somewhat overwhelming at first glance (and the documentation is usually focused on system configuration), so we provide a couple of scripts that show what is useful when determining whether a job is using the compute resources as expected, namely the activity of the CPUs and the memory use:
# pcpslurmnode
# pcpslurmjobs
See the --help output for more options, but for instance if you know the
node m8001 is running your job you can do:
# pcpslurmnode m8001
#- Node: m8001, Cpus: 256, Tasks: 3326 total, 129 running, 51 sleeping
#- Memory: 1006.92 total, 360.25 free, Load averages: 128.67,128.58,128.49
# JobID CPUs NodeCPU% Mem NodeMem%
# ----- ---- -------- --- --------
9175206 128.32 50.13 625.37 62.11
Besides the obvious use of CPUs and memory, the interesting numbers here are the number of running processes and the 1, 5 and 15 minute load averages. These all suggest that the correct number of cores is being used, i.e. 128. Higher loads usually indicate that too many processes or threads are being used.
You can also give a range of nodes or several node names, but to see all the nodes for a job use:
# pcpslurmjobs 9175206
# Node usage by jobs (for detailed usage use pcpslurmnode):
# Node JobID CPUs MEM
# ---- ----- ---- ---
m8146 9175206 127.40 625.123
m8075 9175206 127.31 625.477
m8266 9175206 127.38 623.506
m8152 9175206 127.27 625.148
m8145 9175206 127.37 625.218
m8156 9175206 127.38 623.514
m8279 9175206 127.35 625.228
m8330 9175206 127.37 623.851
m8252 9175206 127.38 623.873
m8001 9175206 127.36 625.381
m8086 9175206 127.44 625.404
m8155 9175206 127.32 624.141
m8288 9175206 127.37 624.376
m8044 9175206 127.40 625.231
m8267 9175206 127.40 624.737
m8278 9175206 127.33 624.779
m8251 9175206 127.36 622.996
m8167 9175206 127.33 623.896
m8277 9175206 127.43 624.720
m8265 9175206 127.38 623.966
m8111 9175206 127.40 625.391
m8280 9175206 127.42 623.801
m8161 9175206 127.35 623.892
m8154 9175206 127.37 624.156
m8169 9175206 127.40 624.374
# Summary of job usage:
# JobID CPUs CPUsNode CPUsEffic MemTotal MemNode MemEffic NumNodes
# ----- ---- -------- --------- -------- ------- -------- --------
9175206 3184.26 127.37 99.51 15612.18 624.49 62.08 25
So you can see this job is probably using an MPI rank per core and has a high memory use.
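The CPU columns of the summary line can be reproduced from the per-node rows. The following is a minimal sketch in Python, assuming the four-column row layout shown above and 128 physical cores per node (the COSMA8 figure). The function name summarise_job is illustrative and is not part of the COSMA tools, and the efficiency formula is an assumption that happens to match the sample output.

```python
def summarise_job(rows, cores_per_node=128):
    """Aggregate per-node 'Node JobID CPUs MEM' rows into a job summary.

    cores_per_node is an assumption (128 physical cores, as on COSMA8).
    """
    cpus = []
    mems = []
    for line in rows:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        _node, _jobid, cpu, mem = line.split()
        cpus.append(float(cpu))
        mems.append(float(mem))
    if not cpus:
        raise ValueError("no node rows found")
    n = len(cpus)
    cpu_total = sum(cpus)
    return {
        "nodes": n,
        "cpus_total": round(cpu_total, 2),
        "cpus_per_node": round(cpu_total / n, 2),
        # Assumed definition: busy CPUs as a percentage of physical cores.
        "cpu_efficiency": round(100.0 * cpu_total / (n * cores_per_node), 2),
        "mem_total": round(sum(mems), 2),
    }


# The first three node rows from the pcpslurmjobs output above.
sample = """\
# Node JobID CPUs MEM
m8146 9175206 127.40 625.123
m8075 9175206 127.31 625.477
m8266 9175206 127.38 623.506
""".splitlines()

summary = summarise_job(sample)
print(summary)
```

Feeding in all 25 rows instead of the first three should give totals close to the summary line printed by pcpslurmjobs.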
The following sections describe a somewhat arbitrary selection of other, hopefully useful, commands.
Machine Overview
To get a more detailed overview of memory and CPU usage on a compute node, use the pmstat command with the -h option to name the node to query:
# pmstat -h m7014
loadavg memory swap io system cpu
1 min swpd free buff cache pi po bi bo in cs us sy id
27.00 0 478g 177040 2249m 0 0 0 0 29K 2924 47 1 52
27.00 0 478g 177040 2249m 0 0 0 0 28K 1831 47 1 52
This is updated every 5 seconds until interrupted. You can change the update interval to get a faster or slower rate using the -t flag:
# pmstat -h m7014 -t 1min
# pmstat -h m7014 -t .5sec
You can also query more than one node at a time:
# pmstat -h m7014 -h m7250
node loadavg memory swap io system cpu
1 min swpd buff cache pi po bi bo in cs us sy id
m7014.p 27.00 0 478g 2249m 0 0 0 0 28K 742 47 1 52
m7250.p 28.00 0 442g 20604m 0 0 0 0 29K 1262 49 1 50
m7014.p 27.00 0 478g 2249m 0 0 0 11 29K 3774 47 1 52
m7250.p 28.00 0 442g 20604m 0 0 0 25 29K 1250 49 1 50
m7014.p 27.00 0 478g 2249m 0 0 0 0 28K 1793 47 1 52
m7250.p 28.00 0 442g 20604m 0 0 0 20 29K 1146 49 1 50
but that seems to have a bug (the column labelled buff should obviously be free).
The pmstat man page describes the various fields, but loadavg is probably the first thing to look at. It is roughly the number of runnable processes (not quite; for more details see the uptime man page). In general this should not exceed the number of physical cores on the machine, so an optimal load for COSMA7 is 28 and for COSMA8 128. If your code uses hyper-threads then these numbers may double. In this case we can see a job using fewer cores, which may be necessary to achieve the required memory per core. Hybrid codes will also not in general show loads equal to the number of cores, as they will not be continually runnable. It would be nice if they were, but don’t be fooled into thinking that is bad: most MPI codes that use a single rank per core are also not continually busy, they are usually just spinning the CPU waiting for communication.
You can get the 1min, 5min and 15min loads using the pcp dstat sub-command:
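This rule of thumb can be written down as a small check. The following is a sketch in Python, assuming the core counts quoted above (28 for COSMA7, 128 for COSMA8). The function name, the labels it returns and the 5% margin on the high side are arbitrary illustrations, not part of PCP or the COSMA tools.

```python
PHYSICAL_CORES = {"cosma7": 28, "cosma8": 128}  # figures quoted in the text


def classify_load(loadavg, machine, hyperthreads=False, margin=0.05):
    """Compare a 1 minute load average with the expected core count.

    margin is an arbitrary illustrative allowance, not a COSMA policy.
    """
    cores = PHYSICAL_CORES[machine]
    if hyperthreads:
        cores *= 2  # a code using hyper-threads may show double the load
    if loadavg > cores * (1 + margin):
        return "over"   # more runnable processes than cores
    if round(loadavg) < cores:
        return "under"  # fewer runnable processes than cores
    return "full"


# First data line of the pmstat sample for m7014; loadavg is the first field.
pmstat_line = "27.00 0 478g 177040 2249m 0 0 0 0 29K 2924 47 1 52"
load = float(pmstat_line.split()[0])
print(load, classify_load(load, "cosma7"))
```

An "under" result is not necessarily a problem: as noted above, a job may deliberately use fewer cores to get more memory per core.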
# pcp -h m7001 dstat -l
---load-avg---
1m 5m 15m
27.9 27.8 27.7
27.9 27.8 27.7
27.9 27.8 27.7
27.9 27.8 27.7
dstat also has options to report CPU and memory use. See the pcp-dstat man page.
Detailed Memory Use
The memory profile of the whole machine is important when understanding if your code is making optimal use of the memory. There are various commands that report memory use, but a good one is:
# pmrep -h m7420 :sar-r
kbmemfree kbmemavai kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
17:44:30 337693080 386612632 190087944 36.02 128736 49919624 127783768 23.84 146485992 29643324 188
17:44:31 337693080 386612632 190087944 36.02 128736 49919624 127783768 23.84 146485992 29643324 188
17:44:32 337693080 386612632 190087944 36.02 128736 49919624 127783768 23.84 146485992 29643324 188
17:44:33 337685304 386610904 190095720 36.02 128736 49926392 127783768 23.84 146488008 29647356 10716
This is a clone of the output from the sar -r command, and reports much of the node memory information from the /proc/meminfo file. The pmrep command, like pcp dstat, has a large number of possible reports; these are defined in the file /etc/pcp/pmrep/pmrep.conf.
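The %memused column can be cross-checked against the kB columns. The following is a sketch in Python, assuming that %memused is kbmemused divided by the sum of kbmemfree and kbmemused, which reproduces the 36.02 in the sample. The column keys follow the header above, and parse_sar_r is an illustrative helper rather than a PCP function.

```python
COLUMNS = [
    "time", "kbmemfree", "kbmemavai", "kbmemused", "memused_pct",
    "kbbuffers", "kbcached", "kbcommit", "commit_pct",
    "kbactive", "kbinact", "kbdirty",
]


def parse_sar_r(line):
    """Split one ':sar-r' data line into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError("unexpected number of columns")
    row = dict(zip(COLUMNS, fields))
    # Every column except the timestamp is numeric.
    return {k: (v if k == "time" else float(v)) for k, v in row.items()}


# First data line of the sample output above.
row = parse_sar_r(
    "17:44:30 337693080 386612632 190087944 36.02 128736 49919624 "
    "127783768 23.84 146485992 29643324 188"
)

# Assumed definition of the total: free plus used memory, in kB.
total_kb = row["kbmemfree"] + row["kbmemused"]
used_pct = 100.0 * row["kbmemused"] / total_kb
print(f"total {total_kb / 1024**2:.1f} GiB, used {used_pct:.2f}%")
```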
NUMA Balance
A well-behaved batch job will distribute itself over all the cores and sockets of the machine. A classic issue is sharing a single core between more than one pinned thread, so checking that both the CPUs and cores are loaded as expected can be a good idea. The socket balance can be seen using:
# pmrep -h m7400 :numa-per-node-cpu
NUMA n %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
17:47:09 node0 49.97 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 50.01
17:47:09 node1 50.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50.01
17:47:10 node0 50.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 49.92
17:47:10 node1 49.92 0.00 0.14 0.00 0.00 0.00 0.00 0.00 0.00 50.00
In this case the balance is good: each NUMA node has the same work and all the cores are in use, as 50% equals all the physical cores (the other 50% would be used if hyper-threading was required). Note that nodes in this case are equivalent to a socket holding a CPU; some architectures allow CPUs to be split into more NUMA regions (COSMA8 has 4 per socket), usually depending on how the memory is accessed.
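A balance check of this kind can also be scripted. The following is a sketch in Python that averages the %usr column per NUMA node from lines in the format shown above and reports the spread between nodes. The helper name and the 5 percentage point threshold are arbitrary illustrations.

```python
from collections import defaultdict


def usr_by_numa_node(lines):
    """Average the %usr column per NUMA node.

    Expects data lines of the form 'HH:MM:SS nodeN %usr %nice ...'.
    """
    samples = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[1].startswith("node"):
            continue  # header or unrelated line
        samples[fields[1]].append(float(fields[2]))
    return {node: sum(v) / len(v) for node, v in samples.items()}


# Header and data lines from the sample output above.
sample = """\
NUMA n %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
17:47:09 node0 49.97 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 50.01
17:47:09 node1 50.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 50.01
17:47:10 node0 50.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 49.92
17:47:10 node1 49.92 0.00 0.14 0.00 0.00 0.00 0.00 0.00 0.00 50.00
""".splitlines()

means = usr_by_numa_node(sample)
spread = max(means.values()) - min(means.values())
balanced = spread < 5.0  # arbitrary illustrative threshold, in percentage points
print(means, balanced)
```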
The activity per core can be seen using:
# pmrep -h m7400 :mpstat-P-ALL
the output can be very long (256 lines for COSMA8), but looking closely you can see which CPUs are active. If you have pinning enabled these should remain fixed.
Infiniband Use
The only IO of interest on a typical node is how much traffic is being generated on the infiniband fabric. There are potentially a lot of metrics that can be reported about this, but the command:
# pmrep -h m7400 infiniband.port.total.bytes -t 3s -b MB
will report the total megabytes per second averaged over 3 seconds.
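For context, the arithmetic behind such a rate is simple. The following sketch in Python assumes that the metric is a running byte counter sampled at the start and end of each interval, and that a megabyte here means 1024*1024 bytes; both are assumptions, so check the pmrep man page if the exact unit matters. The counter readings are made up for illustration.

```python
def rate_mb_per_s(bytes_before, bytes_after, interval_s,
                  bytes_per_mb=1024 * 1024):
    """Average transfer rate between two readings of a byte counter.

    bytes_per_mb is an assumption about what 'MB' means in the report.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (bytes_after - bytes_before) / interval_s / bytes_per_mb


# Made-up counter readings taken 3 seconds apart, for illustration only.
rate = rate_mb_per_s(0, 3 * 100 * 1024 * 1024, 3.0)
print(f"{rate:.1f} MB/s")
```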
Visualization
The metrics that can be reported from a node can be seen using the command:
# pminfo -h m7400
each of these can be queried using the pmval or pmrep commands, they can also be visualized in a graph using the pmchart command. This is very flexible and can be used to see the metrics of more than one node at a time, as in:
# pmchart -h m7001 m7002 m7003
you can then select the metrics of interest from the nodes for comparison, and can also capture the information. There is fuller help here.
A view is a set of predefined plots, and CPU is one such view. Other useful views are Memory and Overview, but if using Overview increase the vertical size to 1000. On COSMA the infiniband read and write rates are also available in the IB view.
Final Words
As was noted at the beginning, and as you will have seen if you have looked at the output of pminfo, PCP will allow you to monitor a lot of details of the compute nodes, but there are some limitations. The metrics that make reports at the process level require privilege that is not available using remote access, so if you try them they will fail (a shame, as pcp atop would be very nice, like a remote top command). All metrics are available locally, just leave the -h part out.
It is also not possible to see any archival material, so you need to check things when your job is running.