How to monitor Slurm jobs
Information on all running and pending batch jobs managed by Slurm can be
obtained with the Slurm command squeue. Note that information on completed
jobs is only retained for a limited period. Information on jobs that ran in
the past is available via sacct. An example of squeue output is shown below.
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18957 short-ser mean user1 R 0:01 1 host147
18956 short-ser calc user2 R 48:38 1 host146
18967 test wrap user1 R 14:25 1 host146

Here, the field ST is the job state and TIME is the time used by the job.
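
As a rough illustration of working with this output in a script, the sketch below counts jobs per state with awk. It assumes the default column layout shown above (ST as the fifth field) and job names without spaces. The name count_by_state is a hypothetical helper, not a Slurm command.

```shell
#!/bin/sh
# Count jobs per state from squeue-style output read on stdin.
# Assumption: default squeue columns, so the state (ST) is field 5.
count_by_state() {
    awk 'NR > 1 { count[$5]++ }                      # skip the header line
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input copied from the example above. On a cluster, the
# here-document would be replaced by:  squeue | count_by_state
count_by_state <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18957 short-ser mean user1 R 0:01 1 host147
18956 short-ser calc user2 R 48:38 1 host146
18967 test wrap user1 R 14:25 1 host146
EOF
```

For the three sample jobs this prints a single line, `R 3`, since all of them are running.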
A batch job passes through several states in the course of its execution. The typical job states are defined in Table 1.
Table 1: Job states
| Symbol | Job state | Description |
|---|---|---|
| PD | Pending | The job is waiting in a queue for allocation of resources |
| R | Running | The job is currently allocated to a node and is running |
| CG | Completing | The job is finishing but some processes are still active |
| CD | Completed | The job has completed successfully |
| F | Failed | The job terminated with a non-zero exit value |
| TO | Timeout | The job was terminated by Slurm after reaching its runtime limit |
| S | Suspended | A running job has been stopped with its resources released to other jobs |
| ST | Stopped | A running job has been stopped with its resources retained |
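
In a script it is often enough to know whether a job is still in the system or has reached a final state. The sketch below groups the state symbols from Table 1 that way. The function name state_is_active is hypothetical, and only the states listed in Table 1 are handled; Slurm defines further states that are treated here as "not active".

```shell
#!/bin/sh
# Return success (0) if the compact state symbol denotes a job that is
# still queued, running, or paused; return failure (1) otherwise.
# Assumption: only the symbols from Table 1 are considered.
state_is_active() {
    case "$1" in
        PD|R|CG|S|ST) return 0 ;;   # pending, running, completing, suspended, stopped
        *)            return 1 ;;   # CD, F, TO and anything not listed
    esac
}

# Usage: classify a few symbols.
for st in PD R CD TO; do
    if state_is_active "$st"; then
        echo "$st: still active"
    else
        echo "$st: finished"
    fi
done
```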
The most commonly used commands and their options for monitoring batch jobs are listed in Table 2.
Table 2: Important Slurm commands and their options for monitoring jobs
| Slurm Command | Description |
|---|---|
| squeue | Displays information for all jobs running and pending on the cluster |
| squeue --user=username | Displays running and pending jobs for an individual user |
| squeue --states=PD | Displays information for pending jobs (PD state) and their reasons |
| squeue --states=all | Displays jobs in all states, not only running and pending ones |
| scontrol show job JOBID | Shows detailed information about your job (JOBID = job number) by searching the current event log file |
| sacct -b | Shows a brief listing of past jobs |
| sacct -l -j JOBID | Shows detailed historical information for the past job with the given JOBID |
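
These commands can be combined in a small polling loop. The following is a minimal sketch, not a recommended site practice: it assumes a single, non-array job and uses the squeue options -h (no header), -j (select a job) and -o "%T" (print the state in extended form). The names wait_for_job and POLL_INTERVAL are hypothetical and introduced only for this example.

```shell
#!/bin/sh
# Poll squeue until the job leaves the active states listed in Table 1.
# POLL_INTERVAL (seconds) is a hypothetical variable for this sketch.
wait_for_job() {
    jobid=$1
    interval=${POLL_INTERVAL:-30}
    while :; do
        # Empty output (or an error) means squeue no longer knows the job.
        state=$(squeue -h -j "$jobid" -o "%T" 2>/dev/null)
        case "$state" in
            PENDING|RUNNING|COMPLETING|SUSPENDED|STOPPED)
                echo "job $jobid: $state"
                sleep "$interval"
                ;;
            *)
                break
                ;;
        esac
    done
    echo "job $jobid is no longer active"
}

# Usage on a cluster (job number taken from the sbatch output):
#   wait_for_job 18973
```

Once the loop ends, sacct -j JOBID can be used to check how the job finished. A long polling interval keeps the load on the scheduler low.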
An example of submitting a simple job to Slurm and inspecting it with scontrol show job:
sbatch -p test --wrap="sleep 2m"
Submitted batch job 18973

scontrol show job 18973
JobId=18973 JobName=wrap
UserId=fchami(26458) GroupId=users(26030) MCS_label=N/A
Priority=1 Nice=0 Account=jasmin QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:08 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2020-05-20T14:10:28 EligibleTime=2020-05-20T14:10:28
AccrueTime=2020-05-20T14:10:28
StartTime=2020-05-20T14:10:32 EndTime=2020-05-20T15:10:32 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-05-20T14:10:32
Partition=test AllocNode:Sid=sci2-test:18286
ReqNodeList=(null) ExcNodeList=(null)
NodeList=host147
BatchHost=host147
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=128890M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=128890M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/users/fchami
StdErr=/home/users/fchami/slurm-18973.out
StdIn=/dev/null
StdOut=/home/users/fchami/slurm-18973.out
Power=

sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
18963 wrap par-single jasmin 1 COMPLETED 0:0
18964 wrap short-ser+ jasmin 1 COMPLETED 0:0
18965 wrap par-single jasmin 1 COMPLETED 0:0
18966 wrap short-ser+ jasmin 1 COMPLETED 0:0
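
To pick out past jobs that did not finish successfully, the sacct listing can be filtered. The sketch below assumes sacct is run with -X (one line per job allocation), -n (no header) and -P (fields separated by "|"), with a --format list in which State is the third field. The function name list_unsuccessful is hypothetical.

```shell
#!/bin/sh
# Read "JobID|JobName|State|ExitCode" lines on stdin and print the jobs
# that have ended in any state other than COMPLETED.
# Assumption: input produced by
#   sacct -X -n -P --format=JobID,JobName,State,ExitCode
list_unsuccessful() {
    awk -F'|' '$3 != "COMPLETED" && $3 != "RUNNING" && $3 != "PENDING" {
        print $1, $3, $4
    }'
}

# Usage on a cluster:
#   sacct -X -n -P --format=JobID,JobName,State,ExitCode | list_unsuccessful
```

For the four jobs in the listing above, all of which are COMPLETED, this would print nothing.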