Slurm - In depth

Notice

This guide assumes you have read the Using Slurm and Spack documentation and have set up your environments accordingly.

Resource Requests

Slurm provides several options for allocating CPUs/threads. Below are some scenarios and the appropriate allocations.

CPU request scenarios

Single-threaded process requires 1 thread on any node: --ntasks=1 --cpus-per-task=1
Multi-threaded process requires 16 threads on any node: --ntasks=1 --cpus-per-task=16
Multi-threaded process requires 16 threads on a node with exclusivity (use only if absolutely necessary): --ntasks=1 --cpus-per-task=16 --exclusive
16 concurrent but isolated tasks, each task with 1 thread (no multithreading): --ntasks=16 --nodes=1
16 concurrent but isolated tasks, each task with 2 threads (multithreading only within each single task): --ntasks=16 --nodes=1 --cpus-per-task=2

The options used in these scenarios have the following implications:

cpus-per-task: Number of threads for each task. Supports multithreading and intratask communication.
ntasks: Number of isolated tasks. A task cannot communicate with any other task, including tasks in the same job submission.
nodes: Number of nodes across which to split tasks. Should only be > 1 if utilizing MPI or job steps.

Unless you are using MPI, Slurm "job steps", or isolated concurrently running tasks, --ntasks should remain at 1 and --nodes should not be used. Use the --cpus-per-task option for thread allocations. For resource allocations with MPI, see: MPI with Slurm.

Memory request scenarios

If you have run the program before on SCU resources, you can view the job's memory statistics. From there, you can find the job's maximum used memory and request slightly more than that for the next iteration of the job. See: Requesting appropriate amounts of resources
If you have never run the program before, refer to the program's documentation for recommended memory requirements. If none exist, start with 8GB and increase if necessary.

Requesting appropriate amounts of resources

Slurm allocates resources based on the listed hardware requirements in your srun command or sbatch script. It is imperative that users neither exaggerate nor underestimate their resource requirements. Slurm provides tools and resources to help users understand their jobs' resource requirements. See Consequences of requesting inappropriate resources.

Memory

RAM is the SCU's most consumed resource. Therefore, the SCU recommends that your jobs request no more than 4GB or 20% above your previous jobs' maximum used memory (whichever is higher). This command can be used to compare your completed jobs' memory allocation against their maximum memory use over the past 7 days (modify the date field as desired):

Allocated RAM vs Maximum used RAM

sacct -S $(date -d "-7 days" +%D) --state=CD -o "user,JobID,JobName,ReqMem,MaxRSS,state,exit"

e.g. If the maximum memory used by the previous iteration of your job was 12GB, you should request a maximum of 16GB of RAM (12GB + 4GB > 12GB x 1.2). To query currently running jobs, use sstat.
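
The sizing rule above can be written as a small shell helper. This is only a sketch: the function name recommend_mem_mb is hypothetical, values are in megabytes, and 4GB is assumed to mean 4096MB.

```shell
# Hypothetical helper: suggest a memory request (in MB) from a previous
# job's maximum used memory (in MB), following the rule
# "4GB or 20% over the previous maximum, whichever is higher".
# Assumption: 4GB is treated as 4096MB.
recommend_mem_mb() {
    plus_fixed=$(( $1 + 4096 ))
    # Integer ceiling of used * 1.2
    plus_pct=$(( ($1 * 12 + 9) / 10 ))
    if [ "$plus_fixed" -gt "$plus_pct" ]; then
        echo "$plus_fixed"
    else
        echo "$plus_pct"
    fi
}

recommend_mem_mb 12288   # previous maximum of 12GB, prints 16384
```

The result could then be used in a request such as --mem=16384M.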

Compute

For non-parallelizable or non-multithreading programs, it is best to keep the thread count (--cpus-per-task) to 1. See: Amdahl's Law.

For parallelizable or multithread-supporting programs, many provide options that specify the number of threads to utilize. If the option exists, specify the same number of threads in your srun or sbatch script that you list in your program command.

For programs that do not provide this capability out of the box, or that require additional code to do so, provide a fair estimate of the number of threads needed (it is best to specify threads in powers of 2). If you cannot provide a fair estimate of the needed compute, you can either run the program in an interactive srun session while monitoring compute usage, or review the program's CPU usage after its completion.
If a program uses all threads on a machine by default, you can allocate any number of threads in your srun or sbatch script. Slurm isolates the threads of each job, so your program will be limited to the number of threads you list in your sbatch script.
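
One way to keep the Slurm request and the program's thread option in step is to read the thread count from the environment rather than typing it twice. The following is a minimal sketch of an sbatch script, assuming a hypothetical program my_program with a hypothetical --threads option; SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given, and OMP_NUM_THREADS only matters for OpenMP programs.

```shell
#!/bin/bash
#SBATCH --job-name=threads_example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=8G

# Use the thread count Slurm allocated; fall back to 1 so the script
# also behaves sensibly when run outside of a Slurm allocation.
threads="${SLURM_CPUS_PER_TASK:-1}"

# OpenMP programs read their thread count from this variable.
export OMP_NUM_THREADS="$threads"

echo "Running with $threads thread(s)"

# Hypothetical program and option, shown for illustration only:
# my_program --threads "$threads" input.dat
```

With this pattern, changing --cpus-per-task is the only edit needed to change the thread count everywhere.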

Allocated CPU time vs Used CPU time

sacct -S $(date -d "-7 days" +%D) --state=CD -o "user,JobID,JobName,cputime,avecpu,ncpus,state,exit"

In the output of the above command, if cputime (column 4) is orders of magnitude greater than avecpu (column 5), then some allocated CPUs were idle and you should decrease your CPU allocation accordingly. Column 6 is the number of threads requested.

e.g. This single-threaded program was run with 16 Slurm-allocated threads. The CPUtime was ~11 minutes while the AveCPU was 28 seconds. This is an example of an overallocation of CPU threads.

   rahmed 511232           sbatch   00:10:56                    16  COMPLETED      0:0 
          511232.batch      batch   00:10:56   00:00:28         16  COMPLETED      0:0

When allocated with 1 thread rather than 16, the CPUtime decreased to 41 seconds, while the AveCPU remained the same at 28 seconds.

   rahmed 511211           sbatch   00:00:41                     1  COMPLETED      0:0 
          511211.batch      batch   00:00:41   00:00:28          1  COMPLETED      0:0
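
The comparison between cputime and avecpu can be reduced to a single rough percentage. The helpers below are a sketch with hypothetical names (to_seconds, cpu_use_pct); they assume both fields are printed as [DD-]HH:MM:SS without fractional seconds and that cputime is not zero.

```shell
# Convert a Slurm time string ([DD-]HH:MM:SS) to seconds.
# Assumption: no fractional seconds in the field.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# Percentage of the allocated CPU time (cputime) covered by avecpu,
# rounded down to a whole number.
cpu_use_pct() {
    alloc=$(to_seconds "$1")
    used=$(to_seconds "$2")
    echo $(( used * 100 / alloc ))
}

cpu_use_pct 00:10:56 00:00:28   # 16-thread run above, prints 4
cpu_use_pct 00:00:41 00:00:28   # 1-thread run above, prints 68
```

A value in the single digits, as in the 16-thread run, suggests that most of the allocated threads sat idle.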

Consequences of requesting inappropriate resources

Slurm allocates and isolates your requested resources. Therefore, if resources are either overestimated or underestimated, it can harm the timeliness of your workflows and those of others.

Unused resources are resources that could have been used to run other jobs, including yours. If your job requests 100GB of RAM but only uses 12GB, the unused 88GB will be unavailable to all other jobs in the queue, slowing down the cluster's churn rate and causing your jobs and others' to lag behind.
Unused resources are counted towards your FairShare allocation, lowering your priority for your future jobs. See FairShare.
Underestimated resources may cause Slurm to kill your jobs, so it is important to neither underestimate nor greatly overestimate the resources needed. Slurm will only kill jobs that exceed their memory requests. Threads are individually allocated and isolated, so your jobs cannot exceed their allocated thread count.

MPI with Slurm

When utilizing MPI, resource requests in Slurm should be allocated differently. Slurm works well with MPI and other parallel environments and, as such, srun can be called directly to run MPI programs (see sample script below).

Note: Code and programs must be compiled with explicit support for MPI to utilize MPI capabilities. Also, this does not apply to Relion MPI calculations. For SCU documentation on Relion, see: Relion

CPU request scenarios - MPI

8 MPI processes that are unconcerned with node placement: --ntasks=8
8 MPI processes to be split amongst 4 nodes: --ntasks=8 --ntasks-per-node=2 OR --ntasks=8 --nodes=4 (Both options request 8 tasks across 4 nodes; only --ntasks-per-node=2 guarantees exactly 2 tasks on each node)
8 MPI processes to be split amongst 8 nodes: --ntasks=8 --ntasks-per-node=1 OR --ntasks=8 --nodes=8
8 MPI processes to be split amongst 4 nodes with 2 allocated threads per process: --ntasks=8 --ntasks-per-node=2 --cpus-per-task=2. (Total allocated threads would be equal to 16, with each node reserving 4 threads split amongst 2 tasks)
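
The thread totals in the last scenario follow from simple multiplication. As a sketch, the arithmetic can be checked in the shell before submitting (the variable names are illustrative only):

```shell
# Values from the last scenario above.
ntasks=8
ntasks_per_node=2
cpus_per_task=2

# Nodes needed when every node hosts ntasks_per_node tasks.
nodes=$(( ntasks / ntasks_per_node ))
# Threads reserved on each node and across the whole job.
threads_per_node=$(( ntasks_per_node * cpus_per_task ))
total_threads=$(( ntasks * cpus_per_task ))

echo "nodes=$nodes threads_per_node=$threads_per_node total_threads=$total_threads"
# prints: nodes=4 threads_per_node=4 total_threads=16
```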

Sample MPI script

In the script below, we request 8 tasks to be split amongst 4 nodes. The script assumes you have set up both your Spack and Slurm environments.

Sample_MPI_sbatch.sh

#!/bin/bash
#SBATCH --job-name=mpi_expl
#SBATCH --ntasks=8
#SBATCH --nodes=4
#SBATCH --mem=64G
#SBATCH --output=mpi1.out
#SBATCH --partition=panda


InputData=/athena/mpi/data
OutputData=/athena/mpi/outputData


spack load openmpi@3.0.0 schedulers=slurm ~cuda

# Copy data to TMPDIR if the program is read or write intensive
rsync -a "$InputData" "$TMPDIR"

# Slurm integrates well with MPI, so srun can launch the MPI processes directly
# across the allocated tasks, threads, and nodes (no separate mpirun call is needed).
srun mpi_prog

# Copy data back to athena
rsync -a "$TMPDIR" "$OutputData"

Job monitoring

In addition to the methods found in the Using Slurm documentation, there are other commands which can be used to query job status, history, accounting data, etc.

Status

The Using Slurm documentation introduces and explains the commands squeue -u <user> and squeue_long -u <user>. In addition to these, Slurm supports several other commands.

To view details of a job submission, this command can be used: scontrol show jobid <job_id>

Job details

scontrol show jobid 595278
JobId=595278 JobName=sbatch
   UserId=rahmed(8992) GroupId=pbtech(1053) MCS_label=N/A
   Priority=4140804828 Nice=0 Account=scu QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:09 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2019-05-16T17:05:07 EligibleTime=2019-05-16T17:05:07
   StartTime=2019-05-16T17:05:08 EndTime=2019-05-23T17:05:08 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=panda AllocNode:Sid=curie:11574
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node140
   BatchHost=node140
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pbtech_mounts/homes024/rahmed/sbatch
   WorkDir=/pbtech_mounts/homes024/rahmed
   StdErr=/pbtech_mounts/homes024/rahmed/slurm-595278.out
   StdIn=/dev/null
   StdOut=/pbtech_mounts/homes024/rahmed/slurm-595278.out
   Power=

The above command displays several useful entries, including the assigned node, the resource configuration that Slurm understood from the sbatch script, the path to the sbatch script, the stdout path, etc.
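
When only one or two of these entries are needed, the Key=Value pairs can be filtered out of the output. The helper below is a sketch with a hypothetical name (job_field); it assumes the value contains no spaces, which holds for fields such as JobState, NodeList, and TRES in the output above.

```shell
# Hypothetical helper: read "scontrol show jobid" output on stdin and
# print the value of a single Key=Value entry.
# Assumption: the value itself contains no spaces.
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '
        $1 == key { print substr($0, length(key) + 2); exit }
    '
}

# Intended use on a Slurm node (the job ID is the one from the example above):
#   scontrol show jobid 595278 | job_field JobState
#   scontrol show jobid 595278 | job_field NodeList
```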

History

To view statistics for your previously run jobs, use sacct. Modify the number of days as desired. The following command displays all jobs run over the past week:

Jobs in the past N days

sacct -S $(date -d "-7 days" +%D) -o "user,JobID,JobName,ReqMem,MaxRSS,NCPUS,start,end,node,state,exit"

To view jobs between certain dates (more sacct options are available; run "man sacct"):

Jobs between certain dates

sacct -S 04/01/19 -E 04/30/19 -o "user,JobID,JobName,ReqMem,MaxRSS,NCPUS,start,end,node,state,exit"
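
Start and end dates for sacct are easy to mistype. As a sketch, the two hypothetical helpers below (days_ago, valid_date) build and check dates in the MM/DD/YY form used above. They assume GNU date, since the -d option is not available in every date implementation.

```shell
# Print the date N days ago in the MM/DD/YY form used with sacct -S/-E.
# Assumption: GNU date.
days_ago() {
    date -d "-$1 days" +%D
}

# Succeed only if the argument is a real calendar date according to GNU date.
valid_date() {
    date -d "$1" +%D > /dev/null 2>&1
}

valid_date 04/30/19 && echo "04/30/19 is a valid date"
valid_date 02/30/19 || echo "02/30/19 is not a valid date"

# Intended use on a Slurm node:
#   sacct -S "$(days_ago 30)" -o "user,JobID,JobName,state,exit"
```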

For CPU and Memory statistics, see Requesting appropriate amounts of resources

FairShare

Slurm's "FairShare" algorithm regulates cluster scheduling and prioritization to ensure each lab is able to utilize cluster resources fairly and equitably. The two main factors considered in Slurm's calculation of a lab's fairshare are the lab's cluster usage and its number of priority shares.

Priority shares

A large factor in Slurm's FairShare calculation is a lab's number of priority shares. Slurm uses priority shares to identify each lab's expected share of compute on the cluster. A lab's number of priority shares is equal to its total amount of leased Athena storage x 100 (10T of scratch + 10T of store = 2000 priority shares). This number is divided by the total number of priority shares, the result being the lab's expected share of the cluster. Slurm then prioritizes that lab's jobs until its share of the cluster has been utilized.

e.g. A lab has procured 50T of storage on athena, which translates to 5000 priority shares. As of when this document was written, Slurm would grant this lab priority up to its allocated ~3% of the cluster's use. The lab's jobs would be prioritized at first, but would be deprioritized as jobs run, until the lab has consumed its allocated 3 percent effective usage.
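
The share arithmetic can be sketched in a few lines of shell. The function names are hypothetical, and the cluster-wide total of 174200 shares is an assumption chosen only for illustration; the real total changes as labs lease storage.

```shell
# Priority shares from leased Athena storage (in TB): storage x 100.
raw_shares() {
    echo $(( $1 * 100 ))
}

# Expected share of the cluster: a lab's shares divided by the total.
norm_shares() {
    awk -v lab="$1" -v total="$2" 'BEGIN { printf "%.6f\n", lab / total }'
}

raw_shares 20                          # 10T scratch + 10T store, prints 2000
# 174200 is a hypothetical cluster-wide total, for illustration only.
norm_shares "$(raw_shares 50)" 174200
```

With that assumed total, a lab with 50T of storage would have an expected share of just under 0.03, in line with the ~3% in the example.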

Cluster usage

Cluster usage inversely impacts your fairshare and thus your job priority. As your usage increases, your fairshare and priority decrease. Slurm tracks cluster usage based on the requested Trackable Resources (TRES). TRES currently records requested CPU and RAM allocations and multiplies them by the runtime of the job. The product is the job's resource allocation, which is applied towards the lab's allocated percentage of cluster use.

FairShare score

Fairshare is the resulting score after Slurm calculates the aforementioned factors. Fairshare scores are the deciding factor when it comes to job prioritization and range from 0 to 1, with 0 being the lowest priority and 1 being the highest. Below are fairshare scores and their implications.

A fairshare score of 1: This lab has not run any jobs. Its jobs will be prioritized over all others.
A fairshare score 0.5 < n < 1: This lab has not utilized its expected share of cluster usage. Its jobs will also be prioritized over all jobs with a lower fairshare.
A fairshare score 0.0 < n < 0.5: This lab has overutilized the cluster according to its expected share of cluster usage. Its jobs will be deprioritized.
A fairshare score of 0: This lab has greatly overutilized its share of the cluster. Its jobs will be deprioritized and will run only when all other prioritized jobs have begun to run.
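
The ranges above can be turned into a quick lookup. This is a sketch with a hypothetical function name and hypothetical labels; the list does not cover a score of exactly 0.5, so treating it as "on-target" is an assumption.

```shell
# Hypothetical helper: map a fairshare score to the ranges listed above.
# Assumption: a score of exactly 0.5 is labelled "on-target".
fairshare_status() {
    awk -v f="$1" 'BEGIN {
        if (f >= 1)         print "unused"
        else if (f > 0.5)   print "under-utilized"
        else if (f == 0.5)  print "on-target"
        else if (f > 0)     print "over-utilized"
        else                print "exhausted"
    }'
}

fairshare_status 0.999998   # prints under-utilized
fairshare_status 0.072870   # prints over-utilized
```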

FairShare query

The sshare command can be run from curie or any Slurm node to display fairshare statistics for all labs.

sshare

sshare
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare 
root                                          1.000000   694362950      1.000000   0.500000 
 abc                                  2000    0.011481          28      0.000000   0.999998 
 accardi_lab                          3000    0.017221    45166247      0.065072   0.072870 
 aksaylab                              100    0.000574           0      0.000000   1.000000 
 angsd_class                           300    0.001722       19770      0.000028   0.988597 
...

In the output, we see the two factors that are considered in fairshare calculation, cluster usage and priority shares.

Reviewing the output by column, we can analyze the values of a lab's shares and usage as well as how slurm takes them into account when calculating fairshare.

Account: Name of lab in Slurm's database.
User: When using sshare -a, slurm displays the fairshare calculation for all users in all labs. This can be used for more thorough accounting.
RawShares: The amount of shares a lab has been allotted. This is reached by calculating total athena storage * 100 = RawShares
NormShares: This is the percentage of a lab's RawShares to the total shares for the entire cluster. This number is used by slurm to allocate n percentage of cluster usage to said lab.
RawUsage: The amount of compute usage a lab has run or has requested in their jobs. sshare -l will display a column which lists exact CPU and RAM usages.
EffectvUsage: The percentage of actual cluster use by the lab. This is compared with the lab's NormShares to determine the lab's FairShare score in the next column. If EffectvUsage is greater than the lab's NormShares, the fairshare will always be < 0.5. The reverse is also true. See FairShare Score
FairShare: The lab's resulting FairShare score. See FairShare Score.
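
The EffectvUsage versus NormShares comparison can be applied to the whole table at once. The filter below is a sketch with a hypothetical name (over_share_accounts); it assumes the default column layout shown above and counts columns from the right, because the User column is empty on account rows.

```shell
# Read sshare output on stdin and print each account whose EffectvUsage
# exceeds its NormShares, followed by its FairShare score.
# Assumption: default sshare layout, where the last column is FairShare,
# the one before it is EffectvUsage, and NormShares is three columns
# before the last.
over_share_accounts() {
    awk 'NF >= 5 && $NF ~ /^[0-9.]+$/ && $(NF-1) + 0 > $(NF-3) + 0 {
        print $1, $NF
    }'
}

# Intended use on a Slurm node:
#   sshare | over_share_accounts
```

For the sample output above, this would list only accardi_lab, whose EffectvUsage (0.065072) is larger than its NormShares (0.017221).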

Halflife

Slurm FairShare calculations have a halflife of 7 days. So if a lab exhausts its allocated usage and its resulting FairShare is 0, its FairShare should reset to 0.5 if 7 days pass with no usage.
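
As a rough illustration of what a 7-day halflife means for recorded usage, the sketch below halves a usage value for every 7 idle days. The function name is hypothetical, and this shows only generic exponential decay; it is not Slurm's own FairShare calculation.

```shell
# Hypothetical helper: usage remaining after a number of idle days,
# assuming plain exponential decay with the given halflife (default 7 days).
decayed_usage() {
    awk -v usage="$1" -v days="$2" -v halflife="${3:-7}" 'BEGIN {
        printf "%.0f\n", usage * exp(-log(2) * days / halflife)
    }'
}

decayed_usage 1000 7    # one halflife, prints 500
decayed_usage 1000 14   # two halflives, prints 250
```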