Using Slurm

What is Slurm?

Slurm (formerly the Simple Linux Utility for Resource Management) is a modern, open-source job scheduler that is highly scalable and customizable; it currently runs on the majority of the TOP500 supercomputers. Job schedulers enable large numbers of users to share large computational resources fairly and efficiently.

Cluster prerequisites

Before being able to take advantage of our computational resources, you must first set up your environment. This is pretty straightforward, but there are a few steps:

SSH access setup

You need SSH keys set up to access cluster resources. If you haven't done this already, please set up your SSH keys.

Make sure your SSH keys are configured for access to the Slurm submit host, curie.pbtech.

SCU clusters and job partitions

Available SCU HPC resources

The SCU uses Slurm to manage the following resources:

General purpose cluster:

The panda cluster (72 nodes): 70 CPU-only nodes intended for general use

PI-specific clusters:

The Edison cluster (9 GPU nodes): 5 k40m and 4 k80 nodes reserved for the H. Weinstein lab

All jobs, except those submitted to the Edison cluster, should be submitted via our Slurm submission node: curie.pbtech. Jobs for the Edison cluster should be submitted from its own submission node, edison-mgmt.pbtech.

Note: Unless you perform cryoEM analysis, or otherwise have specific PI-granted privileges, you will only be able to submit jobs to the panda cluster.

Please see About SCU for more information about our HPC infrastructure.

Slurm partitions - Greenberg Cluster

Slurm groups nodes into sets referred to as 'partitions'. The above resources belong to one or more Slurm partitions, with each partition possessing its own unique job submission rules. Some nodes belong to multiple partitions because this affords the SCU the configurational flexibility needed to ensure fair allocation of managed resources.

Greenberg

Panda cluster partitions:

panda: 70 CPU-only nodes, 7-day runtime limit

Edison cluster:

edison: 9 GPU nodes, 2-day runtime limit
edison_k40m: 5 GPU (k40m) nodes, 2-day runtime limit
edison_k80: 4 GPU (k80) nodes, 2-day runtime limit

Slurm commands can only be run on the slurm submission host, curie.pbtech. (Greenberg)

The above will be updated as needed. To see an up-to-date description of all available partitions, use the command sinfo on curie. For a description of each node's number of CPU cores, memory (in MB), runtime limit, and partition, use this command:

sinfo -N -o "%25N %5c %10m %15l %25R"

Or if you just want to see a description of the nodes in a given partition:

sinfo -N -o "%25N %5c %10m %15l %25R" -p panda # for the partition panda
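When the node listing gets long, it can help to condense it into a per-partition count. The following is a minimal sketch, not an SCU-provided tool: the function name count_nodes_by_partition is hypothetical, and it assumes two-column input of the kind printed by sinfo -N -h -o "%N %R" (node name, then partition name; -h suppresses the header line).

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or the SCU environment):
# reads "node partition" pairs on stdin and prints the number of
# distinct nodes in each partition, sorted by partition name.
count_nodes_by_partition() {
    awk 'NF >= 2 && !seen[$1 FS $2]++ { count[$2]++ }
         END { for (p in count) print p, count[p] }' | LC_ALL=C sort
}

# Intended use on a submit host (guarded so the sketch is harmless elsewhere):
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -h -o "%N %R" | count_nodes_by_partition
fi
```

A node that belongs to several partitions is counted once in each of them, which matches how the partition lists above are written.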

Slurm partitions - BRB Cluster

BRB

SCU cluster partitions:

scu-cpu: 22 CPU nodes, 7-day runtime limit
scu-gpu: 6 GPU nodes, 2-day runtime limit

CryoEM partitions:

cryo-cpu: 14 CPU-only nodes, 7-day runtime limit
cryo-gpu: 6 GPU nodes, 2-day runtime limit
cryo-gpu-v100: 2 GPU, 2-day runtime limit
cryo-gpu-p100: 3 GPU, 2-day runtime limit

PI-specific cluster partitions:

accardi_gpu: 4 GPU nodes, 2-day runtime limit
accardi_cpu: 1 CPU node, 7-day runtime limit
boudker_gpu: 2 GPU nodes, 7-day runtime limit
boudker_gpu-p100: 3 GPU nodes, 7-day runtime limit
boudker_cpu: 2 CPU nodes, 7-day runtime limit
sackler_cpu: 1 CPU node, 7-day runtime limit
sackler_gpu: 1 GPU node, 7-day runtime limit
hwlab-rocky_gpu: 12 GPU nodes, 7-day runtime limit
eliezer-gpu: 1 GPU node, 7-day runtime limit

Other specific cluster partitions:

scu-res: 1 GPU, 7-day runtime limit

The above will be updated as needed. To see an up-to-date description of all available partitions, use the command sinfo on scu-login02. For a description of each node's number of CPU cores, memory (in MB), runtime limit, and partition, use this command:

sinfo -N -o "%25N %5c %10m %15l %25R"

Interactive Session

srun -n1 --pty --partition=scu-cpu --mem=8G bash -i

To request a specific number of GPUs, add a --gres option to your srun or sbatch command.

The example below requests 1 GPU; you can request up to 4 GPUs on a single node:

--gres=gpu:1
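To show where the --gres option sits in a full command, here is a small sketch. The helper gpu_session_cmd is hypothetical: it only prints an srun command line modelled on the interactive example above (it does not run it), and it enforces the 1 to 4 GPU range stated above. The scu-gpu partition and the 8G memory request are taken from this page; adjust them to your own partition and needs.

```shell
#!/bin/bash
# Hypothetical helper: prints (does not run) an srun command line for an
# interactive session with a given number of GPUs on one node.
# Arguments: partition name, GPU count (1-4, per the limit stated above).
gpu_session_cmd() {
    local partition="$1" gpus="$2"
    case "$gpus" in
        1|2|3|4) ;;
        *) echo "GPU count must be between 1 and 4" >&2; return 1 ;;
    esac
    echo "srun -n1 --pty --partition=${partition} --gres=gpu:${gpus} --mem=8G bash -i"
}

gpu_session_cmd scu-gpu 1
```

In a batch script, the same request would be written as a directive line: #SBATCH --gres=gpu:1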

A simple job submission example

To submit computation jobs to the cluster, you need to be logged in to the submit node for the cluster you're using. The following examples assume that you are logged in to the Slurm cluster submit node, curie.

Hello, slurm!

Let's start with a simple submission script. Copy the following into a new file:

hello_slurm.sh

#! /bin/bash -l

#SBATCH --partition=panda   # cluster-specific 
#SBATCH --nodes=1
#SBATCH --ntasks=1 
#SBATCH --job-name=hello_slurm
#SBATCH --time=00:02:00   # HH:MM:SS
#SBATCH --mem=1G   # memory requested, units available: K,M,G,T
#SBATCH --output hello_slurm-%j.out
#SBATCH --error hello_slurm-%j.err

source ~/.bashrc

echo "Starting at: $(date)" >> hello_slurm_output.txt
sleep 30
echo "This is job #: $SLURM_JOB_ID" >> hello_slurm_output.txt
echo "Running on node: $(hostname)" >> hello_slurm_output.txt
echo "Running on cluster: $SLURM_CLUSTER_NAME" >> hello_slurm_output.txt
echo "This job was assigned the temporary (local) directory: $TMPDIR" >> hello_slurm_output.txt

exit

This file is a pretty simple Slurm submission script; let's break it down:

#! /bin/bash -l:

The first line is just the standard shebang line, with the -l indicating that we want a bash login shell. This should be included in every submission script, unless you are an advanced user and have a specific reason not to include this (e.g. you want to use a different shell than bash).

Next, there are several #SBATCH lines. These lines describe the resource allocation we wish to request:

--partition=panda:

Cluster resources (such as specific nodes, CPUs, memory, GPUs, etc) can be assigned to groups, called partitions. Additionally, the same resources (e.g. a specific node) may belong to multiple cluster partitions. Finally, partitions may be assigned different job priority weights, so that jobs in one partition move through the job queue more quickly than jobs in another partition.
Every job submission script should request a specific partition; otherwise, the default is used. To see which partitions are available on your cluster, see the partition lists above, or execute the command: sinfo

--nodes=1:

The number of nodes requested.

--ntasks=1:

The number of concurrently running tasks. Tasks can be thought of as processes. For this simple serial job, we only need 1 concurrently-running task/process. Also, by default, each task is allocated a single CPU core. For additional information on parallel/multicore environments, click here.

--cpus-per-task=1:

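The number of CPU cores allocated to each task. This option is not set in hello_slurm.sh, so the default of 1 core per task applies.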

--job-name=hello_slurm:

The job's name--this will appear in the job queue, and is publicly-viewable.

--time=00:02:00:

The maximum time the job is allowed to run, formatted as hours:minutes:seconds. If no time limit is specified, then the default of 10 minutes is used.

--mem=1G:

The requested amount of memory for the job; the following units can be used: kibibytes (K), mebibytes (M), gibibytes (G), and tebibytes (T). If a specific amount of memory is not requested, then the partition-dependent default is used.
Note: memory is always described in base 2, hence the use of kibibyte (1,024 bytes) instead of kilobyte (1,000 bytes). The curious reader can find additional information here (very optional reading).

Additional job allocation (#SBATCH) options are described in later sections, and can also be found here.
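As an illustration of how --cpus-per-task is typically used, here is a minimal sketch of the body of a multithreaded job script. It assumes the script also contains a directive such as #SBATCH --cpus-per-task=4 and that your program honours OMP_NUM_THREADS (the standard OpenMP setting); thread_count is a hypothetical helper name, not something provided by Slurm.

```shell
#!/bin/bash
# Sketch for a multithreaded job; assumes the script also contains a line
# such as:  #SBATCH --cpus-per-task=4

# Hypothetical helper: reports how many CPU cores each task was given.
# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested;
# fall back to 1 (the default described above) when it is unset.
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# OMP_NUM_THREADS is the standard OpenMP setting; programs that do not use
# OpenMP usually take the thread count through their own option instead.
export OMP_NUM_THREADS="$(thread_count)"
echo "Using ${OMP_NUM_THREADS} thread(s)"
```

Reading the count from the environment, rather than hard-coding it, keeps the program's thread count in step with whatever the #SBATCH line requests.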

After all of the #SBATCH statements, you place the code you actually want to run. In this example, we simply print a few statements that describe the job, and append this output to the file hello_slurm_output.txt. Some of this information comes from standard commands (e.g. date and hostname), whereas the rest comes from environment variables set by Slurm.

Note: The first line of this code (after the #SBATCH statements) is source ~/.bashrc; this is because Slurm does NOT source your .bashrc, and so you MUST include this line in any submission script that you want to use the environment created by your .bashrc.

Specifically, 3 Slurm environment variables are used in this example: $SLURM_JOB_ID, $SLURM_CLUSTER_NAME, $TMPDIR; these variables store the job's Slurm-assigned ID #, the name of the cluster used, and the local temporary directory your job has access to, respectively. We'll introduce additional Slurm environment variables later; for additional information, click here.

It's very important for your job performance, as well as cluster stability, that intense I/O (e.g. creating many temporary files) is performed in this temporary directory; this is most relevant when using in-house code. Click here for more information on best practices.
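The pattern described above (do the heavy I/O in the local temporary directory, then copy only the final result back) can be sketched as follows. This is an illustration under stated assumptions, not SCU-provided code: run_in_scratch is a hypothetical helper, the loop is a stand-in for real work, and the /tmp fallback only matters when trying the sketch outside a Slurm job.

```shell
#!/bin/bash
# Hypothetical helper: does its I/O-heavy work in node-local scratch and
# copies only the final result to the directory given as its argument.
run_in_scratch() {
    local results_dir="$1"
    # Inside a job, TMPDIR points at the local scratch directory described
    # above; the /tmp fallback is only for testing outside Slurm.
    local workdir
    workdir="$(mktemp -d "${TMPDIR:-/tmp}/work.XXXXXX")" || return 1

    (
        cd "$workdir" || exit 1
        # Stand-in for real work that creates many temporary files.
        for i in 1 2 3; do echo "chunk $i" > "part_$i.tmp"; done
        cat part_*.tmp > result.txt
    ) || { rm -rf "$workdir"; return 1; }

    # Copy the final result back to shared storage, then clean up.
    mkdir -p "$results_dir" && cp "$workdir/result.txt" "$results_dir/"
    local status=$?
    rm -rf "$workdir"
    return "$status"
}

# In a job script, SLURM_SUBMIT_DIR is the directory sbatch was run from:
#   run_in_scratch "$SLURM_SUBMIT_DIR/results"
```

Only result.txt ever touches shared storage; the intermediate part_*.tmp files are created and deleted on the node's local disk.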

Submit the example job

To submit the job described in hello_slurm.sh to the submission queue, execute the following command:

Submit hello_slurm.sh to the scheduler

sbatch hello_slurm.sh
 
> Submitted batch job $JOB_ID_NUMBER

Where $JOB_ID_NUMBER is the ID number assigned to your job. This job will create the file hello_slurm_output.txt that should contain output similar to this:

hello_slurm_output.txt

Starting at: Fri Feb 9 13:12:13 EST 2018
This is job #: 6602
Running on node: node115.panda.pbtech
Running on cluster: panda
This job was assigned the temporary (local) directory: /scratchLocal/CWID_6602
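If you want to use the job ID in later commands, you can capture it at submission time. The sketch below parses the "Submitted batch job" message shown above; extract_job_id is a hypothetical helper name. The submission itself is guarded so that nothing runs unless sbatch and hello_slurm.sh are both present.

```shell
#!/bin/bash
# Hypothetical helper: pulls the numeric job ID out of sbatch's
# "Submitted batch job <ID>" message, read from stdin.
# Returns non-zero if no such message was seen.
extract_job_id() {
    awk '/^Submitted batch job [0-9]+/ { print $4; found = 1 }
         END { exit !found }'
}

# Intended use on the submit host (guarded so nothing runs elsewhere):
if command -v sbatch >/dev/null 2>&1 && [ -f hello_slurm.sh ]; then
    job_id="$(sbatch hello_slurm.sh | extract_job_id)" && echo "Job ID: $job_id"
fi
```

The stored ID can then be passed to commands that take a job ID, such as scancel.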

Monitoring job status

To check the status of our job, use the following command:

squeue -u user_ID

user_ID is your CWID. The output is pretty self-explanatory; specifically, the command displays your jobs':

ID number
Job name
Slurm partition
Run state (e.g. running, pending, etc.)
Run time
Number of nodes used
Specific nodes used

For additional information, use the command:

man squeue

This command will display the squeue manual pages (i.e. manpages). Press q to leave the squeue manpage.

If you set up your Slurm environment as described above, you can also use a different command with more advanced output:

squeue_long -u user_ID

Compared to the default squeue command, this has improved formatting and displays the following additional information:

Time limit
GPUs allocated
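When you have many jobs in the queue, a per-state count is often easier to read than the full listing. The following is a minimal sketch: count_job_states is a hypothetical helper, and it assumes one state per line as printed by squeue -h -o "%T" (-h suppresses the header, %T prints the job state in its long form). The guarded call also assumes that $USER matches your CWID.

```shell
#!/bin/bash
# Hypothetical helper: reads one job state per line (as printed by
# squeue -h -o "%T") and prints a count for each state.
count_job_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Intended use on the submit host (guarded so the sketch is harmless elsewhere):
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%T" | count_job_states
fi
```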

Interactive sessions

What is an interactive session?

Sometimes, instead of submitting a submission script to the scheduler and waiting for the output, we would like to use allocated resources in an interactive manner (much like you might work locally on your desktop machine). This is especially useful if you wish to use graphical user interfaces (GUIs), debug and test code, or run a very simple workflow (ad hoc calculations). In these cases, we can launch an interactive session.

Note: as with any resource allocation, the larger the resource request (e.g. large numbers of CPU cores, large amounts of RAM, etc.), the longer you may have to wait for the resources to be allocated (this is entirely dependent on how much in demand the resources are at that moment).

Starting an interactive session

Execute the following command:

Starting an interactive session

srun -n1 --pty --partition=panda --mem=8G bash -i
 
> your_CWID@nodeXXX ~ $

Depending on the resource request, your interactive session might start right away, or it may take a very long time to start (for the above command, the interactive session should almost always start right away).

Here's a breakdown of the command:

Note: many of these options are the same as those used in the hello_slurm.sh submission script above.

-n1:

The number of concurrently running tasks. Tasks can be thought of as processes. For this simple serial job, we only need 1 concurrently-running task/process. Also, by default, each task is allocated a single CPU core. For additional information on parallel/multicore environments, click here.

--pty:

Runs task zero in pseudo terminal mode (this is what grants you a terminal)

--partition=panda:

Cluster resources (such as specific nodes, CPUs, memory, GPUs, etc.) can be assigned to groups, called partitions. Additionally, the same resources (e.g. a specific node) may belong to multiple cluster partitions. Finally, partitions may be assigned different job priority weights, so that jobs in one partition move through the job queue more quickly than jobs in another partition.
Every job submission should request a specific partition; otherwise, the default is used. To see which partitions are available on your cluster, see the partition lists above, or execute the command: sinfo

--mem=8G:

The requested amount of memory for the job; the following units can be used: kibibytes (K), mebibytes (M), gibibytes (G), and tebibytes (T). If a specific amount of memory is not requested, then the default is used (8,000M).
Note: memory is always described in base 2, hence the use of kibibyte (1,024 bytes) instead of kilobyte (1,000 bytes). The curious reader can find additional information here (very optional reading).

bash -i:

This tells Slurm to run bash as your shell; the -i tells bash to start as an interactive shell (with prompts). Other shells can also be used (e.g. csh, tcsh), but be sure to include the -i; otherwise an error will occur.

Once you execute this command, eventually the prompt will change (it will reference a specific compute node, as opposed to the submission node); this indicates that you have been allocated the resources and they are accessible via the terminal you are working in. In this example, we have only requested 1 task (which is allocated 1 CPU core by default) and 8 GiB of memory.

If you execute a command or script that attempts to use more resources than were allocated to this interactive session, the command will fail.
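To check what a session was actually given before starting heavy work, you can inspect the environment variables Slurm sets inside a job. The sketch below uses a hypothetical helper, show_allocation; it assumes that SLURM_CPUS_ON_NODE and SLURM_MEM_PER_NODE (the latter in megabytes, and typically set only when --mem was given) are present in your session, and prints "unset" for anything that is missing.

```shell
#!/bin/bash
# Hypothetical helper: prints the resources Slurm recorded for the
# current session, or "unset" for values that are not available
# (for example, when run outside a Slurm job).
show_allocation() {
    echo "Job ID:       ${SLURM_JOB_ID:-unset}"
    echo "CPUs on node: ${SLURM_CPUS_ON_NODE:-unset}"
    echo "Memory (MB):  ${SLURM_MEM_PER_NODE:-unset}"
}

show_allocation
```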

Ending an interactive session

To end the interactive session from within the session, just type the following command:

Ending an interactive session

exit

You can also use the following to end an interactive session from outside of the session:

Ending an interactive session from outside of the session

scancel $interactive_session_job_number

Where $interactive_session_job_number is your session's job-ID #.
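If you do not remember the session's job ID, you can look it up with squeue. The sketch below is an assumption-laden illustration: interactive_job_ids is a hypothetical helper, it expects "jobid name" lines as printed by squeue -h -o "%i %j", and it relies on srun naming the job after the program it launches (so the interactive sessions started above would appear as bash). Check the listed IDs before passing any of them to scancel.

```shell
#!/bin/bash
# Hypothetical helper: reads "jobid name" lines on stdin and prints the
# IDs of jobs whose name matches the first argument (default: bash).
interactive_job_ids() {
    local name="${1:-bash}"
    awk -v name="$name" '$2 == name { print $1 }'
}

# Intended use on the submit host (guarded; assumes $USER is your CWID):
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%i %j" | interactive_job_ids
fi
```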

It is extremely important that you end interactive sessions when you are finished with your work. DO NOT leave them open overnight, over the weekend, etc. This is because interactive sessions use precious computational resources; if you leave them open, they hold the allocated resources hostage, preventing other users from using them. This slows job movement in the queue, impeding everyone's work. If interactive sessions are habitually abused, more restrictive policies will be enacted (which no one wants).

Even a tiny allocation, like the one used in this example, can be problematic: some jobs need one or more entire compute nodes, and an interactive session running on a node prevents such a job from using that node.

In short, be a good HPC citizen!

Interactive session with Graphical User Interface (GUI)

If you need to use a GUI on SCU computational resources, you will need to use an interactive session and X11 forwarding. You'll have to log on to an SCU login node and the cluster submission node with X11 forwarding enabled, and then request an interactive session with X11 forwarding as well. The following commands will connect to the SCU login node pascal, then the Slurm submit node curie, and then request an interactive GUI session:

Request an interactive session with X11 forwarding

ssh -X pascal
 
ssh -X curie
 
srun --x11 -n1 --pty --partition=panda --mem=8G bash -i

To test the session, try the following command:

xclock

This opens a window with a graphical clock.

To end the session, close any open windows, and execute the following command in your terminal:

exit
