
What is Slurm?

Slurm (previously Simple Linux Utility for Resource Management) is a modern, open-source job scheduler that is highly scalable and customizable; currently, Slurm is implemented on the majority of the TOP500 supercomputers. Job schedulers enable large numbers of users to fairly and efficiently share large computational resources.

Cluster prerequisites

Before being able to take advantage of our computational resources, you must first set up your environment. This is pretty straightforward, but there are a few steps:

SSH access setup

You need to have your SSH keys set up to access cluster resources. If you haven't done this already, please set up your SSH keys.

Environment setup

While still logged in to an SCU login node, run the following:

Code Block
languagebash
titleSetting up the slurm environment
cat - >> ~/.bashrc <<'EOF'

if [ -n "$SLURM_JOB_ID" ]
then
        source /etc/slurm/slurm_source/slurm_source.sh
fi

alias squeue_long='squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.11l %.6b  %.6D %R"'

EOF
source ~/.bashrc

This command appends a snippet to your ~/.bashrc that sources a Slurm environment script (if resources have been requested and allocated), and also provides an alias for a more informative squeue command.

SCU clusters and job partitions

Available SCU HPC resources

The SCU uses Slurm to manage the following resources:

General purpose cluster:

  • The panda cluster (35 nodes): CPU-only cluster intended for general use

CryoEM cluster:

  • The cryoEM cluster (18 nodes): 15 CPU-only nodes, 3 GPU (P100) nodes. Available only for analysis of cryoEM data

PI-specific clusters:

  • The Edison cluster (9 GPU nodes):  5 k40m and 4 k80 nodes reserved for the H. Weinstein lab
  • node178: GPU (p100) node reserved for the Accardi and Huang labs
  • node179: GPU (p100) node reserved for the Boudker lab
  • node180: GPU (p100) node reserved for the Blanchard lab
  • cantley-node0[1-2] (2 nodes): GPU (V100) nodes reserved for the Cantley lab 

All jobs, except those submitted to the Edison cluster, should be submitted via our Slurm submission node: curie.pbtech. Jobs submitted to the Edison cluster should be submitted from its submission node, edison-mgmt.pbtech.

Warning

Note: Unless you perform cryoEM analysis, or otherwise have specific PI-granted privileges, you will only be able to submit jobs to the panda cluster.

Please see About SCU for more information about our HPC infrastructure.

Slurm partitions

Slurm groups nodes into sets referred to as 'partitions'. The above resources belong to one or more Slurm partitions, with each partition possessing its own unique job submission rules. Some nodes belong to multiple partitions because this affords the SCU the configurational flexibility needed to ensure fair allocation of managed resources.

Panda cluster partitions:

  • panda: 35 CPU-only nodes, 7-day runtime limit

CryoEM cluster:

  • cryo-cpu: 15 CPU-only nodes, 2-day runtime limit
  • cryo-gpu: 3 GPU (P100) nodes, 2-day runtime limit

Edison cluster:

  • edison: 9 GPU nodes, 2-day runtime limit
  • edison_k40m: 5 GPU (k40m) nodes, 2-day runtime limit
  • edison_k80: 4 GPU (k80) nodes, 2-day runtime limit

PI-specific cluster partitions:

...

  • boudker_reserve: node179, GPU (P100) node, 7-day runtime limit

...

  • blanchard_reserve: node180, GPU (P100) node, 7-day runtime limit

...

...

Slurm partitions - BRB Cluster

SCU cluster partitions:

  • scu-cpu: 22 CPU nodes, 7-day runtime limit
  • scu-gpu: 6 GPU nodes, 2-day runtime limit

CryoEM partitions:

  • cryo-cpu: 14 CPU-only nodes, 7-day runtime limit
  • cryo-gpu: 6 GPU nodes, 2-day runtime limit
  • cryo-gpu-v100: 2 GPU nodes, 2-day runtime limit
  • cryo-gpu-p100: 3 GPU nodes, 2-day runtime limit

PI-specific cluster partitions:

  • accardi_gpu: 4 GPU nodes, 2-day runtime limit
  • accardi_cpu: 1 CPU node, 7-day runtime limit
  • boudker_gpu: 2 GPU nodes, 7-day runtime limit
  • boudker_gpu-p100: 3 GPU nodes, 7-day runtime limit
  • boudker_cpu: 2 CPU nodes, 7-day runtime limit
  • sackler_cpu: 1 CPU node, 7-day runtime limit
  • sackler_gpu: 1 GPU node, 7-day runtime limit
  • hwlab-rocky_gpu: 12 GPU nodes, 7-day runtime limit
  • eliezer-gpu: 1 GPU node, 7-day runtime limit

Other specific cluster partitions:

  • scu-res: 1 GPU node, 7-day runtime limit


Of course, the above will be updated as needed; regardless, to see an up-to-date description of all available partitions, use the command sinfo on curie or scu-login02. For a description of each node's CPU core count, memory (in MB), runtime limit, and partitions, use this command:

Code Block
sinfo -N -o "%25N %5c %10m %15l %25R"

Or if you just want to see a description of the nodes in a given partition:

...

Interactive Session

Code Block
srun -n1 --pty --partition=scu-cpu --mem=8G bash -i


To request a specific number of GPUs, add your request to your srun/sbatch command.

Below is an example of requesting 1 GPU; you can request up to 4 GPUs on a single node:

Code Block
--gres=gpu:1
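For example, this flag can be combined with the earlier interactive-session command to request an interactive GPU session. This is a sketch; the scu-gpu partition name is taken from the partition list above, so substitute a GPU partition you actually have access to:

```shell
# Request an interactive session with 1 task, 8 GiB of memory, and 1 GPU
# on the scu-gpu partition (substitute a GPU partition you can access)
srun -n1 --pty --partition=scu-gpu --gres=gpu:1 --mem=8G bash -i
```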


...

A simple job submission example

...

Code Block
languagebash
titlehello_slurm.sh
#! /bin/bash -l

#SBATCH --partition=scu-cpu   # cluster-specific 
#SBATCH --nodes=1
 
#SBATCH --ntasks=1 
#SBATCH --job-name=hello_slurm
#SBATCH --time=00:02:00   # HH/MM/SS
#SBATCH --mem=1G   # memory requested; units available: K,M,G,T
#SBATCH --output hello_slurm-%j.out
#SBATCH --error hello_slurm-%j.err

source ~/.bashrc

echo "Starting at:" `date` >> hello_slurm_output.txt
sleep 30
echo "This is job #:" $SLURM_JOB_ID >> hello_slurm_output.txt
echo "Running on node:" `hostname` >> hello_slurm_output.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> hello_slurm_output.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> hello_slurm_output.txt

exit
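Once saved, the script above can be submitted and monitored with the standard Slurm commands (the job ID printed by sbatch will differ on your system):

```shell
# Submit the batch script; sbatch prints the assigned job ID
sbatch hello_slurm.sh

# Check the status of your own jobs in the queue
squeue -u $USER
```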

...

Next, there are several #SBATCH lines. These lines describe the resource allocation we wish to request:

--partition=scu-cpu:

Cluster resources (such as specific nodes, CPUs, memory, GPUs, etc) can be assigned to groups, called partitions. Additionally, the same resources (e.g. a specific node) may belong to multiple cluster partitions. Finally, partitions may be assigned different job priority weights, so that jobs in one partition move through the job queue more quickly than jobs in another partition.

Every job submission script must request a specific partition--otherwise, the default is used. To see what partitions are available on your cluster, click here, or execute the command: sinfo

...

The number of concurrently running tasks. Tasks can be thought of as processes; this is explained in more detail in Advanced Job Submissions. For this simple serial job, we only need 1 concurrently-running task/process. Also, by default, each task is allocated a single CPU core. For additional information on parallel/multicore environments, click here.

--cpus-per-task=1:

The number of CPU cores allocated to each task.

--job-name=test_job:

The job's name--this will appear in the job queue, and is publicly viewable.

...

Once you execute this command, eventually the prompt will change (it will reference a specific compute node, as opposed to the submission node); this indicates that you have been allocated the resources and they are accessible via the terminal you are working in. In this example, we have only requested 1 task (which is allocated 1 CPU core by default) and 8 GiB of memory.
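Once inside the session, you can confirm what was allocated by printing the standard Slurm environment variables. This is a quick sanity check only; which variables are set can vary with Slurm version and site configuration:

```shell
# Print the job ID, CPU count, and memory for the current allocation
# (these variables are set by Slurm inside an allocated session)
echo "Job ID:        $SLURM_JOB_ID"
echo "CPUs on node:  $SLURM_CPUS_ON_NODE"
echo "Memory (MB):   $SLURM_MEM_PER_NODE"
```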

Warning

If you execute a command or script that attempts to use more resources than were allocated to this interactive session, the command will fail.

...

To end the interactive session from within the session, just type the following command:

Code Block
languagebash
titleEnding an interactive session
exit

You can also use the following to end an interactive session from outside of the session:

Code Block
languagebash
titleEnding an interactive session from outside of the session
scancel $interactive_session_job_number

Where $interactive_session_job_number is your session's job-ID #.
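If you do not know the session's job ID, you can look it up first with the standard Slurm commands (the job ID below is a placeholder):

```shell
# List your own jobs; the first column is the job ID
squeue -u $USER

# Cancel the interactive session by its job ID (placeholder value)
scancel 1234567
```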

Warning

It is extremely important that you end interactive sessions when you are finished with your work. DO NOT leave them open overnight, over the weekend, etc. This is because interactive sessions use precious computational resources--if you leave them open, they hold the allocated resources hostage, preventing other users from being able to use them. This will result in slower job movement in the queue, impeding everyone's work. If interactive sessions are habitually abused, more restrictive policies will be enacted (which no one wants).

Even a tiny allocation, like the one used in this example, can be problematic: some jobs need one or more entire compute nodes, and if an interactive session is running on a given node, it will prevent the more demanding job from running there.

In short, be a good HPC citizen!

Interactive session with Graphical User Interface (GUI)

If you need to use a GUI on SCU computational resources, you will need to use an interactive session and X11 forwarding. You'll have to log on to an SCU login node and the cluster submission node with X11 forwarding enabled, and then request an interactive session with X11 forwarding as well. The following commands will connect to the SCU login node pascal, then the Slurm submit node curie, and then request an interactive GUI session:

Code Block
languagebash
titleRequest an interactive session with X11 forwarding
ssh -X pascal
 
ssh -X curie
 
srun --x11 -n1 --pty --partition=panda --mem=8G bash -i

To test the session, try the following command:

Code Block
languagebash
xclock

This opens a window with a graphical clock.

To end the session, close any open windows, and execute the following command in your terminal:

Code Block
languagebash
exit


Alternatively, you can connect directly to the login node scu-login01 (replace cwid with your own ID) and request the session on the scu-cpu partition:

Code Block
languagebash
titleRequest an interactive session with X11 forwarding
ssh -X cwid@scu-login01.med.cornell.edu
 
srun --x11 -n1 --pty --partition=scu-cpu --mem=8G bash -i


...

Further reading


Related pages: content labeled "slurm", "hpc", or "compute".