
What is Slurm?

Slurm (previously the Simple Linux Utility for Resource Management) is a modern, open-source job scheduler that is highly scalable and customizable; Slurm is currently deployed on the majority of the TOP500 supercomputers. Job schedulers enable large numbers of users to share large computational resources fairly and efficiently.

Please see About SCU for more information about our HPC infrastructure.

Cluster prerequisites

Before you can take advantage of our computational resources, you must first set up your environment. This is straightforward, but there are a few steps:

SSH access setup

You need SSH keys set up to access cluster resources. If you haven't done this already, please set up your SSH keys.
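
If you have never generated a key pair, the sketch below shows the general shape of the process; the exact registration steps (and whether key generation is needed at all) are covered in the SSH key setup instructions. Replace cwid with your own CWID.

Code Block
languagebash
titleGenerating an SSH key pair (sketch)
# Generate an ed25519 key pair; accept the default location and choose a passphrase
ssh-keygen -t ed25519

# Once your public key has been registered, connect to a login node
ssh cwid@scu-login01.med.cornell.edu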

Environment setup

While still logged in to an SCU login node, run the following:

Code Block
languagebash
titleSetting up the slurm environment
cat - >> ~/.bashrc <<'EOF'

# When running inside a Slurm job allocation, source the Slurm environment script
if [ -n "$SLURM_JOB_ID" ]
then
        source /etc/slurm/slurm_source/slurm_source.sh
fi

# squeue_long: a more informative squeue (job ID, partition, name, user, state,
# elapsed time, time limit, gres, node count, and reason/nodelist)
alias squeue_long='squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.11l %.6b  %.6D %R"'

EOF
source ~/.bashrc

This snippet sources a Slurm environment script whenever a job allocation is active (i.e., whenever SLURM_JOB_ID is set), and defines squeue_long, an alias for a more informative version of the squeue command.
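
For example, once your shell has loaded the new alias, you can list your own jobs in the long format:

Code Block
languagebash
squeue_long -u $USER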

SCU clusters and job partitions

Available SCU HPC resources

The SCU uses Slurm to manage the following resources:

General purpose cluster:

  • The panda cluster (67 nodes): CPU-only cluster intended for general use

CryoEM cluster: 

  • The cryoEM cluster (18 nodes): 15 CPU-only nodes, 3 GPU (P100) nodes. Available only for analysis of cryoEM data

PI-specific clusters:

  • The Edison cluster (9 GPU nodes):  5 k40m and 4 k80 nodes reserved for the H. Weinstein lab
  • node178: GPU (p100) node reserved for the Accardi and Huang labs
  • node179: GPU (p100) node reserved for the Boudker lab
  • node180: GPU (p100) node reserved for the Blanchard lab
  • cantley-node0[1-2] (2 nodes): GPU (V100) nodes reserved for the Cantley lab 

All jobs, except those submitted to the Edison cluster, should be submitted via our Slurm submission node, curie.pbtech. Jobs for the Edison cluster should be submitted from its own submission node, edison-mgmt.pbtech.
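
For example (cwid is a placeholder for your own CWID):

Code Block
languagebash
titleConnecting to a submission node
# panda, cryoEM, and most PI-specific partitions
ssh cwid@curie.pbtech

# Edison cluster only
ssh cwid@edison-mgmt.pbtech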


SCU cluster partitions

Slurm groups nodes into sets referred to as 'partitions'. The above resources belong to one or more Slurm partitions, with each partition possessing its own job submission rules. Some nodes belong to multiple partitions; this affords the SCU the configuration flexibility needed to ensure fair allocation of managed resources.

Panda cluster partitions:

  • panda: 52 CPU-only nodes, 7-day runtime limit

BRB cluster partitions:

  • scu-cpu: 22 CPU nodes, 7-day runtime limit
  • scu-gpu: 6 GPU nodes, 2-day runtime limit

CryoEM cluster partitions:

  • cryo-cpu: 14 CPU-only nodes, 7-day runtime limit
  • cryo-gpu: 6 GPU nodes, 2-day runtime limit
  • cryo-gpu-v100: 2 GPU nodes, 2-day runtime limit
  • cryo-gpu-p100: 3 GPU nodes, 2-day runtime limit

Edison cluster partitions:

  • edison: 9 GPU nodes (5 k40m and 4 k80), 2-day runtime limit

PI-specific cluster partitions:

  • accardi_gpu: 4 GPU nodes, 2-day runtime limit
  • accardi_cpu: 1 CPU node, 7-day runtime limit
  • boudker_gpu: 2 GPU nodes, 7-day runtime limit
  • boudker_gpu-p100: 3 GPU nodes, 7-day runtime limit
  • boudker_cpu: 2 CPU nodes, 7-day runtime limit
  • sackler_cpu: 1 CPU node, 7-day runtime limit
  • sackler_gpu: 1 GPU node, 7-day runtime limit
  • hwlab-rocky_gpu: 12 GPU nodes, 7-day runtime limit
  • eliezer-gpu: 1 GPU node, 7-day runtime limit

Other specific cluster partitions:

  • scu-res: 1 GPU node, 7-day runtime limit


Of course, the above will be updated as needed; regardless, to see an up-to-date description of all available partitions, use the command sinfo on scu-login02. For a description of each node's number of CPU cores, memory (in MB), runtime limit, and partitions, use this command:

Code Block
sinfo -N -o "%25N %5c %10m %15l %25R"

Or if you just want to see a description of the nodes in a given partition:

Code Block
sinfo -N -o "%25N %5c %10m %15l %25R" -p panda # for the partition panda

Interactive Session

To request an interactive session on a compute node (here in the scu-cpu partition, with 8 GB of memory):

Code Block
srun -n1 --pty --partition=scu-cpu --mem=8G bash -i


To request a specific number of GPUs, add a --gres request to your srun/sbatch invocation.

Below is an example of requesting 1 GPU; you can request up to 4 GPUs on a single node:

Code Block
--gres=gpu:1
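
Putting it together, a sketch of an interactive session requesting one GPU (using the scu-gpu partition as an example):

Code Block
languagebash
srun -n1 --pty --partition=scu-gpu --gres=gpu:1 --mem=8G bash -i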


...

A simple job submission example

...

Code Block
languagebash
titlehello_slurm.sh
#!/bin/bash -l

#SBATCH --partition=scu-cpu   # cluster-specific 
#SBATCH --nodes=1  
#SBATCH --ntasks=1 
#SBATCH --job-name=hello_slurm
#SBATCH --time=00:02:00   # HH:MM:SS
#SBATCH --mem=1G   # memory requested, units available: K,M,G,T
#SBATCH --output hello_slurm-%j.out
#SBATCH --error hello_slurm-%j.err

source ~/.bashrc

echo "Starting at:" `date` >> hello_slurm_output.txt
sleep 30
echo "This is job #:" $SLURM_JOB_ID >> hello_slurm_output.txt
echo "Running on node:" `hostname` >> hello_slurm_output.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> hello_slurm_output.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> hello_slurm_output.txt

exit
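
To submit the script and check on it (assuming it is saved as hello_slurm.sh):

Code Block
languagebash
sbatch hello_slurm.sh          # submit the job to the scheduler
squeue -u $USER                # monitor its state in the queue
cat hello_slurm_output.txt     # inspect the output once it completes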

...

Next, there are several #SBATCH lines. These lines describe the resource allocation we wish to request:

--partition=scu-cpu:

Cluster resources (such as specific nodes, CPUs, memory, GPUs, etc) can be assigned to groups, called partitions. Additionally, the same resources (e.g. a specific node) may belong to multiple cluster partitions. Finally, partitions may be assigned different job priority weights, so that jobs in one partition move through the job queue more quickly than jobs in another partition.

Every job submission script should request a specific partition; otherwise, the default is used. To see what partitions are available on your cluster, consult the partition list above or execute the command: sinfo
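
For a condensed, one-line-per-partition summary, you can also use sinfo's -s flag:

Code Block
languagebash
sinfo -s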

...

Code Block
languagebash
titleRequest an interactive session with X11 forwarding
ssh -X pascal
 
ssh -X cwid@scu-login01.med.cornell.edu
 
srun --x11 -n1 --pty --partition=scu-cpu --mem=8G bash -i

To test the session, try the following command:

...
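
A simple check, assuming the xclock utility is installed on the node, is to launch a small X11 application and confirm that its window appears on your local display:

Code Block
languagebash
xclock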