Command | Syntax | Description |
---|---|---|
`sbatch` | `sbatch <script.sh>` | Submit a batch script to SLURM for processing. |
`squeue` | `squeue -u <username>` | Show information about your job(s) in the queue. When run without any flags, the command shows every job currently in the queue. |
`srun` | `srun [options] <command>` | Run jobs interactively on the cluster. |
`scancel` | `scancel <job_id>` | End or cancel a queued job. |
`sacct` | `sacct -j <job_id>` | Show information about current and previous jobs. |
`sinfo` | `sinfo [options]` | Check the status of the cluster and its partitions, including availability, time limits, and the number of nodes. |
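As a quick illustration, a typical job lifecycle using these commands might look like the sketch below (the script name and job ID are placeholders):

```bash
sbatch my_job.sh        # submit a batch script; SLURM replies with a job ID
squeue -u $USER         # list your jobs that are queued or running
scancel 123456          # cancel the job with ID 123456 if it is no longer needed
sacct -j 123456         # show accounting information for that job
```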
Requesting Resources
General SciComp partitions for all users on the BRB cluster:
scu-cpu: 22 CPU nodes, 7-day runtime limit
scu-gpu: 6 GPU nodes, 2-day runtime limit
Syntax: `sinfo` or `sinfo [optional flags]`
Output: The output below lists every partition on the BRB cluster.
```
PARTITION        AVAIL  TIMELIMIT   NODES  STATE  NODELIST
scu-cpu*         up     7-00:00:00     18  mix    scu-node[023,032-033,035-047,049,079]
scu-cpu*         up     7-00:00:00      4  alloc  scu-node[020-022,034]
scu-gpu          up     2-00:00:00      4  mix    scu-node[050-051,081-082]
cryo-cpu         up     7-00:00:00      1  idle   scu-node065
cryo-cpu         up     7-00:00:00      1  idle   scu-node002
cryo-cpu         up     7-00:00:00      2  mix    scu-node[001,064]
cryo-cpu         up     7-00:00:00     10  idle   scu-node[063,066-074]
cryo-gpu         up     2-00:00:00      6  mix    scu-node[003-008]
cryo-gpu-v100    up     2-00:00:00      3  mix    scu-node[054-056]
cryo-gpu-p100    up     2-00:00:00      1  mix    scu-node060
cryo-gpu-p100    up     2-00:00:00      2  idle   scu-node[061-062]
boudker-cpu      up     7-00:00:00      1  alloc  scu-node010
boudker-cpu      up     7-00:00:00      1  idle   scu-node009
boudker-gpu      up     7-00:00:00      2  mix    scu-node[011-012]
boudker-gpu-p100 up     7-00:00:00      3  idle   scu-node[057-059]
accardi-gpu      up     2-00:00:00      1  mix    scu-node015
accardi-gpu      up     2-00:00:00      2  alloc  scu-node[013-014]
accardi-gpu2     up     2-00:00:00      1  idle   scu-node016
accardi-cpu      up     7-00:00:00      1  idle   scu-node017
sackler-gpu      up     7-00:00:00      1  mix    scu-node018
sackler-cpu      up     7-00:00:00      1  mix    scu-node019
hwlab-rocky-cpu  up     7-00:00:00      3  idle   scu-node[052-053,099]
hwlab-rocky-gpu  up     7-00:00:00     12  mix    scu-node[085-096]
scu-res          up     7-00:00:00      1  idle   scu-login03
eliezer-gpu      up     7-00:00:00      1  idle   scu-node097
```
Header | Description |
---|---|
PARTITION | The list of the cluster's partitions. A partition is a set of compute nodes grouped logically. |
AVAIL | The active state of the partition (up or down). |
TIMELIMIT | The maximum job execution time (walltime) allowed on the partition. |
NODES | The total number of nodes per partition. |
STATE | The state of the nodes (e.g., idle, mix, alloc). |
NODELIST | The list of nodes per partition. |
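To narrow the output to a single partition, `sinfo` accepts the `--partition`/`-p` flag; for example:

```bash
sinfo -p scu-gpu    # show availability, time limit, and node states for scu-gpu only
```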
To request a specific number of GPUs, add the request to your srun/sbatch command. Below is an example of requesting 1 GPU; you can request up to 4 GPUs on a single node:
`--gres=gpu:1`
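The same request can also be written as `#SBATCH` directives in a batch script; the sketch below asks for 2 GPUs (the partition name is illustrative, use one you have access to):

```bash
#SBATCH --gres=gpu:2          # request 2 GPUs on one node (up to 4 per node)
#SBATCH --partition=scu-gpu   # example partition; substitute as appropriate
```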
SRUN: Interactive Session
Example:
```bash
srun --gres=gpu:1 --partition=partition_name --time=01:00:00 --mem=8G --cpus-per-task=4 --pty bash
```
Breakdown:
- `--gres=gpu:1`: Allocates 1 GPU to your job.
- `--partition=partition_name`: Specifies the partition to run the job in. Replace `partition_name` with the appropriate partition, like `scu-gpu`.
- `--time=01:00:00`: Requests 1 hour of runtime. Adjust the time as needed.
- `--mem=8G`: Requests 8 GB of memory.
- `--cpus-per-task=4`: Requests 4 CPU cores.
- `--pty bash`: Launches an interactive bash shell after resources are allocated.
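Once the shell starts, you are working directly on a compute node. A quick sanity check (assuming the GPU node has the NVIDIA tools installed) might look like:

```bash
hostname              # confirm you are on a compute node, not the login node
nvidia-smi            # list the GPU(s) allocated to this session
echo $SLURM_JOB_ID    # print the job ID SLURM assigned to the session
exit                  # leave the shell and release the allocation
```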
SBATCH: submission script
```bash
#!/bin/bash
#SBATCH --job-name=gpu_job          # Job name
#SBATCH --output=output_file.txt    # Output file
#SBATCH --partition=gpu_partition   # Partition to run the job (e.g., scu-gpu)
#SBATCH --gres=gpu:1                # Request 1 GPU
#SBATCH --time=01:00:00             # Max runtime (1 hour)
#SBATCH --mem=8G                    # Memory requested
#SBATCH --cpus-per-task=4           # Number of CPU cores per task

# Your commands here
srun python my_script.py
```
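To use the script, save it to a file and submit it with `sbatch`; the file name below is only a placeholder:

```bash
sbatch gpu_job.sh        # submit the script; SLURM prints the assigned job ID
squeue -u $USER          # check the job's state while it is queued or running
cat output_file.txt      # inspect the output file once the job has finished
```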