Basic SLURM Commands
Cluster nodes are managed through Slurm. Running jobs on the login nodes, or accessing compute nodes directly via ssh, is strictly forbidden.
Commands | Syntax | Description
---|---|---
`sbatch` | `sbatch <script_name>` | Submit a batch script to SLURM for processing.
`squeue` | `squeue -u <cwid>` | Show information about your job(s) in the queue. When run without the `-u` flag, the command lists every job currently in the queue.
`srun` | `srun [options] --pty bash` | Run jobs interactively on the cluster.
`scancel` | `scancel <jobid>` | End or cancel a queued job.
`sacct` | `sacct` | Show information about current and previous jobs.
`sinfo` | `sinfo` | Check the status of the cluster and partitions, including availability, time limits, and the number of nodes.
Requesting Resources
General partitions available to all users on the BRB cluster:
scu-cpu: 22 cpu nodes, 7-day runtime limit
scu-gpu: 4 gpu nodes, 2-day runtime limit
Syntax: `sinfo` or `sinfo --[optional flags]`

sinfo

Output: The output below lists all of the partitions on the BRB cluster.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
scu-cpu* up 7-00:00:00 18 mix scu-node[023,032-033,035-047,049,079]
scu-cpu* up 7-00:00:00 4 alloc scu-node[020-022,034]
scu-gpu up 2-00:00:00 4 mix scu-node[050-051,081-082]
cryo-cpu up 7-00:00:00 1 idle scu-node065
cryo-cpu up 7-00:00:00 1 idle scu-node002
cryo-cpu up 7-00:00:00 2 mix scu-node[001,064]
cryo-cpu up 7-00:00:00 10 idle scu-node[063,066-074]
cryo-gpu up 2-00:00:00 6 mix scu-node[003-008]
cryo-gpu-v100 up 2-00:00:00 3 mix scu-node[054-056]
cryo-gpu-p100 up 2-00:00:00 1 mix scu-node060
cryo-gpu-p100 up 2-00:00:00 2 idle scu-node[061-062]
boudker-cpu up 7-00:00:00 1 alloc scu-node010
boudker-cpu up 7-00:00:00 1 idle scu-node009
boudker-gpu up 7-00:00:00 2 mix scu-node[011-012]
boudker-gpu-p100 up 7-00:00:00 3 idle scu-node[057-059]
accardi-gpu up 2-00:00:00 1 mix scu-node015
accardi-gpu up 2-00:00:00 2 alloc scu-node[013-014]
accardi-gpu2 up 2-00:00:00 1 idle scu-node016
accardi-cpu up 7-00:00:00 1 idle scu-node017
sackler-gpu up 7-00:00:00 1 mix scu-node018
sackler-cpu up 7-00:00:00 1 mix scu-node019
hwlab-rocky-cpu up 7-00:00:00 3 idle scu-node[052-053,099]
hwlab-rocky-gpu up 7-00:00:00 12 mix scu-node[085-096]
scu-res up 7-00:00:00 1 idle scu-login03
eliezer-gpu up 7-00:00:00 1 idle scu-node097
Header | Description
---|---
PARTITION | The list of the cluster's partitions. A partition is a set of compute nodes grouped logically.
AVAIL | The active state of the partition (up, down, idle).
TIMELIMIT | The maximum job execution time allowed on the partition.
NODES | The total number of nodes per partition.
STATE | The current state of the nodes (e.g. alloc, idle, mix).
NODELIST | The list of nodes per partition.
To request a specific number of GPUs, add a `--gres` request to your srun/sbatch command. Below is an example of requesting 1 GPU; you can request up to 4 GPUs on a single node.

`--gres=gpu:1`
SRUN: Interactive Session
Example:
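Assembled from the flags described in the breakdown below, an interactive session request looks like this:

```bash
srun --gres=gpu:1 --partition=scu-cpu --mem=8G --cpus-per-task=4 --pty bash
```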
Breakdown:

`--gres=gpu:1`: Allocates 1 GPU to your job.
`--partition=scu-cpu`: Specifies the partition to run the job in.
`--mem=8G`: Requests 8 GB of memory.
`--cpus-per-task=4`: Requests 4 CPU cores.
`--pty bash`: Launches an interactive bash shell after resources are allocated.
SBATCH: Used to submit a job script for later execution.
The shebang (`#!`) at the beginning of a script tells the shell which interpreter to use for executing the commands. In a Slurm script, it specifies that the script should be run using the Bash shell.

In Slurm, lines beginning with `#SBATCH` are treated as directives. To comment out a Slurm directive, you need to add a second `#` at the beginning. For example, `#SBATCH` is a directive, while `##SBATCH` indicates a comment.
The `#SBATCH` lines in the script below contain directives that are recommended as defaults for all job submissions.
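A minimal sketch of such a script, assuming a single-task job on the scu-cpu partition; the job name, resource amounts, time limit, and output file below are placeholder values, not site-mandated defaults:

```bash
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in the queue (placeholder)
#SBATCH --partition=scu-cpu      # partition to submit to
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks=1               # number of tasks
#SBATCH --cpus-per-task=4        # CPU cores per task
#SBATCH --mem=8G                 # memory per node
#SBATCH --time=1-00:00:00        # walltime limit (D-HH:MM:SS)
#SBATCH --output=slurm-%j.out    # output file; %j expands to the job ID

# Your commands go below the #SBATCH directives
echo "Running on $(hostname)"
```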
Additional flags that can be added to the sbatch script above:
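The exact flags depend on the job, but the following standard sbatch options are commonly added (the email address and node name are placeholders):

```bash
#SBATCH --gres=gpu:1             # request 1 GPU (up to 4 on a single node)
#SBATCH --mail-type=END,FAIL     # events that trigger an email notification
#SBATCH --mail-user=<your email> # address for notifications (placeholder)
#SBATCH --nodelist=scu-node023   # request a specific node (placeholder)
```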
Submit the batch script
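For example, assuming the script above was saved as job.sh (the filename is a placeholder):

```bash
sbatch job.sh
```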
After the job has been submitted, you should get an output similar to the one below, but with a different jobid.
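The output takes the following form; the job ID shown here is illustrative.

```
Submitted batch job 481172
```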
You can use the command below to check the progress of your submitted job in the queue.
Syntax: `squeue -u <your cwid>`

Output:
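The values below are illustrative; your job ID, name, user, and node will differ.

```
JOBID   PARTITION     NAME     USER  ST  TIME  NODES  NODELIST(REASON)
481172    scu-cpu   my_job   <cwid>   R  0:12      1  scu-node023
```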
Scontrol

The scontrol command can be used to display detailed information about a specific job.

Syntax: `scontrol show jobid <jobid>`

Output:
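An abbreviated, illustrative excerpt; the real output contains many more fields and your values will differ.

```
JobId=481172 JobName=my_job
   UserId=<cwid>(12345) GroupId=<cwid>(12345)
   JobState=RUNNING Reason=None
   RunTime=00:00:12 TimeLimit=1-00:00:00
   Partition=scu-cpu NodeList=scu-node023
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4
   ...
```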
Terminating Jobs
The scancel command is used to kill or end a job in the queue, whether it is pending or running.
Syntax: `scancel <jobid>` or `skill <jobid>`
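For example, to cancel the job from the earlier output (the job ID is illustrative):

```bash
scancel 481172
```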