Commands

Command   Syntax                        Description
sbatch    sbatch <script-name>          Submit a batch script to SLURM for processing.
squeue    squeue -u <cwid>              Show information about your job(s) in the queue. Without the -u flag, it lists all jobs in the queue, not just yours.
srun      srun <resource-parameters>    Run jobs interactively on the cluster.
scancel   scancel <job-id>              End or cancel a queued or running job.
sacct     sacct                         Show information about current and previous jobs.
sinfo     sinfo                         Check the status of the cluster and its partitions, including availability, time limits, and node counts.
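On its own, sacct prints a basic summary of your jobs. With the stock sacct format options (standard SLURM flags, not specific to this cluster) you can choose the columns, for example:

```shell
# Show ID, name, partition, state, and elapsed time for your recent jobs
sacct --format=JobID,JobName,Partition,State,Elapsed

# Query a single job by ID, including peak memory use
sacct -j <job-id> --format=JobID,State,Elapsed,MaxRSS
```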

Requesting Resources

General Partition for all users on BRB Cluster

  • scu-cpu: 22 cpu nodes, 7-day runtime limit

  • scu-gpu: 4 gpu nodes, 2-day runtime limit

Syntax: sinfo or sinfo --[optional flags]

...

Output: The output below lists all partitions on the BRB cluster.

Code Block
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
scu-cpu*             up 7-00:00:00     18    mix scu-node[023,032-033,035-047,049,079]
scu-cpu*             up 7-00:00:00      4  alloc scu-node[020-022,034]
scu-gpu              up 2-00:00:00      4    mix scu-node[050-051,081-082]
cryo-cpu             up 7-00:00:00      1   idle scu-node065
cryo-cpu             up 7-00:00:00      1   idle scu-node002
cryo-cpu             up 7-00:00:00      2    mix scu-node[001,064]
cryo-cpu             up 7-00:00:00     10   idle scu-node[063,066-074]
cryo-gpu             up 2-00:00:00      6    mix scu-node[003-008]
cryo-gpu-v100        up 2-00:00:00      3    mix scu-node[054-056]
cryo-gpu-p100        up 2-00:00:00      1    mix scu-node060
cryo-gpu-p100        up 2-00:00:00      2   idle scu-node[061-062]
boudker-cpu          up 7-00:00:00      1  alloc scu-node010
boudker-cpu          up 7-00:00:00      1   idle scu-node009
boudker-gpu          up 7-00:00:00      2    mix scu-node[011-012]
boudker-gpu-p100     up 7-00:00:00      3   idle scu-node[057-059]
accardi-gpu          up 2-00:00:00      1    mix scu-node015
accardi-gpu          up 2-00:00:00      2  alloc scu-node[013-014]
accardi-gpu2         up 2-00:00:00      1   idle scu-node016
accardi-cpu          up 7-00:00:00      1   idle scu-node017
sackler-gpu          up 7-00:00:00      1    mix scu-node018
sackler-cpu          up 7-00:00:00      1    mix scu-node019
hwlab-rocky-cpu      up 7-00:00:00      3   idle scu-node[052-053,099]
hwlab-rocky-gpu      up 7-00:00:00     12    mix scu-node[085-096]
scu-res              up 7-00:00:00      1   idle scu-login03
eliezer-gpu          up 7-00:00:00      1   idle scu-node097

Header            Description
PARTITION         The cluster's partitions; a partition is a set of compute nodes grouped logically.
AVAIL             The active state of the partition (e.g., up, down).
TIMELIMIT         The maximum job execution walltime per partition.
NODES             The total number of nodes per partition.
STATE             mix: only part of the node is allocated to one or more jobs; the rest is idle.
                  alloc: all of the node's resources are in use.
                  idle: the node is idle, with none of its resources in use.
NODELIST(REASON)  The list of nodes per partition.

To request a specific number of GPUs, add the request to your srun/sbatch command.

Below is an example requesting 1 GPU; you can request up to 4 GPUs on a single node.

Code Block
--gres=gpu:1

SRUN: Interactive Session

Example:

Code Block
srun --partition=partition_name --time=01:00:00 --gres=gpu:1 --mem=8G --cpus-per-task=4 --pty bash

Breakdown:

  • --gres=gpu:1: Allocates 1 GPU to your job.

  • --partition=partition_name: Specifies the partition to run the job in. Replace partition_name with the appropriate partition, like scu-gpu.

  • --time=01:00:00: Requests 1 hour of runtime. Adjust the time as needed.

  • --mem=8G: Requests 8 GB of memory.

  • --cpus-per-task=4: Requests 4 CPU cores.

  • --pty bash: Launches an interactive bash shell after resources are allocated.
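Once the interactive shell starts, you can confirm what was allocated. SLURM exports the job context into standard environment variables:

```shell
# Inside the interactive session:
echo $SLURM_JOB_ID         # the job's ID
echo $SLURM_CPUS_PER_TASK  # CPU cores allocated to the task
nvidia-smi                 # on a GPU node, lists the GPU(s) assigned to the job
```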

sbatch is used to submit a job script for later execution.

The shebang (#!) at the beginning of a script tells the shell which interpreter to use to execute its commands. In a Slurm script, it specifies that the script should be run with the Bash interpreter:

Code Block
#!/bin/bash

In a Slurm script, lines beginning with #SBATCH are treated as directives to the scheduler. To comment out a directive, add a second # at the beginning: #SBATCH is an active directive, while ##SBATCH is ignored as a plain comment.

The #SBATCH lines in the script below contain directives that are recommended as defaults for all job submissions.

Code Block
#!/bin/bash
#SBATCH --job-name=cpu_job           # Job name
#SBATCH --partition=cpu_partition    # Partition to run the job (e.g., scu-cpu)
#SBATCH --time=01:00:00              # Max runtime (1 hour)
#SBATCH --mem=8G                     # Memory requested
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --output=job_output-%j.out   # Standard output file (%j expands to the job ID)
#SBATCH --error=job_error-%j.err     # Error output file


# Your commands here
srun python my_script.py
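my_script.py above is a placeholder for your own program. For a quick end-to-end check, a minimal sketch (hypothetical script) that reports the resources SLURM granted, falling back to defaults so it also runs outside a job:

```python
import os

def report_allocation(env=None):
    """Summarize the job context that SLURM exports into the environment."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID", "not-in-a-slurm-job")
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    return f"Job {job_id} running with {cpus} CPU core(s)"

if __name__ == "__main__":
    print(report_allocation())
```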

Additional flags to add to an sbatch script

Code Block
# Request 1 GPU
#SBATCH --gres=gpu:1 

# Set email notifications (optional)
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your_email@example.com
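Putting the pieces together, a complete GPU submission script might look like the sketch below; the partition, resource amounts, and script name are placeholders to adjust for your job:

```shell
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=scu-gpu           # GPU partition (see the sinfo output above)
#SBATCH --gres=gpu:1                  # Request 1 GPU
#SBATCH --time=02:00:00               # Max runtime (2 hours)
#SBATCH --mem=16G                     # Memory requested
#SBATCH --cpus-per-task=4             # CPU cores per task
#SBATCH --output=gpu_job-%j.out       # Standard output file
#SBATCH --error=gpu_job-%j.err        # Error output file

srun python my_gpu_script.py          # placeholder for your program
```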

...

Submit the batch script

Code Block
sbatch script.sh

After the job has been submitted, you should see output similar to the example below, with a different job ID.

Code Block
Submitted batch job 10880609

You can use the command below to check the progress of your submitted job in the queue.

syntax: squeue -u <your cwid>

Code Block
squeue -u scicomp

output

Code Block
   JOBID PARTITION     NAME     USER ST    TIME  NODES NODELIST(REASON)
10880609   scu-cpu  cpu_job  scicomp  R   00:32      1 scu-node079
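If you want to post-process squeue output in a script, the columns are whitespace-separated and can be keyed by the header row; a minimal sketch:

```python
def parse_squeue(text):
    """Parse squeue output into a list of dicts keyed by the header row."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]

sample = """JOBID  PARTITION    NAME     USER  ST       TIME  NODES NODELIST(REASON)
10880609   scu-cpu  cpu_job  scicomp  R     00:32      1 scu-node079"""

jobs = parse_squeue(sample)
print(jobs[0]["JOBID"], jobs[0]["ST"])  # 10880609 R
```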

...

Scontrol

syntax: scontrol show job <jobid>

Code Block
scontrol show job 10880609

output

Code Block
JobId=10880609 JobName=bash
   UserId=scicomp GroupId=scicomp MCS_label=N/A
   Priority=87769 Nice=0 Account=scu QOS=cpu-limited
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2024-10-10T15:34:12 EligibleTime=2024-10-10T15:34:12
   AccrueTime=Unknown
   StartTime=2024-10-10T15:34:12 EndTime=2024-10-17T15:34:12 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-10-10T15:34:12 Scheduler=Main
   Partition=scu-cpu AllocNode:Sid=scu-login02:31492
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=scu-node079
   BatchHost=scu-node079
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=20G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/athena/labname/scratch/cwid
   Power=

...

Terminating Jobs

The scancel command kills or cancels a job in the queue, whether it is pending or running.

Syntax: scancel <jobid> or skill <jobid>

Code Block
scancel 219373

Or

Code Block
skill 219373
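scancel also accepts filters (standard SLURM flags), which is handy when you have several jobs queued:

```shell
# Cancel all of your jobs
scancel -u <your cwid>

# Cancel only jobs that are still pending
scancel --state=PENDING -u <your cwid>
```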