Commands

Command   Syntax                        Description
sbatch    sbatch <script-name>          Submit a batch script to SLURM for processing.
squeue    squeue -u <cwid>              Show information about your job(s) in the queue. Without the -u flag, it lists all jobs in the queue, not just yours.
srun      srun <resource-parameters>    Run jobs interactively on the cluster.
scancel   scancel <job-id>              End or cancel a queued or running job.
sacct     sacct                         Show information about current and previous jobs.
sinfo     sinfo                         Check the status of the cluster and its partitions, including availability, time limits, and node counts.
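On its own, sacct prints a basic summary of your jobs. With the stock sacct format options (standard SLURM flags, not specific to this cluster) you can choose the columns, for example:

```shell
# Show ID, name, partition, state, and elapsed time for your recent jobs
sacct --format=JobID,JobName,Partition,State,Elapsed

# Query a single job by ID, including peak memory use
sacct -j <job-id> --format=JobID,State,Elapsed,MaxRSS
```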

Requesting Resources

General Partition for all users on BRB Cluster

  • scu-cpu: 22 cpu nodes, 7-day runtime limit

  • scu-gpu: 4 gpu nodes, 2-day runtime limit

Syntax: sinfo or sinfo --[optional flags]

...

Output: The output below lists all partitions on the BRB cluster.

Code Block
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
scu-cpu*             up 7-00:00:00     18    mix scu-node[023,032-033,035-047,049,079]
scu-cpu*             up 7-00:00:00      4  alloc scu-node[020-022,034]
scu-gpu              up 2-00:00:00      4    mix scu-node[050-051,081-082]
cryo-cpu             up 7-00:00:00      1   idle scu-node065
cryo-cpu             up 7-00:00:00      1   idle scu-node002
cryo-cpu             up 7-00:00:00      2    mix scu-node[001,064]
cryo-cpu             up 7-00:00:00     10   idle scu-node[063,066-074]
cryo-gpu             up 2-00:00:00      6    mix scu-node[003-008]
cryo-gpu-v100        up 2-00:00:00      3    mix scu-node[054-056]
cryo-gpu-p100        up 2-00:00:00      1    mix scu-node060
cryo-gpu-p100        up 2-00:00:00      2   idle scu-node[061-062]
boudker-cpu          up 7-00:00:00      1  alloc scu-node010
boudker-cpu          up 7-00:00:00      1   idle scu-node009
boudker-gpu          up 7-00:00:00      2    mix scu-node[011-012]
boudker-gpu-p100     up 7-00:00:00      3   idle scu-node[057-059]
accardi-gpu          up 2-00:00:00      1    mix scu-node015
accardi-gpu          up 2-00:00:00      2  alloc scu-node[013-014]
accardi-gpu2         up 2-00:00:00      1   idle scu-node016
accardi-cpu          up 7-00:00:00      1   idle scu-node017
sackler-gpu          up 7-00:00:00      1    mix scu-node018
sackler-cpu          up 7-00:00:00      1    mix scu-node019
hwlab-rocky-cpu      up 7-00:00:00      3   idle scu-node[052-053,099]
hwlab-rocky-gpu      up 7-00:00:00     12    mix scu-node[085-096]
scu-res              up 7-00:00:00      1   idle scu-login03
eliezer-gpu          up 7-00:00:00      1   idle scu-node097

Header            Description
PARTITION         The cluster's partitions; a partition is a set of compute nodes grouped logically.
AVAIL             The active state of the partition (e.g., up, down).
TIMELIMIT         The maximum job execution walltime per partition.
NODES             The total number of nodes per partition.
STATE             mix: only part of the node is allocated to one or more jobs; the rest is idle.
                  alloc: all of the node's resources are in use.
                  idle: the node is idle, with none of its resources in use.
NODELIST(REASON)  The list of nodes per partition.

To request a specific number of GPUs, add the request to your srun/sbatch command.

Below is an example requesting 1 GPU; you can request up to 4 GPUs on a single node.

Code Block
--gres=gpu:1

SRUN: Interactive Session

Example:

Code Block
srun --partition=partition_name --time=01:00:00 --gres=gpu:1 --mem=8G --cpus-per-task=4 --pty bash

Breakdown:

  • --gres=gpu:1: Allocates 1 GPU to your job.

  • --partition=partition_name: Specifies the partition to run the job in. Replace partition_name with the appropriate partition, like scu-gpu.

  • --time=01:00:00: Requests 1 hour of runtime. Adjust the time as needed.

  • --mem=8G: Requests 8 GB of memory.

  • --cpus-per-task=4: Requests 4 CPU cores.

  • --pty bash: Launches an interactive bash shell after resources are allocated.
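Once the interactive shell starts, you can confirm what was allocated. SLURM exports the job context into standard environment variables:

```shell
# Inside the interactive session:
echo $SLURM_JOB_ID         # the job's ID
echo $SLURM_CPUS_PER_TASK  # CPU cores allocated to the task
nvidia-smi                 # on a GPU node, lists the GPU(s) assigned to the job
```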

sbatch is used to submit a job script for later execution.

The shebang (#!) at the beginning of a script tells the shell which interpreter to use to execute its commands. In a Slurm script, it specifies that the script should be run with the Bash interpreter:

Code Block
#!/bin/bash

In a Slurm script, lines beginning with #SBATCH are treated as directives to the scheduler. To comment out a directive, add a second # at the beginning: #SBATCH is an active directive, while ##SBATCH is ignored as a plain comment.

The #SBATCH lines in the script below contain directives that are recommended as defaults for all job submissions.

Code Block
#!/bin/bash
#SBATCH --job-name=cpu_job           # Job name
#SBATCH --partition=cpu_partition    # Partition to run the job (e.g., scu-cpu)
#SBATCH --time=01:00:00              # Max runtime (1 hour)
#SBATCH --mem=8G                     # Memory requested
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --output=job_output-%j.out   # Standard output file (%j expands to the job ID)
#SBATCH --error=job_error-%j.err     # Error output file


# Your commands here
srun python my_script.py
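my_script.py above is a placeholder for your own program. For a quick end-to-end check, a minimal sketch (hypothetical script) that reports the resources SLURM granted, falling back to defaults so it also runs outside a job:

```python
import os

def report_allocation(env=None):
    """Summarize the job context that SLURM exports into the environment."""
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID", "not-in-a-slurm-job")
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    return f"Job {job_id} running with {cpus} CPU core(s)"

if __name__ == "__main__":
    print(report_allocation())
```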

Additional flags to add to an sbatch script

Code Block
# Request 1 GPU
#SBATCH --gres=gpu:1 

# Set email notifications (optional)
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your_email@example.com
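Putting the pieces together, a complete GPU submission script might look like the sketch below; the partition, resource amounts, and script name are placeholders to adjust for your job:

```shell
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=scu-gpu           # GPU partition (see the sinfo output above)
#SBATCH --gres=gpu:1                  # Request 1 GPU
#SBATCH --time=02:00:00               # Max runtime (2 hours)
#SBATCH --mem=16G                     # Memory requested
#SBATCH --cpus-per-task=4             # CPU cores per task
#SBATCH --output=gpu_job-%j.out       # Standard output file
#SBATCH --error=gpu_job-%j.err        # Error output file

srun python my_gpu_script.py          # placeholder for your program
```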

...

Submit the batch script

Code Block
sbatch script.sh

After the job has been submitted, you should see output similar to the example below, with a different job ID.

Code Block
Submitted batch job 10880609

You can use the command below to check the progress of your submitted job in the queue.

syntax: squeue -u <your cwid>

Code Block
squeue -u scicomp

output

Code Block
   JOBID PARTITION     NAME     USER ST    TIME  NODES NODELIST(REASON)
10880609   scu-cpu  cpu_job  scicomp  R   00:32      1 scu-node079
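If you want to post-process squeue output in a script, the columns are whitespace-separated and can be keyed by the header row; a minimal sketch:

```python
def parse_squeue(text):
    """Parse squeue output into a list of dicts keyed by the header row."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]

sample = """JOBID  PARTITION    NAME     USER  ST       TIME  NODES NODELIST(REASON)
10880609   scu-cpu  cpu_job  scicomp  R     00:32      1 scu-node079"""

jobs = parse_squeue(sample)
print(jobs[0]["JOBID"], jobs[0]["ST"])  # 10880609 R
```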

...

Scontrol

syntax: scontrol show job <jobid>

Code Block
scontrol show job 10880609

output

Code Block
JobId=10880609 JobName=bash
   UserId=scicomp GroupId=scicomp MCS_label=N/A
   Priority=87769 Nice=0 Account=scu QOS=cpu-limited
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2024-10-10T15:34:12 EligibleTime=2024-10-10T15:34:12
   AccrueTime=Unknown
   StartTime=2024-10-10T15:34:12 EndTime=2024-10-17T15:34:12 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-10-10T15:34:12 Scheduler=Main
   Partition=scu-cpu AllocNode:Sid=scu-login02:31492
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=scu-node079
   BatchHost=scu-node079
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=20G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/athena/labname/scratch/cwid
   Power=

...

Terminating Jobs

The scancel command kills or cancels a job in the queue, whether it is pending or running.

Syntax: scancel <jobid> or skill <jobid>

Code Block
scancel 219373

Or

Code Block
skill 219373
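scancel also accepts filters (standard SLURM flags), which is handy when you have several jobs queued:

```shell
# Cancel all of your jobs
scancel -u <your cwid>

# Cancel only jobs that are still pending
scancel --state=PENDING -u <your cwid>
```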