Commands

Command    Syntax                        Description
sbatch     sbatch <script.sh>            Submit a batch script to SLURM for processing.
squeue     squeue -u <cwid>              Show information about your job(s) in the queue. Without the -u flag, it lists your job(s) along with all other jobs in the queue.
srun       srun <resource-parameters>    Run jobs interactively on the cluster.
scancel    scancel <job-id>              End or cancel a queued or running job.
sacct      sacct                         Show information about current and previous jobs.
sinfo      sinfo                         Show the status of the cluster's partitions, including availability, time limits, and the number of nodes.
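
For example, sacct accepts a --format flag to choose which columns are reported. A minimal sketch; the column selection below is only a suggestion, and <jobid> is a placeholder for your own job ID:

Code Block
# Show your recent jobs with a custom set of columns
sacct --format=JobID,JobName,Partition,State,Elapsed,MaxRSS

# Limit the output to a single job
sacct -j <jobid> --format=JobID,State,Elapsed,ExitCode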

Requesting Resources

General partitions for all users on the BRB cluster

  • scu-cpu: 22 CPU nodes, 7-day runtime limit

  • scu-gpu: 4 GPU nodes, 2-day runtime limit

Syntax: sinfo or sinfo [options]

...

Output: The output below lists all of the partitions on the BRB cluster.

Code Block
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
scu-cpu*             up 7-00:00:00     18    mix scu-node[023,032-033,035-047,049,079]
scu-cpu*             up 7-00:00:00      4  alloc scu-node[020-022,034]
scu-gpu              up 2-00:00:00      4    mix scu-node[050-051,081-082]
cryo-cpu             up 7-00:00:00      1   idle scu-node065
cryo-cpu             up 7-00:00:00      1   idle scu-node002
cryo-cpu             up 7-00:00:00      2    mix scu-node[001,064]
cryo-cpu             up 7-00:00:00     10   idle scu-node[063,066-074]
cryo-gpu             up 2-00:00:00      6    mix scu-node[003-008]
cryo-gpu-v100        up 2-00:00:00      3    mix scu-node[054-056]
cryo-gpu-p100        up 2-00:00:00      1    mix scu-node060
cryo-gpu-p100        up 2-00:00:00      2   idle scu-node[061-062]
boudker-cpu          up 7-00:00:00      1  alloc scu-node010
boudker-cpu          up 7-00:00:00      1   idle scu-node009
boudker-gpu          up 7-00:00:00      2    mix scu-node[011-012]
boudker-gpu-p100     up 7-00:00:00      3   idle scu-node[057-059]
accardi-gpu          up 2-00:00:00      1    mix scu-node015
accardi-gpu          up 2-00:00:00      2  alloc scu-node[013-014]
accardi-gpu2         up 2-00:00:00      1   idle scu-node016
accardi-cpu          up 7-00:00:00      1   idle scu-node017
sackler-gpu          up 7-00:00:00      1    mix scu-node018
sackler-cpu          up 7-00:00:00      1    mix scu-node019
hwlab-rocky-cpu      up 7-00:00:00      3   idle scu-node[052-053,099]
hwlab-rocky-gpu      up 7-00:00:00     12    mix scu-node[085-096]
scu-res              up 7-00:00:00      1   idle scu-login03
eliezer-gpu          up 7-00:00:00      1   idle scu-node097
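
To limit the output to a single partition, sinfo accepts the -p flag. For example (using scu-gpu here; substitute any partition name):

Code Block
# Show only the scu-gpu partition
sinfo -p scu-gpu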

...

To request a GPU, add the flag below to your resource request:

Code Block
--gres=gpu:1

SRUN: Interactive Session

Example:

Code Block
srun --gres=gpu:1 --partition=partition_name --time=01:00:00 --mem=8G --cpus-per-task=4 --pty bash

...

  • --gres=gpu:1: Allocates 1 GPU to your job.

  • --partition=partition_name: Specifies the partition to run the job in. Replace partition_name with the appropriate partition, like scu-gpu.

  • --time=01:00:00: Requests 1 hour of runtime. Adjust the time as needed.

  • --mem=8G: Requests 8 GB of memory.

  • --cpus-per-task=4: Requests 4 CPU cores.

  • --pty bash: Launches an interactive bash shell after resources are allocated.
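
Once the resources are granted and the interactive shell opens on a compute node, you can confirm the allocation before starting work. A minimal sketch (nvidia-smi is only meaningful on GPU partitions):

Code Block
hostname        # confirm you are on a compute node rather than the login node
nvidia-smi      # list the GPU(s) assigned to the session (GPU partitions only)
exit            # leave the shell and release the allocation when finished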

SBATCH: Batch Job Submission

sbatch is used to submit a job script for later execution.

The shebang (#!) at the beginning of a script tells the shell which interpreter to use for executing the commands. In a Slurm script, it specifies that the script should be run using the Bash shell.

Code Block
#!/bin/bash

In Slurm, lines beginning with #SBATCH are treated as directives to the scheduler. To comment out a directive, add a second # at the beginning. For example, #SBATCH marks an active directive, while ##SBATCH is treated as a plain comment and ignored.
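
For example, in the fragment below the first directive is active while the second is ignored by Slurm:

Code Block
#SBATCH --mem=8G       # Active directive: request 8 GB of memory
##SBATCH --mem=16G     # Commented out: ignored by Slurm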

The #SBATCH lines in the script below contain directives that are recommended as defaults for all job submissions.

Code Block
#!/bin/bash
#SBATCH --job-name=my_job            # Job name
#SBATCH --partition=partition_name   # Partition to run the job in (e.g., scu-gpu)
#SBATCH --time=01:00:00              # Max runtime (1 hour)
#SBATCH --mem=8G                     # Memory requested
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --output=job_output-%j.out   # Standard output file
#SBATCH --error=job_error-%j.err     # Error output file

# Your commands here
srun python my_script.py

Additional flags to add to the sbatch script

Code Block
# Request 1 GPU
#SBATCH --gres=gpu:1 

# Set email notifications (optional)
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your_email@example.com

...

Submit the batch script

Code Block
sbatch script.sh

After the job has been submitted, you should see output similar to the example below, but with a different job ID.

Code Block
Submitted batch job 10880609

You can use the command below to check the progress of your submitted job in the queue.

Syntax: squeue -u <cwid>

Code Block
squeue -u scicomp

Output:

Code Block
JOBID  PARTITION    NAME     USER  ST       TIME  NODES NODELIST(REASON)
10880609   scu-cpu  cpu_job  scicomp  R     00:32      1 scu-node079
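
In the ST column, R means the job is running and PD means it is pending. For pending jobs, squeue can also report the scheduler's estimated start time, for example:

Code Block
squeue -u scicomp --start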

...

Scontrol

The scontrol show job command displays detailed information about a job that is pending or running in the queue.

Syntax: scontrol show job <jobid>

Code Block
scontrol show job 10880609

Output:

Code Block
JobId=10880609 JobName=bash
   UserId=scicomp GroupId=scicomp MCS_label=N/A
   Priority=87769 Nice=0 Account=scu QOS=cpu-limited
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2024-10-10T15:34:12 EligibleTime=2024-10-10T15:34:12
   AccrueTime=Unknown
   StartTime=2024-10-10T15:34:12 EndTime=2024-10-17T15:34:12 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-10-10T15:34:12 Scheduler=Main
   Partition=scu-cpu AllocNode:Sid=scu-login02:31492
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=scu-node079
   BatchHost=scu-node079
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=20G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/athena/labname/scratch/cwid
   Power=
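
The NodeList field above shows where the job is running. If you also want to inspect that node's resources (CPUs, memory, state), scontrol can describe nodes as well; the node name below is taken from the output above:

Code Block
scontrol show node scu-node079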

...

Terminating Jobs

The scancel command is used to cancel a pending or running job in the queue.

Syntax: scancel <jobid> or skill <jobid>

Code Block
scancel 219373

Or

Code Block
skill 219373
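
scancel can also cancel all of your jobs at once rather than one job ID at a time. For example, replacing scicomp with your own CWID:

Code Block
# Cancel all of your pending and running jobs
scancel -u scicomp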