Commands | Syntax | Description |
---|---|---|
sbatch | sbatch <script> | Submit a batch script to SLURM for processing. |
squeue | squeue -u <your cwid> | Show information about your job(s) in the queue. When run without the -u flag, the command lists jobs from all users. |
srun | srun [options] --pty bash | Run jobs interactively on the cluster. |
scancel | scancel <jobid> | End or cancel a queued job. |
sacct | sacct | Show information about current and previous jobs. |
sinfo | sinfo | Check the status of the cluster and its partitions, including availability, time limits, and the number of nodes. |
Requesting Resources
General Partition for all users on BRB Cluster
scu-cpu: 22 cpu nodes, 7-day runtime limit
scu-gpu: 4 gpu nodes, 2-day runtime limit
Syntax: sinfo
or sinfo --[optional flags]
...
Output: The output below lists every partition on the BRB cluster.
Code Block |
---|
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
scu-cpu* up 7-00:00:00 18 mix scu-node[023,032-033,035-047,049,079]
scu-cpu* up 7-00:00:00 4 alloc scu-node[020-022,034]
scu-gpu up 2-00:00:00 4 mix scu-node[050-051,081-082]
cryo-cpu up 7-00:00:00 1 idle scu-node065
cryo-cpu up 7-00:00:00 1 idle scu-node002
cryo-cpu up 7-00:00:00 2 mix scu-node[001,064]
cryo-cpu up 7-00:00:00 10 idle scu-node[063,066-074]
cryo-gpu up 2-00:00:00 6 mix scu-node[003-008]
cryo-gpu-v100 up 2-00:00:00 3 mix scu-node[054-056]
cryo-gpu-p100 up 2-00:00:00 1 mix scu-node060
cryo-gpu-p100 up 2-00:00:00 2 idle scu-node[061-062]
boudker-cpu up 7-00:00:00 1 alloc scu-node010
boudker-cpu up 7-00:00:00 1 idle scu-node009
boudker-gpu up 7-00:00:00 2 mix scu-node[011-012]
boudker-gpu-p100 up 7-00:00:00 3 idle scu-node[057-059]
accardi-gpu up 2-00:00:00 1 mix scu-node015
accardi-gpu up 2-00:00:00 2 alloc scu-node[013-014]
accardi-gpu2 up 2-00:00:00 1 idle scu-node016
accardi-cpu up 7-00:00:00 1 idle scu-node017
sackler-gpu up 7-00:00:00 1 mix scu-node018
sackler-cpu up 7-00:00:00 1 mix scu-node019
hwlab-rocky-cpu up 7-00:00:00 3 idle scu-node[052-053,099]
hwlab-rocky-gpu up 7-00:00:00 12 mix scu-node[085-096]
scu-res up 7-00:00:00 1 idle scu-login03
eliezer-gpu up 7-00:00:00 1 idle scu-node097 |
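Header | Description |
---|---|
PARTITION | The list of the cluster's partitions. A partition is a set of compute nodes grouped logically. |
AVAIL | The availability of the partition (for example, up or down). |
TIMELIMIT | The maximum job execution time in the partition. |
NODES | The number of nodes in the partition that are in the listed state. |
STATE | The state of those nodes. In the output above: idle (no jobs running), mix (partly allocated), or alloc (fully allocated). |
NODELIST | The list of nodes per partition. |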
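Because a partition can appear on several rows (one per node state), it can help to total the NODES column per partition. The snippet below is a minimal sketch of that idea using a few rows copied from the output above as sample data; `count_nodes` is a hypothetical helper name, not a Slurm command. The `*` after a partition name marks the default partition and is stripped before counting.

```shell
#!/bin/bash
# A few rows copied from the sinfo output above (sample data, not live output).
sinfo_sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
scu-cpu* up 7-00:00:00 18 mix scu-node[023,032-033,035-047,049,079]
scu-cpu* up 7-00:00:00 4 alloc scu-node[020-022,034]
scu-gpu up 2-00:00:00 4 mix scu-node[050-051,081-082]
cryo-cpu up 7-00:00:00 1 idle scu-node065'

# count_nodes (hypothetical helper): sum the NODES column (field 4) per
# partition, skipping the header row and dropping the "*" default marker.
count_nodes() {
  awk 'NR > 1 { sub(/\*$/, "", $1); total[$1] += $4 }
       END { for (p in total) print p, total[p] }' | sort
}

printf '%s\n' "$sinfo_sample" | count_nodes
# prints:
# cryo-cpu 1
# scu-cpu 22
# scu-gpu 4
```

On the cluster, the same function could presumably read live data with `sinfo | count_nodes`, and `sinfo --partition=scu-gpu` restricts the listing to a single partition.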
To request a specific number of GPUs, add a --gres request to your srun or sbatch command.
The example below requests 1 GPU (--gres=gpu:1). You can request up to 4 GPUs on a single node.
SRUN: Interactive Session
Example:
Code Block |
---|
srun --partition=partition_name --time=01:00:00 --gres=gpu:1 --mem=8G --cpus-per-task=4 --pty bash |
Breakdown:
--gres=gpu:1 : Allocates 1 GPU to your job.
--partition=partition_name : Specifies the partition to run the job in. Replace partition_name with the appropriate partition, like scu-gpu.
--time=01:00:00 : Requests 1 hour of runtime. Adjust the time as needed.
--mem=8G : Requests 8 GB of memory.
--cpus-per-task=4 : Requests 4 CPU cores.
--pty bash : Launches an interactive bash shell after resources are allocated.
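To try different resource combinations without retyping the whole line, the flags above can be assembled by a small shell function. This is only a sketch: `build_srun_cmd` is a hypothetical helper that prints the command instead of running it, and the fixed `--mem=8G` and `--cpus-per-task=4` values are simply carried over from the example. The check on the GPU count reflects the 4-GPU-per-node limit stated above.

```shell
# build_srun_cmd (hypothetical helper): print an srun command line for an
# interactive GPU session without running it.
# Arguments: partition name, GPU count (1-4), time limit (default 01:00:00).
build_srun_cmd() {
  bsc_partition=$1
  bsc_gpus=$2
  bsc_time=${3:-01:00:00}
  case "$bsc_gpus" in
    1|2|3|4) ;;
    *) echo "GPU count must be between 1 and 4" >&2; return 1 ;;
  esac
  printf 'srun --partition=%s --time=%s --gres=gpu:%s --mem=8G --cpus-per-task=4 --pty bash\n' \
    "$bsc_partition" "$bsc_time" "$bsc_gpus"
}

build_srun_cmd scu-gpu 2 02:00:00
# prints: srun --partition=scu-gpu --time=02:00:00 --gres=gpu:2 --mem=8G --cpus-per-task=4 --pty bash
```

Printing the command first makes it easy to review the request before pasting it into a login-node shell.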
SBATCH: Submits a job script for later execution.
The shebang (#!) at the beginning of a script tells the system which interpreter to use for executing the commands. In a Slurm script, it specifies that the script should be run using the Bash shell.
Code Block |
---|
#!/bin/bash |
In a Slurm script, lines beginning with #SBATCH are treated as directives (options passed to sbatch). To comment out a directive, add a second # at the beginning. For example, #SBATCH is a directive, while ##SBATCH indicates a comment.
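A quick way to see which directives are active in a script is to search for lines that start with exactly `#SBATCH`. The sketch below writes a small throwaway script and filters it; `active_directives` and the demo file contents are hypothetical and only illustrate the commenting rule described above.

```shell
# Write a small demo script with one active and one commented-out directive.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
##SBATCH --mem=8G
echo "hello"
EOF

# active_directives (hypothetical helper): print only the lines that start
# with "#SBATCH"; lines starting with "##SBATCH" do not match.
active_directives() {
  grep '^#SBATCH' "$1"
}

active_directives "$demo_script"
# prints: #SBATCH --job-name=demo
rm -f "$demo_script"
```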
The #SBATCH lines in the script below contain directives that are recommended as defaults for all job submissions.
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=cpu_job # Job name
#SBATCH --partition=cpu_partition # Partition to run the job (e.g., scu-cpu)
#SBATCH --time=01:00:00 # Max runtime (1 hour)
#SBATCH --mem=8G # Memory requested
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --output=job_output-%j.out # Standard output file (%j is replaced by the job ID)
#SBATCH --error=job_error-%j.err # Error output file
# Your commands here
srun python my_script.py |
Additional flags to add to sbatch script
Code Block |
---|
# Request 1 GPU
#SBATCH --gres=gpu:1
# Set email notifications (optional)
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your_email@example.com |
...
Submit the batch script
Code Block |
---|
sbatch script.sh |
After the job has been submitted, you should get output similar to the line below, but with a different jobid.
Code Block |
---|
Submitted batch job 10880609 |
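If a later command needs the job ID, it can be extracted from that confirmation line. The following is a minimal sketch; `extract_jobid` is a hypothetical helper, and the input here is the sample line from above rather than a real submission.

```shell
# extract_jobid (hypothetical helper): pull the numeric job ID out of the
# "Submitted batch job <jobid>" line that sbatch prints.
extract_jobid() {
  awk '/^Submitted batch job/ { print $4 }'
}

echo "Submitted batch job 10880609" | extract_jobid
# prints: 10880609

# On the cluster this could be combined with a real submission, for example:
#   jobid=$(sbatch script.sh | extract_jobid)
#   squeue -j "$jobid"
```

sbatch also accepts a `--parsable` option that prints the job ID without the surrounding text, which may make the parsing step unnecessary.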
You can use the command below to check the progress of your submitted job in the queue.
syntax: squeue -u <your cwid>
Code Block |
---|
squeue -u scicomp |
output
Code Block |
---|
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10880609 scu-cpu cpu_job scicomp R 00:32 1 scu-node079 |
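The ST column holds the job state code (for example, R for running or PD for pending). As a rough sketch, the state of a single job can be picked out of this output with awk; `job_state` is a hypothetical helper and the input is the sample output shown above.

```shell
# Sample squeue output copied from above (not live output).
squeue_sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10880609 scu-cpu cpu_job scicomp R 00:32 1 scu-node079'

# job_state (hypothetical helper): print the ST column (field 5) for the
# row whose JOBID (field 1) matches the given job ID.
job_state() {
  awk -v id="$1" '$1 == id { print $5 }'
}

printf '%s\n' "$squeue_sample" | job_state 10880609
# prints: R
```

On the cluster, the live equivalent would presumably be `squeue -u <your cwid> | job_state <jobid>`.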
...
Scontrol
syntax: scontrol show job <jobid>
Code Block |
---|
scontrol show job 10880609 |
output
Code Block |
---|
JobId=10880609 JobName=bash
UserId=scicomp GroupId=scicomp MCS_label=N/A
Priority=87769 Nice=0 Account=scu QOS=cpu-limited
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2024-10-10T15:34:12 EligibleTime=2024-10-10T15:34:12
AccrueTime=Unknown
StartTime=2024-10-10T15:34:12 EndTime=2024-10-17T15:34:12 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-10-10T15:34:12 Scheduler=Main
Partition=scu-cpu AllocNode:Sid=scu-login02:31492
ReqNodeList=(null) ExcNodeList=(null)
NodeList=scu-node079
BatchHost=scu-node079
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=20G,node=1,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=4 MinMemoryNode=20G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/athena/labname/scratch/cwid
Power= |
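The scontrol output is a series of Key=Value pairs, so a single field such as JobState or TimeLimit can be pulled out with standard text tools. The sketch below works on a few lines copied from the output above; `get_field` is a hypothetical helper and assumes the value contains no spaces.

```shell
# A few lines copied from the scontrol output above (sample data).
scontrol_sample='JobId=10880609 JobName=bash
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A
NodeList=scu-node079'

# get_field (hypothetical helper): put each Key=Value pair on its own line,
# then print everything after "<key>=" for the requested key.
get_field() {
  tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}

printf '%s\n' "$scontrol_sample" | get_field JobState   # prints: RUNNING
printf '%s\n' "$scontrol_sample" | get_field TimeLimit  # prints: 7-00:00:00
```

On the cluster, the same helper could be fed from `scontrol show job <jobid>`.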
...
Terminating Jobs
The scancel command cancels a job in the queue, whether it is pending or running.
Syntax: scancel <jobid>
or skill <jobid>
Code Block |
---|
scancel 219373 |
Or
Code Block |
---|
skill 219373 |