...
Requesting Resources
General SciComp partitions, available to all users on the BRB cluster:
scu-cpu: 22 CPU nodes, 7-day runtime limit
scu-gpu: 4 GPU nodes, 2-day runtime limit
...
Syntax: sinfo
or sinfo [optional flags]
...
Output: The output below lists all partitions on the BRB cluster.
Code Block |
---|
PARTITION         AVAIL  TIMELIMIT   NODES  STATE  NODELIST
scu-cpu*          up     7-00:00:00     18  mix    scu-node[023,032-033,035-047,049,079]
scu-cpu*          up     7-00:00:00      4  alloc  scu-node[020-022,034]
scu-gpu           up     2-00:00:00      4  mix    scu-node[050-051,081-082]
cryo-cpu          up     7-00:00:00      1  idle   scu-node065
cryo-cpu          up     7-00:00:00      1  idle   scu-node002
cryo-cpu          up     7-00:00:00      2  mix    scu-node[001,064]
cryo-cpu          up     7-00:00:00     10  idle   scu-node[063,066-074]
cryo-gpu          up     2-00:00:00      6  mix    scu-node[003-008]
cryo-gpu-v100     up     2-00:00:00      3  mix    scu-node[054-056]
cryo-gpu-p100     up     2-00:00:00      1  mix    scu-node060
cryo-gpu-p100     up     2-00:00:00      2  idle   scu-node[061-062]
boudker-cpu       up     7-00:00:00      1  alloc  scu-node010
boudker-cpu       up     7-00:00:00      1  idle   scu-node009
boudker-gpu       up     7-00:00:00      2  mix    scu-node[011-012]
boudker-gpu-p100  up     7-00:00:00      3  idle   scu-node[057-059]
accardi-gpu       up     2-00:00:00      1  mix    scu-node015
accardi-gpu       up     2-00:00:00      2  alloc  scu-node[013-014]
accardi-gpu2      up     2-00:00:00      1  idle   scu-node016
accardi-cpu       up     7-00:00:00      1  idle   scu-node017
sackler-gpu       up     7-00:00:00      1  mix    scu-node018
sackler-cpu       up     7-00:00:00      1  mix    scu-node019
hwlab-rocky-cpu   up     7-00:00:00      3  idle   scu-node[052-053,099]
hwlab-rocky-gpu   up     7-00:00:00     12  mix    scu-node[085-096]
scu-res           up     7-00:00:00      1  idle   scu-login03
eliezer-gpu       up     7-00:00:00      1  idle   scu-node097 |
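sinfo also accepts optional flags to narrow the output; for example, the sketch below (using the scu-cpu and scu-gpu partitions shown above) limits the listing to the general SciComp partitions and then prints one line per node with more detail.
Code Block |
---|
# Show only the general SciComp partitions
sinfo -p scu-cpu,scu-gpu

# Show each node in those partitions on its own line, in long format
sinfo -p scu-cpu,scu-gpu -N -l |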
...
SRUN: Interactive Session
Example:
Code Block |
---|
srun --gres=gpu:1 --partition=partition_name --time=01:00:00 --mem=8G --cpus-per-task=4 --pty bash |
...
--gres=gpu:1: Allocates 1 GPU to your job.
--partition=partition_name: Specifies the partition to run the job in. Replace partition_name with the appropriate partition, like scu-gpu.
--time=01:00:00: Requests 1 hour of runtime. Adjust the time as needed.
--mem=8G: Requests 8 GB of memory.
--cpus-per-task=4: Requests 4 CPU cores.
--pty bash: Launches an interactive bash shell after resources are allocated.
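As a concrete sketch of a typical interactive session (the partition name is only an example, and nvidia-smi assumes the NVIDIA tools are installed on the GPU node):
Code Block |
---|
# Request an interactive shell with 1 GPU on the scu-gpu partition (example values)
srun --gres=gpu:1 --partition=scu-gpu --time=01:00:00 --mem=8G --cpus-per-task=4 --pty bash

# Once the prompt returns, you are on the allocated compute node
hostname       # should print an scu-node name rather than the login node
nvidia-smi     # should list the single GPU allocated to the job
exit           # ends the interactive job and releases the resources |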
SBATCH: sbatch is used to submit a job script for later execution.
The shebang (#!) at the beginning of a script tells the shell which interpreter to use for executing the commands. In a Slurm script, it specifies that the script should be run using the Bash shell:
Code Block |
---|
#!/bin/bash |
In Slurm, lines beginning with #SBATCH are treated as scheduler directives. To comment out a Slurm directive, add a second # at the beginning: #SBATCH marks an active directive, while ##SBATCH is ignored as a comment.
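For example, in the fragment below the first line is an active directive and the second is ignored by Slurm (bash ignores both, since they start with #):
Code Block |
---|
#SBATCH --mem=8G     # active Slurm directive
##SBATCH --mem=16G   # commented out; Slurm ignores this line |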
The #SBATCH lines in the script below contain directives that are recommended as defaults for all job submissions.
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=cpu_job             # Job name
#SBATCH --partition=scu-cpu            # Partition to run the job in (e.g., scu-cpu or scu-gpu)
#SBATCH --time=01:00:00                # Max runtime (1 hour)
#SBATCH --mem=8G                       # Memory requested
#SBATCH --cpus-per-task=4              # Number of CPU cores per task
#SBATCH --output=job_output-%j.out     # Standard output file (%j expands to the job ID)
#SBATCH --error=job_error-%j.err       # Error output file

# Your commands here
srun python my_script.py |
Additional flags you can add to an sbatch script
Code Block |
---|
# Request 1 GPU
#SBATCH --gres=gpu:1
# Set email notifications (optional)
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your_email@example.com |
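Putting the pieces together, a GPU variant of the script above might look like the sketch below; the job name, email address, and my_script.py are placeholders.
Code Block |
---|
#!/bin/bash
#SBATCH --job-name=gpu_job                   # Job name
#SBATCH --partition=scu-gpu                  # GPU partition
#SBATCH --gres=gpu:1                         # Request 1 GPU
#SBATCH --time=01:00:00                      # Max runtime (1 hour)
#SBATCH --mem=8G                             # Memory requested
#SBATCH --cpus-per-task=4                    # Number of CPU cores per task
#SBATCH --output=job_output-%j.out           # Standard output file
#SBATCH --error=job_error-%j.err             # Error output file
#SBATCH --mail-type=BEGIN,END,FAIL           # Email notifications (optional)
#SBATCH --mail-user=your_email@example.com   # Where notifications are sent

# Your commands here
srun python my_script.py |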
...
Submit the batch script
Code Block |
---|
sbatch script.sh |
After the job has been submitted, you should see output similar to the example below, but with a different jobid.
Code Block |
---|
Submitted batch job 10880609 |
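If you want to store the job ID in a shell variable (for example, to pass to scontrol or scancel later), sbatch's --parsable flag prints just the ID; a minimal sketch:
Code Block |
---|
jobid=$(sbatch --parsable script.sh)
echo "Submitted job $jobid" |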
You can use the command below to check the progress of your submitted job in the queue.
Syntax: squeue -u <your CWID>
Code Block |
---|
squeue -u scicomp |
Output:
Code Block |
---|
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10880609 scu-cpu cpu_job scicomp R 00:32 1 scu-node079 |
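The ST column shows the job state (PD = pending, R = running). To keep an eye on the queue without retyping the command, you can wrap it in watch; a small example, assuming your CWID is your login name in $USER:
Code Block |
---|
# Refresh the queue listing every 30 seconds (Ctrl+C to stop)
watch -n 30 squeue -u $USER |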
...
Scontrol
The scontrol show job command displays detailed information about a job, including its state, time limits, and allocated resources.
Syntax: scontrol show job <jobid>
Code Block |
---|
scontrol show job 10880609 |
Output:
Code Block |
---|
JobId=10880609 JobName=bash
UserId=scicomp GroupId=scicomp MCS_label=N/A
Priority=87769 Nice=0 Account=scu QOS=cpu-limited
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2024-10-10T15:34:12 EligibleTime=2024-10-10T15:34:12
AccrueTime=Unknown
StartTime=2024-10-10T15:34:12 EndTime=2024-10-17T15:34:12 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-10-10T15:34:12 Scheduler=Main
Partition=scu-cpu AllocNode:Sid=scu-login02:31492
ReqNodeList=(null) ExcNodeList=(null)
NodeList=scu-node079
BatchHost=scu-node079
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=20G,node=1,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=4 MinMemoryNode=20G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/athena/labname/scratch/cwid
Power= |
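The full scontrol output is long; if you only need a few fields, you can filter it with grep. A small example using field names from the output above:
Code Block |
---|
# Print only the lines containing the job state, run time, and node list
scontrol show job 10880609 | grep -E 'JobState|RunTime|NodeList' |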
...
Terminating Jobs
The scancel command is used to cancel a job in the queue, whether it is pending or running.
Syntax: scancel <jobid>
or skill <jobid>
Code Block |
---|
scancel 219373 |
Or
Code Block |
---|
skill 219373 |
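scancel can also cancel every job you own rather than a single job ID; for example, using the scicomp account from the examples above:
Code Block |
---|
# Cancel all of your pending and running jobs (replace scicomp with your CWID)
scancel -u scicomp |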