...

Requesting Resources

General SciComp partitions for all users on the BRB cluster

  • scu-cpu: 22 cpu nodes, 7-day runtime limit

  • scu-gpu: 4 gpu nodes, 2-day runtime limit

...

Syntax: sinfo or sinfo [optional flags]

...

SRUN: Interactive Session

Example:

Code Block
srun --gres=gpu:1 --partition=partition_name --time=01:00:00 --mem=8G --cpus-per-task=4 --pty bash

...

  • --gres=gpu:1: Allocates 1 GPU to your job.

  • --partition=partition_name: Specifies the partition to run the job in. Replace partition_name with the appropriate partition, like scu-gpu.

  • --time=01:00:00: Requests 1 hour of runtime. Adjust the time as needed.

  • --mem=8G: Requests 8 GB of memory.

  • --cpus-per-task=4: Requests 4 CPU cores.

  • --pty bash: Launches an interactive bash shell after resources are allocated.
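The flags described above can be kept in variables so that one value can be changed without retyping the whole command. This is only a sketch: the variable names are made up for illustration, and scu-gpu is used because it is the GPU partition named earlier on this page. The snippet prints the command for review rather than running it.

```shell
# Illustrative values; adjust them to your job.
partition="scu-gpu"   # GPU partition named earlier on this page
gpus=1
runtime="01:00:00"
memory="8G"
cpus=4

# Assemble the same srun command as in the example, one flag per setting.
cmd="srun --gres=gpu:${gpus} --partition=${partition} --time=${runtime} --mem=${memory} --cpus-per-task=${cpus} --pty bash"

# Print the command; copy it to the prompt on a login node to start the session.
echo "$cmd"
```

Printing first makes it easy to check the request before it enters the queue.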

SBATCH: Submission Script

sbatch is used to submit a job script for later execution.

The shebang (#!) at the beginning of a script tells the system which interpreter to use for executing the commands. In a Slurm script, it specifies that the script should be run using Bash.

Code Block
#!/bin/bash

In Slurm, lines beginning with #SBATCH are treated as directives (options passed to sbatch) rather than as ordinary comments. To comment out a directive, add a second # at the beginning. For example, #SBATCH is a directive, while ##SBATCH is a comment.
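As a quick check of which lines are active, the #SBATCH lines in a script can be counted with grep. The sketch below writes a throwaway script to a temporary file; its contents are illustrative and not a recommended configuration.

```shell
# Write a small example script to a temporary file (contents are illustrative).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH --time=01:00:00
##SBATCH --mem=8G
#SBATCH --cpus-per-task=4
EOF

# '^#SBATCH' matches only lines that start with a single '#';
# '##SBATCH' lines are skipped because their second character is '#'.
active=$(grep -c '^#SBATCH' "$script")
commented=$(grep -c '^##SBATCH' "$script")
echo "active directives: $active, commented out: $commented"

rm -f "$script"
```

Here the --mem line is commented out, so only the --time and --cpus-per-task lines count as active.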

The #SBATCH lines in the script below contain directives that are recommended as defaults for all job submissions.

Code Block
#!/bin/bash
#SBATCH --job-name=cpu_job           # Job name
#SBATCH --partition=cpu_partition    # Partition to run the job (e.g., scu-cpu)
#SBATCH --time=01:00:00              # Max runtime (1 hour)
#SBATCH --mem=8G                     # Memory requested
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --output=job_output-%j.out   # Standard output file (%j expands to the job ID)
#SBATCH --error=job_error-%j.err     # Error output file


# Your commands here
srun python my_script.py

Additional flags to add to sbatch script

Code Block
# Request 1 GPU
#SBATCH --gres=gpu:1 

# Set email notifications (optional)
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your_email@example.com

...

Submit the batch script

Code Block
sbatch script.sh

After the job has been submitted, you should see output similar to the following, with a different job ID.

Code Block
Submitted batch job 10880609
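The job ID in that confirmation line is needed by the squeue, scontrol, and scancel commands, so it can be handy to capture it in a variable instead of copying it by hand. A minimal sketch follows; the function name extract_jobid is hypothetical, and the sample line stands in for real sbatch output.

```shell
# Hypothetical helper: print the job ID from an sbatch confirmation line.
extract_jobid() {
    # The ID is the fourth whitespace-separated field of
    # "Submitted batch job <jobid>".
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Sample line copied from the output above; on the cluster this would be
# the text printed by: sbatch script.sh
submit_line="Submitted batch job 10880609"
jobid=$(extract_jobid "$submit_line")
echo "Job ID: $jobid"
```

sbatch also offers a --parsable option that prints the job ID without the surrounding text, which may make the parsing step unnecessary.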

You can use the command below to check the progress of your submitted job in the queue.

syntax: squeue -u <your cwid>

Code Block
squeue -u scicomp

output

Code Block
JOBID     PARTITION  NAME     USER     ST  TIME   NODES  NODELIST(REASON)
10880609  scu-cpu    cpu_job  scicomp  R   00:32  1      scu-node079
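In scripts, often only the ST (state) column is of interest. The sketch below pulls it out for one job ID with awk. The helper name job_state is made up for illustration, and the sample text is copied from the output above so the snippet runs without a cluster.

```shell
# Sample squeue output; on the cluster, pipe the real command instead:
#   squeue -u <your cwid> | job_state <jobid>
sample='JOBID  PARTITION    NAME     USER  ST       TIME  NODES NODELIST(REASON)
10880609   scu-cpu  cpu_job  scicomp  R     00:32      1 scu-node079'

# Hypothetical helper: read squeue output on stdin and print the ST column
# (fifth field) of the row whose JOBID matches the first argument.
job_state() {
    awk -v id="$1" '$1 == id {print $5}'
}

state=$(printf '%s\n' "$sample" | job_state 10880609)
echo "Job 10880609 state: $state"
```

In squeue output, R means the job is running and PD means it is pending.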

...

Scontrol

syntax: scontrol show job <jobid>

Code Block
scontrol show job 10880609

output

Code Block
JobId=10880609 JobName=bash
   UserId=scicomp GroupId=scicomp MCS_label=N/A
   Priority=87769 Nice=0 Account=scu QOS=cpu-limited
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2024-10-10T15:34:12 EligibleTime=2024-10-10T15:34:12
   AccrueTime=Unknown
   StartTime=2024-10-10T15:34:12 EndTime=2024-10-17T15:34:12 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-10-10T15:34:12 Scheduler=Main
   Partition=scu-cpu AllocNode:Sid=scu-login02:31492
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=scu-node079
   BatchHost=scu-node079
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=20G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=4 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/athena/labname/scratch/cwid
   Power=
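The output is a series of Key=Value pairs, so single fields such as JobState or RunTime can be extracted with standard text tools. The following is a sketch under that assumption; scontrol_field is a hypothetical helper name, and the sample is a few lines copied from the output above.

```shell
# A few lines of the scontrol output above, used as sample input.
# On the cluster: scontrol show job <jobid> | scontrol_field JobState
sample='JobId=10880609 JobName=bash
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:02:14 TimeLimit=7-00:00:00 TimeMin=N/A'

# Hypothetical helper: print the value of one Key=Value field read from stdin.
scontrol_field() {
    # Put each Key=Value pair on its own line, then print the text after
    # "key=" for the first pair whose key matches exactly.
    tr -s ' ' '\n' | awk -F= -v key="$1" '$1 == key {print substr($0, length(key) + 2); exit}'
}

job_state=$(printf '%s\n' "$sample" | scontrol_field JobState)
run_time=$(printf '%s\n' "$sample" | scontrol_field RunTime)
echo "state=$job_state runtime=$run_time"
```

Because the helper splits on spaces, it only suits fields whose values contain no spaces.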

...

Terminating Jobs

The scancel command cancels a job in the queue, whether the job is pending or running.

Syntax: scancel <jobid> or skill <jobid>

Code Block
scancel 219373

Or

Code Block
skill 219373
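Because cancelling cannot be undone, a script that calls scancel may want to check the job ID first. The sketch below accepts only purely numeric IDs and prints the command instead of running it. The helper name is_jobid is hypothetical, and the check would reject job array IDs that contain an underscore.

```shell
# Hypothetical helper: succeed only if the argument is a non-empty string of digits.
is_jobid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

jobid="219373"
if is_jobid "$jobid"; then
    # Dry run: print the command; remove the echo to actually cancel the job.
    echo "scancel $jobid"
else
    echo "not a job ID: $jobid" >&2
fi
```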