
...

Computational jobs on the AI cluster are managed with the SLURM job manager. We provide an in-depth tutorial on how to use SLURM <placeholder>; this section covers some basic examples that are immediately applicable on the AI cluster.

Important notice:

Warning

Do not run computations on login nodes

...

Code Block
gcc -fopenmp code.c

This will generate an executable a.out that we will use to illustrate how to submit jobs. The code accepts two arguments, the number of Monte Carlo samples and the number of parallel threads to use, and can be executed as ./a.out 1000 8 (1000 samples, run with 8 parallel threads).
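
For example, a short test run (the sample and thread counts below are just illustrative values; remember to run it on a compute node, not a login node):

Code Block
languagebash
./a.out 1000 8    # 1000 Monte Carlo samples, 8 OpenMP threads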

SLURM Batch job example

Here is an example of a batch script that can be run on the cluster. In this script we request a single node with 4 CPU cores, used in SMP mode.

Code Block
languagebash
#!/bin/bash

#SBATCH --job-name=<jobname> # give your job a name
#SBATCH --nodes=1            # asking for 1 compute node
#SBATCH --ntasks=1           # 1 task
#SBATCH --cpus-per-task=4    # 4 CPU cores per task, so 4 cores in total         
#SBATCH --time=00:30:00      # set this time according to your need, 30 minutes here
#SBATCH --mem=8GB            # request appropriate amount of RAM
##SBATCH --gres=gpu:1        # if you need to use a GPU, note that this line is commented out
#SBATCH -p <partition_name>  # specify your partition

cd <code_directory>
export OMP_NUM_THREADS=4    # OMP_NUM_THREADS should be equal to the number of cores you are requesting
./a.out 1000000000 4        # running 1000000000 samples on 4 cores

Submit this job to the queue and inspect the output once it’s finished:

Code Block
sbatch script_name
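
sbatch responds with the job ID of the submitted job (a line like Submitted batch job <jobid>). Unless you redirect it with the --output option, SLURM writes the job's standard output to a file named slurm-<jobid>.out in the directory you submitted from, so once the job finishes you can inspect it with, for example:

Code Block
languagebash
cat slurm-<jobid>.out    # replace <jobid> with the ID reported by sbatch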

You can view your jobs in the queue with:

Code Block
squeue -u <cwid>
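
The default squeue output looks roughly like the following (the exact columns can vary with the cluster's configuration); the ST column shows the job state, for example PD for pending and R for running:

Code Block
 JOBID PARTITION      NAME     USER ST       TIME  NODES NODELIST(REASON)
 12345 <partition> <jobname>  <cwid>  R       0:42      1 <node>

If you need to remove a job from the queue, scancel <jobid> cancels it.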

Try running a few jobs with different numbers of cores and see how the runtime scales almost linearly.
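
One convenient way to run such a scaling test without editing the script for every core count is to derive the thread count inside the script from the SLURM_CPUS_PER_TASK environment variable (which SLURM sets when --cpus-per-task is specified) and override the core count on the sbatch command line, since command-line options take precedence over #SBATCH directives. This is only a sketch; it assumes the batch script above is saved as script_name and adjusted as shown:

Code Block
languagebash
# In the batch script, replace the hard-coded thread count with:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}   # threads = allocated cores (default 1)
./a.out 1000000000 $OMP_NUM_THREADS

# Then submit the same script with different core counts:
for n in 1 2 4 8; do
    sbatch --cpus-per-task=$n --job-name=scaling_$n script_name
done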

Info

The more resources you request, the harder it is for SLURM to find space for your job. For parallel jobs, more CPUs mean a faster run, but the job may sit in the queue for longer. Be aware of this trade-off; there is no universally best strategy, it usually depends on what resources your particular job needs and how busy the cluster is at the moment.

SLURM interactive job example

Even though interactive jobs are inefficient and not recommended, sometimes there is no other way to do certain things. If you need to run an interactive job, here is how it can be done:

Code Block
srun --nodes 1 \
    --ntasks-per-node 1 \
    --cpus-per-task 4 \
    --partition=<partition_name> \
    --gres=gpu:1 \
    --pty /bin/bash -i

Once the job is successfully started, you will be dropped into an interactive bash session on one of the compute nodes. The same scheduling considerations apply here: the more resources you request, the longer the potential wait time.
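
Once inside the session, you can verify what was actually allocated and run the test binary directly; the commands below are just illustrative checks (nvidia-smi is only relevant if a GPU was requested with --gres):

Code Block
languagebash
hostname                               # which compute node you landed on
echo $SLURM_CPUS_PER_TASK              # how many CPU cores were allocated
nvidia-smi                             # inspect the allocated GPU, if one was requested
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./a.out 1000000 $OMP_NUM_THREADS       # run the OpenMP test code interactively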

Once you are done with your interactive work, simply run the exit command. It will end your bash process, and the whole SLURM job will therefore be cancelled.

Jupyter job

Monitoring job status