...
Computational jobs on the AI cluster are managed with the SLURM job manager. We provide an in-depth tutorial on how to use SLURM <placeholder>, but this section covers some basic examples that are immediately applicable on the AI cluster.
Important notice:

**Warning:** Do not run computations on login nodes.
...
Compile the example code with OpenMP support enabled:

```bash
gcc -fopenmp code.c
```
This will generate an executable a.out that we will use to illustrate how to submit jobs. The code accepts two arguments: the number of Monte Carlo samples and the number of parallel threads to use. It can be executed as ./a.out 1000 8 (1000 samples, run with 8 parallel threads).
SLURM Batch job example
Here is an example of a batch script that can be run on the cluster. In this script we are requesting a single node with 4 CPU cores, used in SMP mode.
```bash
#!/bin/bash
#SBATCH --job-name=<jobname>        # give your job a name
#SBATCH --nodes=1                   # asking for 1 compute node
#SBATCH --ntasks=1                  # 1 task
#SBATCH --cpus-per-task=4           # 4 CPU cores per task, so 4 cores in total
#SBATCH --time=00:30:00             # set this time according to your need, 30 minutes here
#SBATCH --mem=8GB                   # request appropriate amount of RAM
##SBATCH --gres=gpu:1               # if you need to use a GPU; note that this line is commented out
#SBATCH -p <partition_name>         # specify your partition

cd <code_directory>                 # make sure to run in the directory containing a.out
export OMP_NUM_THREADS=4            # OMP_NUM_THREADS should be equal to the number of cores you are requesting
./a.out 1000000000 4                # running 1000000000 samples on 4 cores
```
Submit this job to the queue and inspect the output once it’s finished:
```bash
sbatch script_name
```
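If the submission succeeds, sbatch prints the assigned job ID, and by default the job's output is written to a file named slurm-<jobid>.out in the directory you submitted from (unless you override this with the --output option). For example:

```bash
sbatch script_name       # prints: Submitted batch job <jobid>
cat slurm-<jobid>.out    # inspect the output once the job has finished
```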
You can view your jobs in the queue with:
```bash
squeue -u <cwid>
```
Try running a few jobs with different numbers of cores and see how the performance scales almost linearly; one way to script such a scaling test is sketched below.
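As a sketch (assuming you adapt the batch script so the thread count follows the allocation instead of being hard-coded, and assuming a script file named scaling_test.sh), you can submit the same script with several core counts from the command line. Options passed to sbatch on the command line override the matching #SBATCH directives in the script.

```bash
# Inside the batch script, derive the thread count from the allocation:
#   export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
#   ./a.out 1000000000 $SLURM_CPUS_PER_TASK

# Then submit the same script with different core counts (scaling_test.sh is hypothetical):
for n in 1 2 4 8; do
    sbatch --cpus-per-task=$n scaling_test.sh
done
```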
**Info:** The more resources you request from SLURM, the harder it is for SLURM to allocate space for your job. For parallel jobs, more CPUs means a faster run, but the job may sit in the queue for longer. Be aware of this trade-off; there is no universal answer for the best strategy, as it usually depends on what resources your particular job needs and how busy the cluster is at the moment.
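If you want a rough idea of how long a pending job may wait, SLURM can report an estimated start time (the estimate depends on the scheduler's current plan and may change):

```bash
# show estimated start times for your pending jobs
squeue -u <cwid> --start
```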
SLURM interactive job example
Even though interactive jobs are inefficient and not recommended, sometimes there is no other way to do certain things. If you need to run an interactive job, here is how it can be done:
```bash
srun --nodes 1 \
     --tasks-per-node 1 \
     --cpus-per-task 4 \
     --partition=<partition_name> \
     --gres=gpu:1 \
     --pty /bin/bash -i
```
Once the job is successfully started, you will be dropped into an interactive bash session on one of the compute nodes. The same scheduling considerations apply here: the more resources you request, the longer the potential wait time.
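Once inside the session, you can check that you got what you asked for (a quick sketch; nvidia-smi assumes an NVIDIA GPU node with the driver tools installed):

```bash
echo $SLURM_JOB_ID $SLURM_CPUS_PER_TASK   # SLURM sets these variables inside the job
nvidia-smi                                # confirm the requested GPU is visible
```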
Once you are done with your interactive work, simply run the exit command. It will terminate your bash process, and therefore the whole SLURM job will be cancelled.
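If an interactive session is left hanging (for example, after a dropped connection), you can also find and cancel the job from a login node:

```bash
squeue -u <cwid>      # find the job ID of the lingering session
scancel <jobid>       # cancel it explicitly
```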