AI Cluster

Overview

In collaboration with Dr. Mert Sabuncu from Radiology, the ITS team has established the framework for a new high-performance computing (HPC) cluster dedicated to AI/ML workflows, such as training neural networks for imaging and large language models (LLMs).

This cluster features high-memory nodes, Nvidia GPU servers (A100, A40 and L40), InfiniBand interconnect, and specialized storage designed for AI workloads.

The AI cluster supports special time-limited projects that deal with clinical data. Resources for such projects are granted upon special request; contact scu@med.cornell.edu for more information.

Login to the AI cluster

The AI cluster is accessible via SSH terminal sessions. You need to connect from the WCM network, or have the VPN installed and enabled. Replace <cwid> with your own CWID.

ssh <cwid>@ai-login01.med.cornell.edu
# or
ssh <cwid>@ai-login02.med.cornell.edu

Once logged on:

Last login: Fri Jan  3 11:35:53 2025 from 157.000.00.00
<cwid>@ai-login01:~$
<cwid>@ai-login01:~$ pwd
/home/<cwid>
<cwid>@ai-login01:~$

Storage

AI cluster has the following storage systems configured:

Name	Mount point	Size	Use	Is backed up?	Comment
Home	/home	2 TB	Home file system. Used to keep small files, configs, code, scripts, etc.	no	Has limited space. Only intended for small files.
Midtier	/midtier/<labname>	varies per lab	Each lab has an allocation under /midtier/<labname>/scratch/<cwid>. Intended for data that is actively being used or processed, such as research datasets.	no	
AI GPFS	/bhii	700 TB	tbd	no	Parallel file system for data-intensive workloads. Limited access, granted on special request.
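Since /home is small, it helps to check how much space a directory takes before filling it up. The sketch below assumes the standard df and du tools are available on the login nodes; the function name show_usage is made up for this example, and any cluster-specific quota tooling is not covered here.

```shell
#!/bin/bash
# Hypothetical helper (not a cluster-provided tool): print the free space of
# the file system that holds a directory, then the total size of the directory.
show_usage() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    df -h "$dir"    # capacity and free space of the underlying file system
    du -sh "$dir"   # total size of everything under the directory
}

# Example: check your home directory (du can take a while on large trees)
# show_usage "$HOME"
```

The same call works for a lab allocation, for example show_usage /midtier/<labname>/scratch/<cwid>.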

Software applications

Access to applications is managed with modules. Refer to the modules page for a detailed tutorial; here is a quick list of commands that can be used on the AI cluster:

# list all the available modules:
module avail
# list currently loaded modules:
module list
# load a module:
module load <module_name>
# unload the module:
module unload <module_name>
# swap versions of the application
module swap <module_name>/<version1> <module_name>/<version2>
# unload all modules
module purge
# get help
module help
# get more info for a particular module
module help <module_name>

If you can’t find an application that you need in the output of the module avail command, contact scu@med.cornell.edu and request that it be installed on the cluster.
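The list printed by module avail can be long, so a keyword search is often quicker. A minimal sketch follows; it assumes that module avail may write its listing to stderr (common for module systems), which is why stderr is merged into stdout before filtering. The helper name find_module is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: case-insensitive search through the module list.
# "module avail" may print to stderr, so stderr is merged into stdout first.
find_module() {
    if ! type module >/dev/null 2>&1; then
        echo "the module command is not available in this shell" >&2
        return 1
    fi
    module avail 2>&1 | grep -i -- "$1"
}

# Example: look for Python-related modules
# find_module python
```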

Running jobs

Computational jobs on the AI cluster are managed with the SLURM job manager. We provide an in-depth tutorial on how to use SLURM <placeholder>; this section covers some basic examples that are immediately applicable on the AI cluster.

Important notice

Do not run computations on login nodes

Running your application code directly without submitting it through the scheduler is prohibited. Login nodes are shared resources and they are reserved for light tasks like file management and job submission. Running heavy computations on login nodes can degrade performance for all users. Instead, please submit your compute jobs to the appropriate SLURM queue, which is designed to handle such workloads efficiently.
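One way to avoid starting a heavy script on a login node by accident is to have the script check whether it is running inside a SLURM job. The sketch below relies on the SLURM_JOB_ID environment variable, which SLURM sets inside the jobs it starts; the function name in_slurm_job is made up for illustration.

```shell
#!/bin/bash
# SLURM sets SLURM_JOB_ID inside every job it starts; on a login node the
# variable is normally unset. in_slurm_job is a hypothetical helper name.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "running inside SLURM job ${SLURM_JOB_ID}"
else
    echo "not inside a SLURM job: submit this work with sbatch instead"
fi
```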

Batch vs interactive jobs

There are two mechanisms to run SLURM jobs: “batch” and “interactive”. Interactive jobs are an inefficient way to utilize the cluster. By their nature, these jobs require the system to wait for user input, leaving the allocated resources idle during those periods. Since HPC clusters are designed to maximize resource utilization and efficiency, having nodes sit idle while still consuming CPU, memory, or GPU resources is counterproductive.

For example:

If you're running an interactive session and step away or take time to analyze output, the allocated resources remain reserved but unused.
This idle time adds up across multiple users, leading to significant underutilization of the cluster.

Because of these considerations, we highly recommend that users run as much of their computation as possible in batch mode, so that WCM’s research community can make the most of the cluster's capabilities.

Job Preemption

Understanding Job Preemption

Job preemption helps us make the most of our AI Cluster by allowing idle compute nodes to be used by other researchers. This means if a lab's dedicated nodes aren't currently in use by its members, others can run computations on them.

What are Preemptible Jobs?

Preemptible jobs are tasks that can run on nodes outside your primary lab group, utilizing available resources across the cluster. These jobs are ideal for:

Quick tests and debugging.
Short computations.
Jobs that can save their progress (checkpoint) and resume later if interrupted.

The key thing to remember is: preemptible jobs can be cancelled (preempted) if the hardware owner needs their resources back.

How to Run Preemptible Jobs

To submit a job that can run on available nodes belonging to other labs, you'll use specific SLURM directives in your submission script:

#SBATCH --partition preempt_cpu # use preempt_gpu instead if you need GPUs
#SBATCH --qos low

The first parameter specifies the partition. Instead of running in your designated partition, you can run in the preemptible partition, which is mapped to all the nodes of the cluster. The second parameter specifies the “quality of service” (QoS) for your job. This assigns your job a lower priority, allowing it to use idle resources without interfering with higher-priority jobs from hardware owners.
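Because a preemptible job can be cancelled, it should save its progress regularly and pick up where it left off when resubmitted. The script below is a minimal sketch of that pattern. The job name, time limit, checkpoint file name and step counter are all illustrative assumptions, and it uses plain periodic checkpointing because this page does not say how a job is notified before it is cancelled.

```shell
#!/bin/bash
#SBATCH --job-name=preempt_demo   # illustrative name
#SBATCH --partition=preempt_cpu   # use preempt_gpu if you need GPUs
#SBATCH --qos=low
#SBATCH --time=02:00:00           # illustrative time limit

# Run up to <total> steps, resuming from the step stored in <ckpt_file>.
# A real job would store model weights or similar instead of a counter.
run_with_checkpoint() {
    ckpt="$1"
    total="$2"
    step=0
    if [ -f "$ckpt" ]; then
        step=$(cat "$ckpt")
        echo "resuming after step $step"
    fi
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        # Write to a temporary file first, then rename, so that a job killed
        # mid-write never leaves a truncated checkpoint behind.
        echo "$step" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
    done
    echo "finished $step steps"
}

# The checkpoint lives in the submission directory unless CKPT is set
run_with_checkpoint "${CKPT:-$PWD/preempt_demo.ckpt}" "${TOTAL_STEPS:-10}"
```

If the job is cancelled part-way, submitting the same script again continues from the last completed step rather than from the beginning.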

Example Scenario

Let's say you're a member of "testlab1". All nodes in your "testlab1" partition are busy with your lab's jobs. However, you see, for example, that "testlab2" has two idle nodes, and "testlab3" has one. You can submit a job that runs on one of the idle nodes belonging to "testlab2" or "testlab3".

Guaranteed Run Time: This preemptible job is guaranteed to run for a minimum period, known as PreemptExemptTime. Currently, this is set to 1 hour.

After Guaranteed Time: Once this 1-hour PreemptExemptTime has passed, your preemptible job can be cancelled at any moment if a higher-priority job needs the resources.

Imagine your preemptible job is running on a "testlab2" node.

If a member of "testlab2" submits a job to their own "testlab2" partition: Their job will have a higher QoS (e.g., "high"). If your preemptible job has already run for at least 1 hour (PreemptExemptTime), it will be cancelled to make way for the "testlab2" member's job.
If your job has run for less than 1 hour: It will continue to run until it reaches the PreemptExemptTime, even if a "testlab2" member submits a job. The "testlab2" job will wait.

On the other hand, imagine that a researcher outside your lab is running a preemptible job on a node that belongs to your lab, “testlab1”. Your job may be blocked by their job for the duration of PreemptExemptTime, but as soon as their job reaches this threshold, it will be killed and your job will start.

Why Do We Use Preemption?

This system helps us:

Minimize idle time on compute nodes.
Increase overall cluster utilization, allowing more research to get done.

By using preemptible partitions, you gain access to more resources for short or flexible tasks, while ensuring that lab members can always access their dedicated hardware when needed (after a potential brief wait for the PreemptExemptTime).

Code example

To illustrate how to run computational jobs, consider the following toy problem implemented in C. It estimates the value of π using a random sampling (Monte Carlo) method:

Generate random points (x, y) in a unit square (0 ≤ x, y ≤ 1).
Count how many points fall inside the quarter circle (x² + y² ≤ 1).
The fraction of points that fall inside the quarter circle approximates its area, π/4, so π ≈ 4 × (points inside circle) / (total points).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

long long monte_carlo_pi(long long num_samples, int num_threads) {
    long long inside_circle = 0;
    #pragma omp parallel num_threads(num_threads)
    {
        unsigned int seed = 1234 + omp_get_thread_num();  // Unique seed for each thread
        long long local_count = 0;
        #pragma omp for
        for (long long i = 0; i < num_samples; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0) {
                local_count++;
            }
        }
        #pragma omp atomic
        inside_circle += local_count;
    }
    return inside_circle;
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        printf("Usage: %s <num_samples> <num_threads>\n", argv[0]);
        return 1;
    }
    long long num_samples = atoll(argv[1]);  // Number of random points
    int num_threads = atoi(argv[2]);         // Number of OpenMP threads
    double start_time = omp_get_wtime();
    long long inside_circle = monte_carlo_pi(num_samples, num_threads);
    double end_time = omp_get_wtime();
    double pi_approx = 4.0 * (double)inside_circle / num_samples;

    printf("Approximated π: %.15f\n", pi_approx);
    printf("Error: %.15f\n", fabs(pi_approx - 3.141592653589793));
    printf("Execution Time: %.6f seconds\n", end_time - start_time);
    return 0;
}

Save this code into a file named code.c and compile it with the command

gcc -fopenmp code.c

This generates an executable a.out that we will use to illustrate how to submit jobs. The program accepts two arguments, the number of Monte Carlo samples and the number of parallel threads, and can be executed as ./a.out 1000 8 (1000 samples, run with 8 parallel threads).

SLURM Batch job example

Here is an example of a batch script that can be run on the cluster. In this script we request a single node with 4 CPU cores, used in SMP (shared-memory) mode.

#!/bin/bash

#SBATCH --job-name=<jobname> # give your job a name
#SBATCH --nodes=1            # asking for 1 compute node
#SBATCH --ntasks=1           # 1 task
#SBATCH --cpus-per-task=4    # 4 CPU cores per task, so 4 cores in total         
#SBATCH --time=00:30:00      # set this time according to your need, 30 minutes here
#SBATCH --mem=8GB            # request appropriate amount of RAM
##SBATCH --gres=gpu:1        # if you need to use a GPU, note that this line is commented out
#SBATCH -p <partition_name>  # specify your partition

cd <code_directory>
export OMP_NUM_THREADS=4    # OMP_NUM_THREADS should be equal to the number of cores you are requesting
./a.out 1000000000 4        # running 1000000000 samples on 4 cores

Submit this job to the queue and inspect the output once it’s finished:

sbatch script_name

You can view your jobs in the queue with:

squeue -u <cwid>

Try running a few jobs with different numbers of cores and see how the execution time scales almost linearly.
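Such a scaling experiment could be scripted roughly as follows. This is only a sketch: the file names pi_<cores>.sbatch and the SUBMIT switch are made up for this example, <partition_name> must still be replaced with your partition, and the scripts are assumed to be submitted from the directory that contains a.out. It uses SLURM_CPUS_PER_TASK, which SLURM sets inside the job when --cpus-per-task is given, so the thread count always matches the allocation.

```shell
#!/bin/bash
# Print a batch script that runs ./a.out on the requested number of cores.
# <partition_name> is left as a placeholder and must be edited before use.
make_job_script() {
    cores="$1"
    cat << EOF
#!/bin/bash
#SBATCH --job-name=pi_${cores}
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${cores}
#SBATCH --time=00:30:00
#SBATCH --mem=8GB
#SBATCH -p <partition_name>

export OMP_NUM_THREADS=\${SLURM_CPUS_PER_TASK}
./a.out 1000000000 \${SLURM_CPUS_PER_TASK}
EOF
}

# Write one script per core count; submit only when sbatch is available
# and SUBMIT=1 is set, so the loop is safe to try anywhere.
for cores in 1 2 4 8; do
    make_job_script "$cores" > "pi_${cores}.sbatch"
    if [ "${SUBMIT:-0}" = "1" ] && command -v sbatch >/dev/null 2>&1; then
        sbatch "pi_${cores}.sbatch"
    else
        echo "wrote pi_${cores}.sbatch (not submitted)"
    fi
done
```

Once the jobs finish, compare the "Execution Time" lines in their output files: dividing the 1-core time by the N-core time gives the speedup on N cores.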

The more resources you request from SLURM, the harder it is for SLURM to find room for your job. A parallel job usually runs faster on more CPUs, but it may also wait in the queue for longer. Be aware of this trade-off. There is no universal best strategy: it depends on what kind of resources your particular job needs and how busy the cluster is at the moment.

SLURM interactive job example

Even though interactive jobs are inefficient and not recommended, sometimes there is no other way to do certain things. If you need to run an interactive job, here is how it can be done:

srun --nodes 1 \
    --ntasks-per-node 1 \
    --cpus-per-task 4 \
    --partition=<partition_name> \
    --gres=gpu:1 \
    --pty /bin/bash -i

Once the job has started, you will be dropped into an interactive bash session on one of the compute nodes. The same scheduling considerations apply here: the more resources you request, the longer the potential wait time.

Once you are done with your interactive work, simply run the exit command. It ends your bash process, and with it the whole SLURM job.

Jupyter job

These instructions allow users to launch a jupyter notebook server on the AI cluster’s compute nodes and connect to it from a local browser. A similar approach can be used to connect to other services.

Jupyter jobs are interactive by nature, so all the considerations about interactive jobs and their inefficiencies apply here.

Prepare a python environment (using conda or pip) on the cluster:

working with conda

Load a python module:

# list existing modules with
module avail
# load a suitable python module (miniconda in this example)
module load miniconda3-4.10.3-gcc-12.2.0-hgiin2a

Create a new environment:

conda create -n myvenv python=3.10
# and follow prompts...

Once the new environment is created, activate it:

# initialize conda (open a new shell afterwards for this to take effect)
conda init bash # or "conda init csh" if csh is used
# list existing environments
conda env list
# activate the environment created in the previous step
conda activate myvenv
# if you need to deactivate
conda deactivate

After all this is done, conda will automatically load the base environment upon every login to the cluster. To prevent this, run

conda config --set auto_activate_base false

Use this environment to install jupyter (and other required packages):

conda install jupyter

working with pip

Load a python module:

# list existing modules with
module avail
# load a suitable python module (miniconda in this example)
module load miniconda3-4.10.3-gcc-12.2.0-hgiin2a

Create a new environment and activate it:

# create venv
python -m venv ~/myvenv
# and activate it
source ~/myvenv/bin/activate

Install jupyter into this new virtual environment:

pip install jupyter

Submit a SLURM job

Once the environment is set up, submit a SLURM job. Prepare a SLURM batch script similar to this one (use your own SBATCH arguments, such as the job name and the amount of resources you need; this is only an example):

#!/bin/bash

#SBATCH --job-name=<myJobName> # give your job a name
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00 # set this time according to your need
#SBATCH --mem=3GB # how much RAM will your notebook consume?
#SBATCH --gres=gpu:1 # if you need to use a GPU
#SBATCH -p ai-gpu # specify partition

module purge
module load miniconda3-4.10.3-gcc-12.2.0-hgiin2a
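# if using conda
# (if "conda activate" fails in the batch shell, first run:
#  source "$(conda info --base)/etc/profile.d/conda.sh")
conda activate myvenv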
# if using pip
# source ~/myvenv/bin/activate

# set log file
LOG="/home/${USER}/jupyterjob_${SLURM_JOB_ID}.txt"

# generate a random port
PORT=$(shuf -i 10000-50000 -n 1)

# print useful info
cat << EOF > ${LOG}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~ Slurm Job $SLURM_JOB_ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hello from the Jupyter job!

In order to connect to this jupyter session
setup a tunnel on your local workstation with:
     --->  ssh -t ${USER}@ai-login01.med.cornell.edu -L ${PORT}:localhost:${PORT} ssh ${HOSTNAME} -L ${PORT}:localhost:${PORT}
(copy above command and paste it to your terminal).

Depending on your ssh configuration, you may be
prompted for your password. Once you are logged in,
leave the terminal running and don't close it until
you are finished with your Jupyter session.

Further down look for a line similar to
     ---> http://127.0.0.1:10439/?token=xxxxyyyxxxyyy
Copy this line and paste in your browser
EOF

# start jupyter
jupyter-notebook --no-browser --ip=0.0.0.0 --port=${PORT} 2>&1 | tee -a ${LOG}

Save this file somewhere in your file space on the login node and submit a batch SLURM job with

sbatch script_name.txt

Set up connection to the jupyter servers

The SLURM job will generate a text file ~/jupyterjob_<JOBID>.txt. Follow the instructions in this file to connect to the jupyter session. Two steps need to be taken:

Set up an SSH tunnel
Point your browser to http://127.0.0.1:<port>/?token=...

Once you are done with your jupyter work, save your progress if needed, close the browser tabs and make sure to stop the SLURM job with scancel <jobid>.
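The two pieces of information needed for the connection can also be pulled out of the log file with standard text tools. The sketch below assumes the log has the layout produced by the batch script above and that jupyter prints a URL of the form http://127.0.0.1:<port>/...token=...; the helper name jupyter_info is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the SSH tunnel command and the notebook URL
# found in a job log written by the Jupyter batch script.
jupyter_info() {
    log="$1"
    # The tunnel command is the line marked with "--->" that contains "ssh -t"
    grep -e '--->' "$log" | grep 'ssh -t' | sed 's/^.*---> *//'
    # Jupyter prints its real URL (with token) later in the same log;
    # the "xxxxyyy" line is only the sample from the instructions.
    grep -o 'http://127\.0\.0\.1:[0-9]*/[^ ]*token=[^ ]*' "$log" \
        | grep -v 'xxxxyyy' | tail -n 1
}

# Example:
# jupyter_info ~/jupyterjob_<JOBID>.txt
```

If the URL line is missing, the notebook server has probably not finished starting yet; wait a few seconds and run the helper again.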

Stopping and monitoring SLURM jobs

To stop (cancel) a SLURM job use

scancel <job_id>

Once the job is running, there are a few tools that can help monitor its status. Again, refer to <placeholder> for a detailed SLURM tutorial, but here is a list of some useful commands:

# show status of the queue
squeue -l                      
# only list jobs by a specific user
squeue -l -u <cwid>            
# print partitions info
sinfo                          
# print detailed info about a job
scontrol show job <job_id>
# print detailed info about a node
scontrol show node <node_name>
# get a list of all the jobs executed within last 7 days:
sacct -u <cwid> -S $(date -d "-7 days" +%D) -o "user,JobID,JobName,state,exit"
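For a quick overview of how recent jobs ended, the sacct output can be condensed into a count per job state. This is a minimal sketch: the helper name count_states is made up, and it expects sacct output produced with -n (no header) and -P (fields separated by "|") with the state as the first field.

```shell
#!/bin/bash
# Count jobs per state. Input: sacct output in "-n -P" form
# (no header, "|"-separated fields) with State as the first field.
count_states() {
    # keep the first field, then its first word ("CANCELLED by 123" -> CANCELLED)
    cut -d '|' -f 1 | awk '{ print $1 }' | sort | uniq -c | sort -rn
}

# Example (on the cluster); -X lists whole jobs only, not individual steps:
# sacct -u <cwid> -X -n -P -S "$(date -d '-7 days' +%D)" -o state | count_states
```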