AI Cluster
Overview
In collaboration with Dr. Mert Sabuncu from Radiology, the ITS team has established the framework for a new high-performance computing (HPC) cluster dedicated to AI/ML workflows, such as training neural networks for imaging or LLMs.
This cluster features high-memory nodes, NVIDIA GPU servers (A100, A40 and L40), an InfiniBand interconnect, and specialized storage designed for AI workloads.
The AI cluster also supports special time-limited projects that deal with clinical data. Resources for such projects are granted on special request; contact scu@med.cornell.edu for more information.
Login to the AI cluster
The AI cluster is accessible via terminal SSH sessions. You need to connect from the WCM network, or have the VPN installed and enabled. Replace <cwid> with your own CWID.
ssh <cwid>@ai-login01.med.cornell.edu
# or
ssh <cwid>@ai-login02.med.cornell.edu
Once logged on:
Last login: Fri Jan 3 11:35:53 2025 from 157.000.00.00
<cwid>@ai-login01:~$
<cwid>@ai-login01:~$ pwd
/home/<cwid>
<cwid>@ai-login01:~$
Storage
The AI cluster has the following storage systems configured:
Name | Mount point | Size | Use | Is backed up? | Comment |
---|---|---|---|---|---|
Home | | 2 TB | Home filesystem for small files: configs, code, scripts, etc. | no | Limited space; intended only for small files |
Midtier | | varies per lab | Per-lab allocation for data that is actively being used or processed, such as research datasets | no | |
AI GPFS | | 700 TB | tbd | no | Parallel file system for data-intensive workloads. Limited access, granted on special request. |
Software applications
Access to applications is managed with modules. Refer to <placeholder> for a detailed tutorial on modules, but here is a quick list of commands that can be used on the AI cluster:
# list all the available modules:
module avail
# list currently loaded modules:
module list
# load a module:
module load <module_name>
# unload the module:
module unload <module_name>
# swap versions of the application
module swap <module_name>/<version1> <module_name>/<version2>
# unload all modules
module purge
# get help
module help
# get more info for a particular module
module help <module_name>
If you can’t find an application that you need in the output of the module avail command, contact scu@med.cornell.edu and request that it be installed on the cluster.
Running jobs
Computational jobs on the AI cluster are managed by the SLURM job scheduler. We provide an in-depth tutorial on how to use SLURM <placeholder>, but this section covers some basic examples that are immediately applicable on the AI cluster.
Important notice
Do not run computations on login nodes
Running your application code directly without submitting it through the scheduler is prohibited. Login nodes are shared resources and they are reserved for light tasks like file management and job submission. Running heavy computations on login nodes can degrade performance for all users. Instead, please submit your compute jobs to the appropriate SLURM queue, which is designed to handle such workloads efficiently.
Batch vs interactive jobs
There are two mechanisms to run SLURM jobs: “batch” and “interactive”. Interactive jobs are an inefficient way to utilize the cluster. By their nature, these jobs require the system to wait for user input, leaving the allocated resources idle during those periods. Since HPC clusters are designed to maximize resource utilization and efficiency, having nodes sit idle while still consuming CPU, memory, or GPU resources is counterproductive.
For example:
If you're running an interactive session and step away or take time to analyze output, the allocated resources remain reserved but unused.
This idle time adds up across multiple users, leading to significant underutilization of the cluster.
Because of these considerations, we highly recommend that users run as much of their computation as possible in batch mode, so that WCM’s research community can make the most of the cluster's capabilities.
Code example
To illustrate how to run computational jobs, consider the following toy problem, implemented in C. It estimates the value of π using a random sampling (Monte Carlo) method:
Generate random points (x, y) in a unit square (0 ≤ x, y ≤ 1).
Count how many points fall inside the quarter circle (x² + y² ≤ 1).
The fraction of points that fall inside the quarter circle approximates π/4, so π ≈ 4 × (points inside circle) / (total points).
Save this code into a file code.c and compile it. The compiler generates an executable a.out that we will use to illustrate how to submit jobs. The program accepts two arguments, the number of Monte Carlo samples and the number of parallel threads, and can be executed as ./a.out 1000 8 (1000 samples, run with 8 parallel threads).
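The steps above can be sketched as follows. This is one possible implementation, not necessarily the one used on the cluster: it assumes OpenMP for the threading, a small xorshift generator for the random numbers, and a GCC toolchain on your PATH. The function and variable names are illustrative.

```shell
# Write an illustrative C source file (OpenMP and the xorshift PRNG are assumptions).
cat > code.c <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

/* xorshift64 generator: each thread keeps its own state, so no locking is needed. */
static double next_uniform(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return (double)(x >> 11) / 9007199254740992.0;  /* 53 random bits -> [0, 1) */
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <samples> <threads>\n", argv[0]);
        return 1;
    }
    long long samples = atoll(argv[1]);
    int threads = atoi(argv[2]);
    if (samples <= 0 || threads <= 0) {
        fprintf(stderr, "both arguments must be positive integers\n");
        return 1;
    }

    long long inside = 0;
    #pragma omp parallel num_threads(threads) reduction(+:inside)
    {
        /* Distinct, non-zero seed per thread. */
        uint64_t state = 0x9E3779B97F4A7C15ULL * (uint64_t)(omp_get_thread_num() + 1);
        #pragma omp for
        for (long long i = 0; i < samples; i++) {
            double x = next_uniform(&state);
            double y = next_uniform(&state);
            if (x * x + y * y <= 1.0) inside++;   /* point is inside the quarter circle */
        }
    }

    printf("pi ~= %.6f (%lld samples, %d threads)\n",
           4.0 * (double)inside / (double)samples, samples, threads);
    return 0;
}
EOF

# Compile with OpenMP enabled; without -o the executable is named a.out.
if command -v gcc >/dev/null 2>&1; then
    gcc -O2 -fopenmp code.c && ./a.out 1000 8
else
    echo "gcc not found: make a C compiler available first (for example with module load)"
fi
```

With only 1000 samples the estimate is rough; the error shrinks roughly with the square root of the sample count.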
SLURM Batch job example
Here is an example of a batch script that can be run on the cluster. In this script we request a single node with 4 CPU cores, used in SMP (shared-memory) mode.
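A minimal sketch of such a script is shown here, written to a file with a heredoc. The file name pi_job.sh, the job name, and the memory and time limits are placeholders, and no partition is named because partition names are specific to your allocation (add --partition if yours requires one).

```shell
# Write an example batch script (file name and resource values are placeholders).
cat > pi_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=pi_estimate
#SBATCH --nodes=1                 # a single node
#SBATCH --ntasks=1                # one process ...
#SBATCH --cpus-per-task=4         # ... with 4 CPU cores for its threads (SMP)
#SBATCH --mem=1G
#SBATCH --time=00:10:00
#SBATCH --output=pi_%j.out        # %j is replaced by the job ID

# SLURM sets SLURM_CPUS_PER_TASK inside the job; fall back to 1 elsewhere.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(hostname) with ${threads} thread(s)"
./a.out 100000000 "${threads}"
EOF
```

Passing SLURM_CPUS_PER_TASK to the program keeps the thread count in step with the number of cores actually granted to the job.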
Submit this job to the queue and inspect the output once it’s finished:
You can view your jobs in the queue with:
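Submitting the script and checking the queue might look like the sketch below. The script name pi_job.sh is a placeholder for whatever file holds your batch script, and the guard only exists so the snippet does nothing harmful on a machine without SLURM.

```shell
job_script=pi_job.sh   # placeholder: the file that holds your batch script

if command -v sbatch >/dev/null 2>&1; then
    sbatch "${job_script}"     # prints "Submitted batch job <jobid>"
    squeue -u "${USER}"        # list your pending and running jobs
    # When the job has finished, read the file named by --output in the script
    # (slurm-<jobid>.out in the submission directory if --output is not set).
else
    echo "sbatch not found: run these commands on an AI cluster login node"
fi
```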
Try running a few jobs with different numbers of cores and see how the run time scales almost linearly.
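One way to run such a comparison is sketched here. It assumes the batch script is called pi_job.sh (a placeholder), that it passes SLURM_CPUS_PER_TASK to the program as the thread count, and that the sample count is large enough for the run time to be measurable.

```shell
job_script=pi_job.sh     # placeholder: the file that holds your batch script
core_counts="1 2 4 8"    # core counts to compare

if command -v sbatch >/dev/null 2>&1; then
    for cores in ${core_counts}; do
        # Options given on the command line take precedence over #SBATCH lines.
        sbatch --cpus-per-task="${cores}" --job-name="pi_${cores}" "${job_script}"
    done
    # After the jobs finish, compare their elapsed times, for example with:
    #   sacct -u "${USER}" --format=JobID,JobName,AllocCPUS,Elapsed,State
else
    echo "sbatch not found: run these commands on an AI cluster login node"
fi
```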
SLURM interactive job example
Even though interactive jobs are inefficient and not recommended, sometimes there is no other way to do certain things. If you need to run an interactive job, here is how it can be done:
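A common way to start an interactive job is srun with a pseudo-terminal, as in this sketch. The resource values are placeholders; request only what you need, and add a partition option if your allocation requires one.

```shell
cpus=4              # placeholder resource requests
mem=8G
walltime=01:00:00

if command -v srun >/dev/null 2>&1; then
    # --pty attaches your terminal to a bash shell running on the allocated compute node.
    srun --nodes=1 --ntasks=1 --cpus-per-task="${cpus}" --mem="${mem}" \
         --time="${walltime}" --pty bash -i
else
    echo "srun not found: run this command on an AI cluster login node"
fi
```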
Once the job has started, you will be dropped into an interactive bash session on one of the compute nodes. The same scheduling considerations apply here: the more resources you request, the longer the potential wait time.
Once you are done with your interactive work, simply run the exit command. It ends your bash process, and the whole SLURM job ends with it.
Jupyter job
These instructions allow users to launch a Jupyter notebook server on the AI cluster’s compute nodes and connect to it from a local browser. A similar approach can be used to connect to other services.
Prepare python environment (using conda or pip) on the cluster:
working with conda
load python module
create new environment
once this new environment is created, activate it with
After all this is done, conda will automatically activate the base environment upon every login to the cluster. To prevent this, disable automatic activation of the base environment in the conda configuration.
Use this env to install jupyter (and other required packages)
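The conda steps above might look like the following sketch. The environment name and Python version are hypothetical, and the name of the Python module is not shown because it depends on what module avail lists on the cluster.

```shell
env_name=jupyter-env   # hypothetical environment name

# First load the Python/conda module; pick its name from "module avail".
if command -v conda >/dev/null 2>&1; then
    conda create --yes --name "${env_name}" python=3.11   # version is an example
    # Make "conda activate" usable in this shell, then switch to the new environment.
    eval "$(conda shell.bash hook)"
    conda activate "${env_name}"
    # Stop conda from activating the base environment at every login.
    conda config --set auto_activate_base false
    # Install Jupyter; add any other packages you need to the same command.
    conda install --yes jupyter
else
    echo "conda not found: load the Python/conda module first (see 'module avail')"
fi
```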
working with pip
load python module
Create new environment and activate it
Install jupyter into this new virtual environment:
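With pip, the same preparation could be sketched as below, using Python's built-in venv module. The location of the virtual environment is a hypothetical choice, and the Python module name is left out because it depends on the cluster's module list.

```shell
venv_dir="${HOME}/jupyter-venv"   # hypothetical location for the virtual environment

# Only meaningful where the module system exists, i.e. on the cluster.
if command -v module >/dev/null 2>&1; then
    # Load a Python module first; pick its name from "module avail".
    python3 -m venv "${venv_dir}"            # create the environment
    source "${venv_dir}/bin/activate"        # activate it in the current shell
    python -m pip install --upgrade pip
    python -m pip install jupyter            # add other required packages here
else
    echo "module command not found: run these commands on an AI cluster login node"
fi
```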
Submit a SLURM job
Once the Python environment is ready, submit a SLURM job. Prepare a SLURM batch script similar to the following. This is only an example: use your own SBATCH arguments, such as the job name and the amount of resources you need.
Save this file somewhere in your file space on the login node and submit a batch SLURM job with
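A sketch of what such a script and its submission could look like follows. Everything in it is an assumption made for illustration: the file name jupyter_job.sh, the resource values, the virtual environment path, the port range, and the wording of the generated instructions. It also assumes that compute nodes are reachable by host name from the login node.

```shell
# Write an example batch script for the notebook server (all values are placeholders).
cat > jupyter_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --output=jupyter_%j.log

# Activate the environment that contains jupyter (hypothetical venv path;
# for conda use: eval "$(conda shell.bash hook)" && conda activate jupyter-env).
source "${HOME}/jupyter-venv/bin/activate"

port=$(shuf -i 20000-30000 -n 1)   # random port to lower the chance of a clash
node=$(hostname -s)                # compute node the job landed on

# Leave connection instructions where the user can find them.
cat > "${HOME}/jupyterjob_${SLURM_JOB_ID}.txt" <<INFO
1. On your local machine, set up an SSH tunnel:
   ssh -N -L ${port}:${node}:${port} ${USER}@ai-login01.med.cornell.edu
2. Point your browser to http://127.0.0.1:${port}/?token=<token>
   (the token is printed in jupyter_${SLURM_JOB_ID}.log in the submission directory).
INFO

# Start the server without opening a browser on the compute node.
jupyter notebook --no-browser --ip=0.0.0.0 --port="${port}"
EOF

if command -v sbatch >/dev/null 2>&1; then
    sbatch jupyter_job.sh
else
    echo "sbatch not found: submit jupyter_job.sh from an AI cluster login node"
fi
```

The server keeps running until the job reaches its time limit or is cancelled, so cancel the job when you are finished to release the resources.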
Set up connection to the jupyter servers
The SLURM job will generate a text file ~/jupyterjob_<JOBID>.txt. Follow the instructions in this file to connect to the Jupyter session. Two steps need to be taken:
Set up an SSH tunnel
Point your browser to http://127.0.0.1:<port>/?token=...
Stopping and monitoring SLURM jobs
To stop (cancel) a SLURM job, use the scancel command followed by the job ID.
Once the job is running, there are a few tools that can help monitor its status. Again, refer to <placeholder> for a detailed SLURM tutorial, but here is a list of some useful commands: