

The AI cluster features high-memory nodes, NVIDIA GPU servers (A100, A40, and L40), an InfiniBand interconnect, and specialized storage designed for AI workloads.


The AI cluster supports special time-limited projects that deal with clinical data. Resources for such projects are granted on special request; contact for more information about this.

Login to the AI cluster

The AI cluster is accessible via SSH terminal sessions. You must connect from the WCM network or have the VPN installed and enabled. Replace <cwid> with your credentials.



Mount point



Is backed up?





Home filesystem. Used to keep small files, configs, code, scripts, etc.


Has limited space. It is only used for small files.



varies per lab

each lab has an allocation under /midtier/<labname>/scratch/<cwid>

intended for data that is actively being used or processed, such as research datasets







Parallel file system for data-intensive workloads. Limited access, granted on special request.
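As a small illustration of the lab scratch layout described above, the following sketch builds the scratch path and reports its usage. It assumes the pattern is /midtier/<labname>/scratch/<cwid>; the lab name and CWID in the example are hypothetical, and the helper name scratch_dir is introduced here only for illustration.

```shell
#!/bin/bash
# Build the scratch path from the pattern /midtier/<labname>/scratch/<cwid>.
scratch_dir() {
    # $1 = lab name, $2 = CWID (both must be supplied by you)
    printf '/midtier/%s/scratch/%s\n' "$1" "$2"
}

# Hypothetical values, for illustration only.
dir=$(scratch_dir "examplelab" "abc1234")
echo "Scratch directory: ${dir}"

# Report usage only if the directory actually exists on this machine.
if [ -d "${dir}" ]; then
    du -sh "${dir}"   # total size of your scratch data
    df -h "${dir}"    # free space on the underlying filesystem
fi
```

On a machine where the directory does not exist, the script only prints the path.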

Common File Management




Software applications

Access to applications is managed with modules. Refer to <placeholder> for a detailed tutorial on modules; here is a quick list of commands that can be used on the AI cluster:

Code Block
# list all the available modules:
module avail
# list currently loaded modules:
module list
# load a module:
module load <module_name>
# unload the module:
module unload <module_name>
# swap versions of the application
module swap <module_name>/<version1> <module_name>/<version2>
# unload all modules
module purge
# get help
module help
# get more info for a particular module
module help <module_name>

If an application that you need is not listed in the output of module avail, contact and request that it be installed on the cluster.
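When modules are loaded from inside a script, it can help to check availability first. The following is a minimal sketch, assuming that module avail prints matching module names (many module systems print them to stderr, which is why the streams are merged). The helper name load_if_available is hypothetical.

```shell
#!/bin/bash
# Load a module only when "module avail" lists it; report the problem otherwise.
load_if_available() {
    name="$1"
    # "module" is a shell function provided by the cluster environment.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found (not on the cluster?)" >&2
        return 2
    fi
    # Merge stderr into stdout so grep sees the listing on either stream.
    if module avail "${name}" 2>&1 | grep -qF -- "${name}"; then
        module load "${name}"
    else
        echo "module ${name} is not available" >&2
        return 1
    fi
}

# Example with the miniconda module used later on this page.
load_if_available miniconda3-4.10.3-gcc-12.2.0-hgiin2a \
    || echo "module could not be loaded"
```

The function returns 0 when the module was loaded, 1 when it is not listed, and 2 when the module command itself is missing.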

Running jobs

Computational jobs on the AI cluster are managed by the SLURM job manager. We provide an in-depth tutorial on how to use SLURM <placeholder>, but this section covers some basic examples that are immediately applicable on the AI cluster.

Important notice



Do not run computations on login nodes

Running your application code directly without submitting it through the scheduler is prohibited. Login nodes are shared resources and they are reserved for light tasks like file management and job submission. Running heavy computations on login nodes can degrade performance for all users. Instead, please submit your compute jobs to the appropriate SLURM queue, which is designed to handle such workloads efficiently.

Batch vs interactive jobs

There are two mechanisms to run SLURM jobs: “batch” and “interactive”. Interactive jobs are an inefficient way to utilize the cluster. By their nature, these jobs require the system to wait for user input, leaving the allocated resources idle during those periods. Since HPC clusters are designed to maximize resource utilization and efficiency, having nodes sit idle while still consuming CPU, memory, or GPU resources is counterproductive.
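To make the two mechanisms concrete, here is a minimal sketch of a batch job. All resource values and the file name hello_job.sh are placeholders chosen for illustration, not recommended settings for the AI cluster, and no partition is specified because partition names are not given on this page.

```shell
#!/bin/bash
# Write a minimal batch script; every value below is a placeholder.
cat > hello_job.sh <<'EOF'
#!/bin/bash
# name shown in the queue
#SBATCH --job-name=hello
# one task using one CPU core
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# memory and wall-time limit (HH:MM:SS)
#SBATCH --mem=1G
#SBATCH --time=00:05:00
# %j expands to the job ID
#SBATCH --output=hello_%j.out

echo "Running on $(hostname)"
EOF

# Submit only where SLURM is available (that is, on the cluster).
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_job.sh
else
    echo "sbatch not found; hello_job.sh was written but not submitted"
fi

# Interactive counterpart (run manually, and exit as soon as you are done):
#   srun --ntasks=1 --cpus-per-task=1 --mem=1G --time=00:30:00 --pty bash -i
```

The batch form releases its resources as soon as the script finishes, whereas the srun session holds them until you exit the shell or the time limit is reached.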


Try running a few jobs with different numbers of cores and see how the run time scales almost linearly.
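One way to set up such a scaling test is sketched below. It assumes a batch script named my_parallel_job.sh (a hypothetical name) that sizes its work from the number of allocated cores, for example by reading SLURM_CPUS_PER_TASK. Options given on the sbatch command line take precedence over #SBATCH lines in the script, so the same script can be reused for every core count.

```shell
#!/bin/bash
# Hypothetical job script name; replace it with your own batch script.
JOB_SCRIPT="${JOB_SCRIPT:-my_parallel_job.sh}"

# Print one sbatch command per core count so the plan can be reviewed first.
scaling_commands() {
    for ncpu in 1 2 4 8; do
        echo "sbatch --job-name=scale_${ncpu} --cpus-per-task=${ncpu} ${JOB_SCRIPT}"
    done
}

scaling_commands

# After reviewing the output, submit on the cluster with:
#   scaling_commands | sh
# and compare run times once the jobs have finished:
#   sacct -u <cwid> -o JobID,JobName,AllocCPUS,Elapsed
```

Printing the commands before running them makes it easy to adjust the core counts without accidentally submitting jobs.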


The more resources you request from SLURM, the harder it is for SLURM to allocate space for your job. A parallel job generally runs faster with more CPUs, but it may also wait in the queue longer. Be aware of this trade-off; there is no universal answer on the best strategy. It usually depends on what kind of resources your particular job needs and how busy the cluster is at the moment.


These instructions allow users to launch a Jupyter notebook server on the AI cluster’s compute nodes and connect to it from a local browser. A similar approach can be used to connect to other services.



Jupyter jobs are interactive by nature, so all the considerations about interactive jobs and their inefficiencies apply here.

Prepare a Python environment (using conda or pip) on the cluster:


  1. Load a Python module

    Code Block
    # list existing modules
    module avail
    # load suitable python module (miniconda in this example)
    module load miniconda3-4.10.3-gcc-12.2.0-hgiin2a
  2. Create new environment and activate it

    Code Block
    # create venv
    python -m venv ~/myvenv
    # and activate it
    source ~/myvenv/bin/activate
  3. Install Jupyter into this new virtual environment:

    Code Block
    pip install jupyter

Submit a SLURM job



Once the Python environment is ready, submit a SLURM job. Prepare a SLURM batch script for the Jupyter server, using your own SBATCH arguments, such as the job name and the amount of resources you need.
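The following is a sketch of what such a batch script might look like, not the cluster's official script. It writes the script to a file named jupyter_job.sh (a hypothetical name). The resource values, the port range, the contents of the instruction file, and the <login_node> placeholder are all assumptions; only the module name, the ~/myvenv path, and the ~/jupyterjob_<JOBID>.txt file name come from this page.

```shell
#!/bin/bash
# Write a sketch of a Jupyter batch script to jupyter_job.sh.
cat > jupyter_job.sh <<'SCRIPT'
#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=jupyter_%j.log

# Same module and virtual environment as in the preparation steps.
module load miniconda3-4.10.3-gcc-12.2.0-hgiin2a
source ~/myvenv/bin/activate

# Pick a random high port to reduce the chance of a clash (assumption).
port=$(shuf -i 8000-9999 -n 1)
node=$(hostname -s)

# Record how to reach the server; <cwid> and <login_node> are placeholders.
cat > ~/jupyterjob_${SLURM_JOB_ID}.txt <<INFO
1. On your local machine, open an SSH tunnel:
   ssh -N -L ${port}:${node}:${port} <cwid>@<login_node>
2. In your local browser, open http://localhost:${port}
   and use the token printed in the job's log file.
INFO

jupyter notebook --no-browser --ip="${node}" --port="${port}"
SCRIPT

# Check the generated script for shell syntax errors before submitting it.
bash -n jupyter_job.sh && echo "jupyter_job.sh written"
```

The server runs in the foreground of the job, so cancelling the job or reaching the time limit also stops Jupyter.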


Save this file somewhere in your file space on the login node and submit a batch SLURM job with

Code Block
sbatch script_name.txt 

Set up a connection to the Jupyter server

The SLURM job will generate a text file ~/jupyterjob_<JOBID>.txt. Follow the instructions in this file to connect to the Jupyter session. Two steps need to be taken:



Once you are done with your Jupyter work, save your progress if needed, close the browser tabs, and make sure to stop the SLURM job with scancel <jobid>.

Stopping and monitoring SLURM jobs

To stop (cancel) a SLURM job use

Code Block
scancel <job_id>

Once the job is running, there are a few tools that can help monitor its status. Again, refer to <placeholder> for a detailed SLURM tutorial, but here is a list of some useful commands:

Code Block
# show status of the queue
squeue -l
# only list jobs by a specific user
squeue -l -u <cwid>
# print partitions info
sinfo
# print detailed info about a job
scontrol show job <job_id>
# print detailed info about a node
scontrol show node <node_name>
# get a list of all the jobs executed within the last 7 days:
sacct -u <cwid> -S $(date -d "-7 days" +%D) -o "user,JobID,JobName,state,exit"
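These commands can be combined in a script. The sketch below polls squeue until a job leaves the queue; the helper name wait_for_job and the default 30-second interval are assumptions made for illustration, and the script name in the usage comment is hypothetical.

```shell
#!/bin/bash
# Block until the given job ID no longer appears in the queue.
wait_for_job() {
    job_id="$1"
    interval="${2:-30}"   # seconds between polls; keep this gentle on the scheduler
    # -h drops the header, so empty output means the job has left the queue.
    while [ -n "$(squeue -h -j "${job_id}" 2>/dev/null)" ]; do
        sleep "${interval}"
    done
    echo "job ${job_id} is no longer in the queue"
}

# Typical use on the cluster:
#   job_id=$(sbatch --parsable my_job.sh)
#   wait_for_job "${job_id}"
#   sacct -j "${job_id}" -o JobID,JobName,State,ExitCode,Elapsed
```

With --parsable, sbatch prints only the job ID (followed by a cluster name on multi-cluster setups), which makes it convenient to capture in a variable.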