...

This cluster features high-memory nodes, NVIDIA GPU servers (A100, A40, and L40), an InfiniBand interconnect, and specialized storage designed for AI workloads.

Info

The AI cluster allows special time-limited projects dealing with clinical data. Resources for such projects are granted upon special request; contact scu@med.cornell.edu for more information.

Login to the AI cluster

The AI cluster is accessible via terminal SSH sessions. You must either be connecting from the WCM network or have the VPN installed and enabled. Replace <cwid> with your CWID.
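
As a minimal sketch, a login session from a terminal looks like the following; the login host name is written as a placeholder here, so substitute the actual AI cluster login address:

Code Block
languagebash
# open an SSH session to the AI cluster login node
# (<ai_cluster_login_host> is a placeholder for the actual login address)
ssh <cwid>@<ai_cluster_login_host>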

...

| Name | Mount point | Size | Use | Is backed up? | Comment |
| --- | --- | --- | --- | --- | --- |
| Home | /home | 2 TB | Home filesystem. Used to keep small files, configs, code, scripts, etc. | no | Has limited space; only use it for small files. |
| Midtier | /midtier/<labname> | varies per lab | Each lab has an allocation under /midtier/<labname>/scratch/<cwid>. Intended for data that is actively being used or processed, and for research datasets. | no | |
| AI GPFS | /bhii | 700 TB | tbd | no | Parallel file system for data-intensive workloads. Limited access, granted on special request. |
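
As a quick sketch using standard Linux tools (not cluster-specific commands), you can check free space on these filesystems and the size of your own directories:

Code Block
languagebash
# show total, used, and available space for the mounted filesystems
df -h /home /midtier/<labname> /bhii
# show how much space your home directory currently uses
du -sh ~
# show per-directory usage under your midtier scratch allocation
du -sh /midtier/<labname>/scratch/<cwid>/*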

Common File Management

...


...

Software applications

Access to applications is managed with modules. Refer to <placeholder> for a detailed tutorial on modules, but here is a quick list of commands that can be used on the AI cluster:

Code Block
languagebash
# list all the available modules:
module avail
# list currently loaded modules:
module list
# load a module:
module load <module_name>
# unload a module:
module unload <module_name>
# swap versions of an application:
module swap <module_name>/<version1> <module_name>/<version2>
# unload all modules:
module purge
# get help:
module help
# get more info for a particular module:
module help <module_name>
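
As a usage sketch, a typical module session might look like the following; the module name and version are hypothetical, so pick real ones from the module avail output:

Code Block
languagebash
# start from a clean environment
module purge
# load a module (the name and version here are hypothetical examples)
module load python/3.10
# confirm what is currently loaded
module list
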
Info

If you can’t find an application you need in the module avail output, contact scu@med.cornell.edu and request that it be installed on the cluster.

Running jobs

Computational jobs on the AI cluster are managed with the SLURM workload manager. We provide an in-depth tutorial on how to use SLURM <placeholder>, but this section covers some basic examples that are immediately applicable on the AI cluster.

Important notice

...

Warning

Do not run computations on login nodes

Running your application code directly without submitting it through the scheduler is prohibited. Login nodes are shared resources and they are reserved for light tasks like file management and job submission. Running heavy computations on login nodes can degrade performance for all users. Instead, please submit your compute jobs to the appropriate SLURM queue, which is designed to handle such workloads efficiently.

Batch vs interactive jobs

There are two mechanisms to run SLURM jobs: “batch” and “interactive”. Interactive jobs are an inefficient way to utilize the cluster. By their nature, these jobs require the system to wait for user input, leaving the allocated resources idle during those periods. Since HPC clusters are designed to maximize resource utilization and efficiency, having nodes sit idle while still consuming CPU, memory, or GPU resources is counterproductive.
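
As a minimal sketch of a batch submission, the script below shows the general shape of an sbatch job; the partition name, GPU request, and resource numbers are illustrative assumptions, so check sinfo for the partitions and GPUs actually available on the AI cluster:

Code Block
languagebash
#!/bin/bash
# Minimal example batch script: save as my_job.sbatch and submit with `sbatch my_job.sbatch`
#SBATCH --job-name=my_job          # job name shown in squeue
#SBATCH --partition=<partition>    # partition name (see sinfo)
#SBATCH --gres=gpu:1               # GPU request (count here is illustrative)
#SBATCH --cpus-per-task=4          # CPU cores for the task
#SBATCH --mem=32G                  # memory for the job
#SBATCH --time=02:00:00            # wall-time limit (HH:MM:SS)
#SBATCH --output=%x_%j.out         # output file (%x = job name, %j = job id)

# load the software the job needs (module name is hypothetical)
module load python/3.10

# run the actual computation
python my_script.py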

...

Note

Once you are done with your Jupyter work, save your progress if needed, close the browser tabs, and make sure to stop the SLURM job with scancel <jobid>.

Stopping and monitoring SLURM jobs

To stop (cancel) a SLURM job, use:

Code Block
languagebash
scancel <job_id>
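
A couple of standard scancel variants can also be useful:

Code Block
languagebash
# cancel all of your own jobs
scancel -u <cwid>
# cancel jobs by job name
scancel --name=<job_name>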

Once a job is running, there are a few tools that can help you monitor its status. Again, refer to <placeholder> for a detailed SLURM tutorial, but here is a list of some useful commands:

Code Block
languagebash
# show status of the queue
squeue -l                      
# only list jobs by a specific user
squeue -l -u <cwid>            
# print partitions info
sinfo                          
# print detailed info about a job
scontrol show job <job id>     
# print detailed info about a node
scontrol show node <node_name> 
# get a list of all the jobs executed within last 7 days:
sacct -u <cwid> -S $(date -d "-7 days" +%D) -o "user,JobID,JobName,state,exit"
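
As a follow-up sketch, sacct can also report basic resource usage for a single job once it has finished; the fields below are standard sacct format fields:

Code Block
languagebash
# show elapsed time, peak memory use, state, and exit code for one job
sacct -j <job_id> -o JobID,JobName,Elapsed,MaxRSS,State,ExitCode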