...
Once this is done, you'll need to set up your environment to use Spack, our cluster-wide package manager. This is described here: Spack
Note: we now have a new Spack environment at /software/spack/centos7/share/spack/setup-env.sh, so source that file and do not use the older one.
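For example, a minimal sketch of activating the environment and loading Relion from a login shell (the exact hash/spec of the Relion build on your cluster may differ; check with spack find):
Code Block
# activate the cluster-wide Spack environment
source /software/spack/centos7/share/spack/setup-env.sh

# see which Relion builds are installed, with their hashes
spack find -lv relion

# load Relion (and its dependencies) by hash -- /ii7uzb5 is the 3.1_beta
# build used in the example scripts below; yours may differ
spack load -r /ii7uzb5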
...
Relion-specific Environment Setup
...
- Under the "Running" tab for a given calculation, set the option "Submit to queue?" to "Yes".
- You'll need to set "Queue name:" to the Slurm partition you wish to use (you can check which partitions are available with sinfo; see the sketch after this list):
- If you want to only run on CPUs, then set this to "cryo-cpu"
- If you want to run on the shared cryoEM GPUs, then set this to "cryo-gpu"
- If you want to run on a lab-reserved node, then set this to the appropriate partition (e.g. "blanchard_reserve")
- Set "Queue submit command" to "sbatch"
- Set the number of nodes/GPUs to the desired values
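If you're not sure which partitions you have access to, you can check from a login node with standard Slurm commands; this is just a quick sketch using the partition names mentioned above and in the example scripts below:
Code Block
# show the state, node count, and time limit of the cryoEM partitions
sinfo -p cryo-cpu,cryo-gpu,cryo-gpu-v100

# show the GPU resources (GRES) advertised by the nodes in a GPU partition
sinfo -p cryo-gpu-v100 -o "%P %D %G %l"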
Relion requires a set of template Slurm submission scripts that enable it to submit jobs. You'll need to set "Standard submission script" to the path of the appropriate template script. A set of template submission scripts, which are no longer maintained, can be found here:
Code Block
/softlib/apps/EL7/slurm_relion_submit_templates  # select the template script that is appropriate for your job
- Give the "Current job" an alias if desired and click "Run!" (this assumes everything else is set). Checking job status (and other Slurm functionality) is described here and here; a quick command-line sketch follows below.
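For quick reference, here is a minimal sketch of checking on a submitted job from the command line using standard Slurm commands (the job ID is just an example):
Code Block
# list your queued and running jobs
squeue -u $USER

# show accounting information (state, elapsed time, exit code) for a job
sacct -j 123456 --format=JobID,JobName,Partition,State,Elapsed,ExitCode

# cancel a job if needed
scancel 123456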
...
Code Block
#!/bin/bash
#SBATCH --job-name=n1_bench
#SBATCH -p cryo-gpu-v100
#SBATCH --mem=170g
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4

cd /athena/scu/scratch/dod2014/relion_2

source /software/spack/centos7/share/spack/setup-env.sh

# 3.1_beta skylake openmpi 4.0.1
spack load -r /ii7uzb5
# 3.0.8 w openmpi 4 + slurm
#spack load -r /sfp6sf5
# 3.1_beta w openmpi 4 + slurm
#spack load -r /ii7uzb5

# set user up front so the output and scratch paths below are populated
user=$SLURM_JOB_USER

mkdir -pv Refine3D/quackmaster/run_single_node/${user}_${SLURM_JOB_ID}

mpirun -display-allocation -display-map -v -np 5 relion_refine_mpi \
    --o Refine3D/quackmaster/run_single_node/${user}_${SLURM_JOB_ID} \
    --split_random_halves --i Select/job088/particles.star \
    --ref 012218_CS_256.mrc --firstiter_cc --ini_high 30 \
    --dont_combine_weights_via_disc --scratch_dir /scratchLocal --pad 2 --ctf \
    --particle_diameter 175 --flatten_solvent --zero_mask --oversampling 1 \
    --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 \
    --sym C1 --low_resol_join_halves 40 --norm --scale --j 4 --gpu --pool 100 --auto_refine

# report whether relion crashes
# remove particles from scratch if it crashes
if [ $? -eq 0 ]
then
    echo -e "$SLURM_JOB_ID exited successfully"
else
    user=$SLURM_JOB_USER
    MY_TMP_DIR=/scratchLocal/${user}_${SLURM_JOB_ID}
    echo -e "$SLURM_JOB_ID failed so cleaning up $MY_TMP_DIR"
    rm -rf $MY_TMP_DIR
fi
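To run a script like the one above outside the Relion GUI, save it to a file and submit it with sbatch; the filename and job ID below are just examples:
Code Block
# submit the job script
sbatch relion_refine_single_node.sh

# follow the job's output as it runs; by default sbatch writes to slurm-<jobid>.out
# in the directory you submitted from
tail -f slurm-123456.out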
Multi-node jobs:
If you want your job to finish sooner, and there are idle nodes, you can run a single job across multiple nodes at once. Through the magic of OpenMPI and low-latency RDMA networking, each iteration, and therefore the entire job, will finish sooner than it would on a single node.
...
To achieve this, change the Slurm submission options and the Relion command-line arguments. For example, the single-node script above can be submitted to 3 nodes with the following changes:
Code Block
#!/bin/bash
#SBATCH --job-name=n3_bench
#SBATCH -p cryo-gpu-v100
#SBATCH --mem=170g
#SBATCH --nodes=3
#SBATCH --ntasks=13
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4

# same environment setup as the single-node script
cd /athena/scu/scratch/dod2014/relion_2
source /software/spack/centos7/share/spack/setup-env.sh
# 3.1_beta skylake openmpi 4.0.1
spack load -r /ii7uzb5

# set user up front so the output and scratch paths below are populated
user=$SLURM_JOB_USER

mkdir -pv Refine3D/quackmaster/run_multi_node/${user}_${SLURM_JOB_ID}

mpirun -display-allocation -display-map -v -n 13 relion_refine_mpi \
    --o Refine3D/quackmaster/run_multi_node/${user}_${SLURM_JOB_ID} \
    --split_random_halves --i Select/job088/particles.star \
    --ref 012218_CS_256.mrc --firstiter_cc --ini_high 30 \
    --dont_combine_weights_via_disc --scratch_dir /scratchLocal --pad 2 --ctf \
    --particle_diameter 175 --flatten_solvent --zero_mask --oversampling 1 \
    --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 \
    --sym C1 --low_resol_join_halves 40 --norm --scale --j 6 --gpu --pool 100 --auto_refine

# report whether relion crashes
# remove particles from scratch if it crashes
if [ $? -eq 0 ]
then
    echo -e "$SLURM_JOB_ID exited successfully"
else
    user=$SLURM_JOB_USER
    MY_TMP_DIR=/scratchLocal/${user}_${SLURM_JOB_ID}
    echo -e "$SLURM_JOB_ID failed so cleaning up $MY_TMP_DIR"
    rm -rf $MY_TMP_DIR
fi
...
In the script above, notice the --ntasks=13 and mpirun -n 13 options.
These ensure that 13 MPI processes in total are launched across the 3 nodes in the cryo-gpu-v100 partition, with 5 processes on the first node and 4 on each of the remaining nodes. With the exception of the first node, which runs one extra process to coordinate MPI communication across all nodes, you want 4 MPI processes per node, since each node has 4 GPUs. Any more than 1 MPI process per GPU will result in poor performance as the GPUs will be oversubscribed.
If you wanted to run this on 2 nodes instead of 3, use --nodes=2 and --ntasks=9 as well as mpirun -n 9, as sketched below.
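As a sketch, these are the lines that would change relative to the 3-node script for a 2-node run (the job name is just an example; all other header lines and relion_refine_mpi arguments stay the same):
Code Block
#SBATCH --job-name=n2_bench
#SBATCH --nodes=2
#SBATCH --ntasks=9
# everything else in the header stays the same (--ntasks-per-node=5, --cpus-per-task=4, --gres=gpu:4)

# 1 coordinating process plus 4 workers (one per GPU) on the first node, 4 workers on the second
mpirun -display-allocation -display-map -v -n 9 relion_refine_mpi [same arguments as the 3-node script]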
With the -display-allocation and -display-map options to mpirun, the Slurm log output will show the allocation across the three nodes and confirm the correct number of MPI processes was launched.
...
...
Related articles
...