

...


Note

The purpose of this documentation is to describe how to run Relion calculations on SCU resources, including some general suggestions on how to achieve better performance. It is NOT intended to teach you how to run Relion; there is already an excellent Relion tutorial that serves this purpose.


...

General Environment Setup to Work with SCU Resources

...

Once this is done, you'll need to set up your environment to use Spack, our cluster-wide package manager. This is described here: Spack


Note that we now have a new Spack environment at /software/spack/centos7/share/spack/setup-env.sh. Source that file and do not use the older one.
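
As a minimal sketch of sourcing the new Spack environment, the following checks that the setup file is readable before sourcing it. The variable name SPACK_SETUP is only an illustration, and where you put these lines (for example in your ~/.bashrc) is an assumption, not a site requirement:

```shell
#!/bin/bash
# Path to the new Spack environment, as given in the note above.
SPACK_SETUP=/software/spack/centos7/share/spack/setup-env.sh

# Source the file only if it is readable, so a missing mount
# produces a clear message instead of a confusing error.
if [ -r "$SPACK_SETUP" ]; then
    source "$SPACK_SETUP"
else
    echo "Spack setup script not found at $SPACK_SETUP" >&2
fi
```

If the file is sourced successfully, the spack command should become available in your shell.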


...

Relion-specific Environment Setup

...

Code Block
export RELION_QSUB_EXTRA_COUNT=2
export RELION_QSUB_EXTRA1="Number of nodes:"
export RELION_QSUB_EXTRA1_DEFAULT="1"
export RELION_QSUB_EXTRA2="Number of GPUs:"
export RELION_QSUB_EXTRA2_DEFAULT="0"
export RELION_CTFFIND_EXECUTABLE=/softlib/apps/EL7/ctffind/ctffind-4.1.8/bin/ctffind


After editing your ~/.bashrc, log out of curie.
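
Once you are logged in again, a quick check along the following lines can confirm that the variables from your ~/.bashrc are visible to new shells. This is only a sketch; the function name check_relion_env is hypothetical and not part of Relion:

```shell
#!/bin/bash
# Report which of the Relion-related variables are exported in this shell.
# Returns the number of missing variables (0 means all are set).
check_relion_env() {
    missing=0
    for var in RELION_QSUB_EXTRA_COUNT \
               RELION_QSUB_EXTRA1 RELION_QSUB_EXTRA1_DEFAULT \
               RELION_QSUB_EXTRA2 RELION_QSUB_EXTRA2_DEFAULT \
               RELION_CTFFIND_EXECUTABLE; do
        # printenv exits non-zero when the variable is not exported
        if value=$(printenv "$var"); then
            echo "$var=$value"
        else
            echo "$var is not set"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_relion_env || echo "some Relion variables are missing"
```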

...

Info

If you plan on running calculations on a desktop/workstation (i.e. a system not allocated by Slurm), then you'll need to log in there (make sure you use -Y in each of your ssh commands).


For cluster/reserved nodes only: to request an interactive session, use this command (for more information, see: Using Slurm):

Code Block
srun -n1 --pty --x11 --partition=cryo-cpu --mem=8G bash -l

Seeing which Relion versions are available

To see what's available, use this command (for more information on spack command, see Spack):

...

Here is output from the above command that is current as of 11/11/19:

Code Block
[root@node175 ~]# spack find -l -v relion cuda_arch=60
==> 3 installed packages
-- linux-centos7-broadwell / gcc@8.2.0 --------------------------
citbw3p relion@3.0.7 build_type=RelWithDebInfo +cuda cuda_arch=60 +double~double-gpu+gui purpose=cluster
qgb6abl relion@3.1_beta build_type=RelWithDebInfo +cuda cuda_arch=60 +double~double-gpu+gui purpose=cluster
iyusobn relion@3.0.8 build_type=RelWithDebInfo +cuda cuda_arch=60 +double~double-gpu+gui purpose=cluster


[root@node175 ~]# spack find -l -v relion cuda_arch=70
==> 3 installed packages
-- linux-centos7-skylake_avx512 / gcc@8.2.0 ---------------------
gzhr4k3 relion@3.0.7 build_type=RelWithDebInfo +cuda cuda_arch=70 +double~double-gpu+gui purpose=cluster
5dknaqs relion@3.1_beta build_type=RelWithDebInfo +cuda cuda_arch=70 +double~double-gpu+gui purpose=cluster
ar4poio relion@3.0.8 build_type=RelWithDebInfo +cuda cuda_arch=70 +double~double-gpu+gui purpose=cluster


[root@node175 ~]# spack find -l -v relion~cuda
==> 3 installed packages
-- linux-centos7-broadwell / gcc@8.2.0 --------------------------
xy5wtb3 relion@3.0.7 build_type=RelWithDebInfo ~cuda cuda_arch=none +double~double-gpu+gui purpose=cluster
7juireg relion@3.1_beta build_type=RelWithDebInfo ~cuda cuda_arch=none +double~double-gpu+gui purpose=cluster
bytbti4 relion@3.0.8 build_type=RelWithDebInfo ~cuda cuda_arch=none +double~double-gpu+gui purpose=cluster


There's a lot in the above output, so let's break it down!

First, the output above lists several production-ready Relion versions:

  • relion@3.0.7
  • relion@3.0.8
  • relion@3.1_beta

In addition, these versions of Relion can run on the following platforms:

  • purpose=desktop # used on workstations
  • purpose=cluster   # used on our Slurm-allocated cluster

Some of these Relion installations are intended for use on nodes/workstations with GPUs, whereas others are intended for CPU-only nodes/workstations:

  • +cuda # Relion installation that supports GPU use
  • ~cuda  # Relion installation that does not support GPU use

...

  • cuda_arch=60  # For use on nodes with Nvidia P100s - cryo-gpu-v100 Slurm partition
  • cuda_arch=70  # For use on nodes with Nvidia V100s - cryo-gpu Slurm partition
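
The variants above combine into a single Spack spec: a version, then either ~cuda for CPU-only use or a cuda_arch value for the GPU type. The following sketch shows one way to assemble such a spec. The function name relion_spec and the labels none, p100, and v100 are hypothetical conveniences, not part of Spack or Relion:

```shell
#!/bin/bash
# Build a Spack spec string for Relion from a version and a GPU label.
#   none -> CPU-only build (~cuda)
#   p100 -> cuda_arch=60
#   v100 -> cuda_arch=70
relion_spec() {
    local version=$1
    local gpu=$2
    case "$gpu" in
        none) echo "relion@${version}~cuda" ;;
        p100) echo "relion@${version} cuda_arch=60" ;;
        v100) echo "relion@${version} cuda_arch=70" ;;
        *)    echo "unknown GPU type: $gpu" >&2; return 1 ;;
    esac
}

relion_spec 3.0.8 none
relion_spec 3.0.7 p100
relion_spec 3.0.8 v100
```

The resulting string could then be passed to Spack, for example with spack load -r $(relion_spec 3.0.8 v100), assuming a matching installation exists.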

What Relion version should I use?!?

Which version of Relion you use (3.0.7, 3.0.8, or 3.1_beta) is more of a scientific question than a technical one. In general, however, most users seem to be using 3.0_beta, unless they have legacy projects that were started with an older version (consult your PI if you have questions about this).


Once a specific version of Relion is selected, choosing the other installation parameters is straightforward:


To load Relion version 3.0.8 on the cluster (or reserved nodes) for CPU-only nodes (i.e. no GPUs):

Code Block
spack load -r relion@3.0.8~cuda


Info

This is the command you want if you wish to launch the Relion GUI. Don't worry that this version doesn't use GPUs: the GUI is simply used for creating the job and either submitting it directly to Slurm or writing out a submission script. This Relion installation is only for running the GUI or for running jobs on CPU-only nodes.

...

Code Block
spack load -r relion@3.0.7 cuda_arch=60


To run Relion 3.0.8 on the cluster (or reserved nodes) with GPUs (V100s):

Code Block
spack load -r relion@3.0.8 cuda_arch=70

To run relion3.0_beta on a desktop/workstation with GPUs (P100s):

...


Note that you can also load Relion by its Spack hash. For example, "spack load -r /5dknaqs" loads a Relion installation for V100s.

...

Launching the Relion GUI


Info

Note for cluster/reserved nodes only: this assumes you are already in an interactive session (as described in the previous section). If not, request an interactive session!


If you haven't already, load relion@3.1_beta with this command:

Code Block
spack load -r relion@3.1_beta~cuda purpose=cluster


Info

This is the command you want if you wish to launch the Relion GUI. Don't worry that this version doesn't use GPUs: the GUI is simply used for creating the job and either submitting it directly to Slurm or writing out a submission script. This Relion installation is only for running the GUI or for running jobs on CPU-only nodes.


Next, change to the directory where you wish to keep your Relion files for a given project and execute this command:

Code Block
relion

This should launch the Relion GUI.

Info

Relion generates a lot of subdirectories and metadata; to keep everything organized, we recommend that each Relion project (i.e. analysis workflow for a given set of data) is given its own directory. 


...

Submitting Jobs from the GUI:

Should I run jobs locally or submit jobs to the Slurm queue?

Info

If you are running the GUI on a desktop/workstation, load the appropriate version of Relion, as described in the previous section.


...

If you are RUNNING jobs on a workstation (i.e. actually using your workstation to PERFORM the calculation, and not just submitting from the workstation) AND that workstation is not managed by Slurm, just run all jobs locally.


Some jobs should just be run from the interactive session, and NOT submitted to the Slurm queue. These jobs are generally lightweight in terms of computational demand and/or require the Relion GUI. Here are a few examples of jobs that should be run from the GUI (and not submitted to the Slurm queue):

  • Import
  • Manual picking
  • Subset selection
  • Join star files


Here are some jobs that should never be run directly (i.e. these jobs should always be submitted to the Slurm queue, unless running on a local workstation):

  • Motion correction
  • CTF estimation
  • Auto-picking
  • Particle Sorting
  • 2D/3D classification
  • 3D initial model
  • 3D auto-refine
  • 3D multi-body


In general, if the job takes a long time to run or requires a lot of compute resources, it should be submitted to the Slurm queue (or run on a local workstation). 

How to run jobs locally?

This is very straightforward. Set up your calculation in the Relion GUI (see the Relion tutorial for calculation-specific settings). Under the "Running" tab for a given calculation, make sure the option "Submit to queue?" is set to "No". Otherwise, follow the procedure described in the Relion tutorial.

How to submit jobs to the Slurm queue?

Set up your calculation in the Relion GUI (see the Relion Tutorial for calculation-specific settings).

  1. Under the "Running" tab for a given calculation, set the option "Submit to queue?" to "Yes". 
  2. You'll need to set "Queue name:" to the Slurm partition you wish to use:
    1. If you want to only run on CPUs, then set this to "cryo-cpu"
    2. If you want to run on the shared cryoEM GPUs, then set this to "cryo-gpu"
    3. If you want to run on a lab-reserved node, then set this to the appropriate partition (e.g. "blanchard_reserve")
  3. Set "Queue submit command" to "sbatch"
  4. Set the number of nodes/GPUs to the desired values
  5. Relion needs a template Slurm submission script in order to submit jobs. Set "Standard submission script" to the path of the template script you need. A set of template submission scripts, which are no longer maintained, can be found here: 

    Code Block
    /softlib/apps/EL7/slurm_relion_submit_templates # select the template script that is appropriate for your job


  6. Give the "Current job" an alias if desired and click "Run!" (this assumes everything else is set). Checking job status (and other Slurm functionality) is described here and here.


Single node jobs:


Users commonly submit a single job to one node. An example 3D auto-refine script is shown below:



Code Block
#!/bin/bash
#SBATCH --job-name=n1_bench
#SBATCH -p cryo-gpu-v100
#SBATCH --mem=170g
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4

cd /athena/scu/scratch/dod2014/relion_2

source /software/spack/centos7/share/spack/setup-env.sh

# 3.1_beta skylake openmpi 4.0.1
spack load -r /ii7uzb5

# 3.0.8 w openmpi 4 + slurm
#spack load -r /sfp6sf5

# 3.1_beta w openmpi 4 + slurm
#spack load -r /ii7uzb5

# define the user name before it is used in the output path below
user=$SLURM_JOB_USER
mkdir -pv Refine3D/quackmaster/run_single_node/${user}_${SLURM_JOB_ID}

mpirun  -display-allocation -display-map -v -np 5 relion_refine_mpi --o Refine3D/quackmaster/run_single_node/${user}_${SLURM_JOB_ID} --split_random_halves --i Select/job088/particles.star --ref 012218_CS_256.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --scratch_dir /scratchLocal --pad 2  --ctf --particle_diameter 175 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 4 --gpu --pool 100 --auto_refine

# report whether relion crashes
# remove particles from scratch
# if it crashes
if [ $? -eq 0 ]
  then
     echo -e "$SLURM_JOB_ID exited successfully"
  else
     user=$SLURM_JOB_USER
     MY_TMP_DIR=/scratchLocal/${user}_${SLURM_JOB_ID}
     echo -e "$SLURM_JOB_ID failed so cleaning up $MY_TMP_DIR"
     rm -rf $MY_TMP_DIR
fi


Multi node jobs:


If you want your job to finish sooner, you can run a single job on multiple nodes at once. Through the magic of OpenMPI and low-latency RDMA networking, each iteration, and therefore the entire job, will finish sooner than it would on a single node.


To achieve this, change the Slurm submission options and the Relion command-line arguments. For example, the single-node script above can be submitted to 3 nodes with the following changes:



Code Block
#!/bin/bash
#SBATCH --job-name=n3_bench
#SBATCH -p cryo-gpu-v100
#SBATCH --mem=170g
#SBATCH --nodes=3
#SBATCH --ntasks=13
#SBATCH --ntasks-per-node=5
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4

mpirun -display-allocation -display-map -v -n 13 relion_refine_mpi --o Refine3D/quackmaster/run_multi_node/${user}_${SLURM_JOB_ID} --split_random_halves --i Select/job088/particles.star --ref 012218_CS_256.mrc --firstiter_cc --ini_high 30 --dont_combine_weights_via_disc --scratch_dir /scratchLocal --pad 2  --ctf --particle_diameter 175 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 6 --gpu --pool 100 --auto_refine


In the above, note the --ntasks=13 and mpirun -n 13 options.


These options ensure that 13 MPI processes in total are launched across the 3 nodes in the cryo-gpu-v100 partition, with 5 processes on the first node and 4 on each of the remaining nodes. With the exception of the first node, which also coordinates MPI communication across all nodes, you want 4 MPI processes per node because each node has 4 GPUs. More than 4 MPI processes per node will oversubscribe the GPUs and result in poor performance.


If you wanted to run this on 2 nodes instead of 3, use --ntasks=9 as well as mpirun -n 9.
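
The task counts used here (5, 9, and 13) follow one pattern: 4 MPI processes per node plus one coordinating process. A minimal sketch of that arithmetic, assuming 4 GPUs per node unless told otherwise; the function name relion_ntasks is hypothetical and not part of Relion or Slurm:

```shell
#!/bin/bash
# Compute the MPI task count for a GPU job:
# one process per GPU on every node, plus one coordinating process.
relion_ntasks() {
    local nodes=$1
    local gpus_per_node=${2:-4}   # assumption: 4 GPUs per node by default
    echo $(( nodes * gpus_per_node + 1 ))
}

relion_ntasks 1   # single-node job
relion_ntasks 2   # two-node job
relion_ntasks 3   # three-node job
```

The same value would be used for both --ntasks and mpirun -n.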


In the Slurm log output we can see the allocation of three nodes and the correct number of MPI processes:


Code Block
mkdir: created directory ‘Refine3D/quackmaster/run_multi_node/_1349014’

====================== ALLOCATED NODES ======================
cantley-node01: flags=0x11 slots=5 max_slots=0 slots_inuse=0 state=UP
cantley-node02: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
node183: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
Data for JOB [8540,1] offset 0 Total slots allocated 13

======================== JOB MAP ========================

Data for node: cantley-node01 Num slots: 5 Max slots: 0 Num procs: 5
Process OMPI jobid: [8540,1] App: 0 Process rank: 0 Bound: UNBOUND
Process OMPI jobid: [8540,1] App: 0 Process rank: 1 Bound: UNBOUND
Process OMPI jobid: [8540,1] App: 0 Process rank: 2 Bound: UNBOUND
Process OMPI jobid: [8540,1] App: 0 Process rank: 3 Bound: UNBOUND
Process OMPI jobid: [8540,1] App: 0 Process rank: 4 Bound: UNBOUND

Data for node: cantley-node02 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [8540,1] App: 0 Process rank: 5 Bound: N/A
Process OMPI jobid: [8540,1] App: 0 Process rank: 6 Bound: N/A
Process OMPI jobid: [8540,1] App: 0 Process rank: 7 Bound: N/A
Process OMPI jobid: [8540,1] App: 0 Process rank: 8 Bound: N/A

Data for node: node183 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [8540,1] App: 0 Process rank: 9 Bound: N/A
Process OMPI jobid: [8540,1] App: 0 Process rank: 10 Bound: N/A
Process OMPI jobid: [8540,1] App: 0 Process rank: 11 Bound: N/A
Process OMPI jobid: [8540,1] App: 0 Process rank: 12 Bound: N/A

=============================================================



Cleaning up /scratch:


Relion can either load particles into RAM before processing, using --preread_images, or copy them to local scratch, using --scratch_dir (see the Relion documentation for details). If you copy particles to local scratch and Relion crashes, these files will not be deleted automatically. Slurm will clear out old data after a period of several days, but you should do this yourself by adding the following bash code at the end of your Slurm submission script, directly after the Relion command:


Code Block
# Place this block directly after the mpirun command,
# so that $? holds Relion's exit status.
# Report whether Relion crashed and, if it did,
# remove its particles from local scratch.
if [ $? -eq 0 ]
  then
     echo -e "$SLURM_JOB_ID exited successfully"
  else
     user=$SLURM_JOB_USER
     MY_TMP_DIR=/scratchLocal/${user}_${SLURM_JOB_ID}
     echo -e "$SLURM_JOB_ID failed so cleaning up $MY_TMP_DIR"
     rm -rf "$MY_TMP_DIR"
fi
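
The same cleanup logic can be wrapped in a small function that receives the exit status explicitly, which avoids relying on $? still holding Relion's status when other commands sit in between. This is a sketch of an alternative arrangement, not the form used in the site templates. The name cleanup_scratch is hypothetical, and the demonstration uses a temporary directory and the false command in place of a real Relion run:

```shell
#!/bin/bash
# Remove a scratch directory only when the given exit status is non-zero.
cleanup_scratch() {
    local status=$1
    local dir=$2
    if [ "$status" -eq 0 ]; then
        echo "job exited successfully"
    else
        echo "job failed (exit $status), cleaning up $dir"
        rm -rf -- "$dir"
    fi
}

# Demonstration with a stand-in for the Relion command.
demo_dir=$(mktemp -d)
status=0
false || status=$?          # 'false' simulates a crashed job
cleanup_scratch "$status" "$demo_dir"
```

In a real submission script the status would be captured right after the mpirun line, and the directory would be the per-job scratch path shown above.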



...
