cryoEM Cluster Troubleshooting Workflow

To all users: when you encounter an issue using the cluster, please work through the following steps before contacting SCU. If you do not indicate that you have tried these steps, your inquiry might not be answered.

 

  1. Define what type of issue you are having.

    1. Execution issue: the job is either unable to start or aborts at a specific step; it usually produces an error message.

    2. Performance issue: a job is taking a long time to start, running abnormally slowly compared to what you are used to, or ending prematurely due to time limits.

  2. Nail down the source of the issue.

    1. It is helpful to self-diagnose whether this is due to user error, an issue with an individual node, or a cluster-wide issue. Carefully read the error message in the Slurm output and identify possible causes (cannot find file xxx, out of memory, job was canceled, …).

    2. Triple-check your inputs: syntax, input file formats, scripts, MPI processes, threads, nodes, memory, etc.

    3. Check the status of the node you have used – if a node is down or unable to start new jobs, please report it to SCU.

    4. Test a comparable job that previously worked, both on the same node and on other nodes; if it runs correctly, the problem is most likely with your job inputs or settings – check that everything was done correctly in your processing workflow.


To check which node your job was submitted to (replace slurmID with your job ID):

sacct -j slurmID --format=JobID,Partition,NodeList

To check the status of all the nodes in the cluster: 

sinfo --long --Node
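
A few additional commands are useful for these checks; the file and node names below are placeholders, so substitute your own. To view the last lines of your Slurm output file (this assumes Slurm's default slurm-slurmID.out naming – adjust the filename to match your submission script):

tail -n 50 slurm-slurmID.out

To see detailed information about a single node, including its state and available memory:

scontrol show node nodename

To resubmit a test job to a specific node:

sbatch --nodelist=nodename your_job_script.sh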

 

  3. Execution Issues: 

    1. Locate your Slurm output file and check for error messages (particularly useful for RELION but also applies to cryoSPARC). Common errors are: 

      1. MPI abort – try a different node/partition; if this works, please report it to SCU (see 'If all else fails, contact SCU' below), as it could mean that some software/libraries/settings have been applied inconsistently.

      2. Out of memory – reduce the number of MPI processes/threads (a submission-script sketch showing where these are set follows this list).

      3. Illegal instruction – something is probably wrong with your submission script.

    2. If these don't work, Google a portion of the error message – many issues are not unique and have already been encountered by someone else, who has posted about them on various forums. Even if this doesn't provide a solution, it can at least give a sense of what the issue is, and you might find information that helps future troubleshooting with fellow users or SCU. For RELION, the entire CCP-EM mailing list is archived and searchable. cryoSPARC has its own, very useful user forum where errors are reported and solutions are discussed. 
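
As a reference for the memory and MPI/thread settings mentioned above, here is a minimal sketch of a Slurm submission-script header. The partition name, resource values, and the final command line are illustrative assumptions – adjust them to your cluster and your job:

#!/bin/bash
#SBATCH --job-name=refine_test         # name shown by squeue/sacct
#SBATCH --partition=your_partition     # placeholder – use your cluster's partition
#SBATCH --ntasks=5                     # number of MPI processes
#SBATCH --cpus-per-task=4              # threads per MPI process
#SBATCH --mem=64G                      # memory per node; raise this for out-of-memory errors
#SBATCH --time=24:00:00                # wall-time limit
#SBATCH --output=slurm-%j.out          # Slurm output file (%j = job ID)

# placeholder command – replace with your actual RELION/cryoSPARC invocation
srun relion_refine_mpi --j $SLURM_CPUS_PER_TASK ...

If a job dies with an out-of-memory error, the usual fix is to lower --ntasks/--cpus-per-task or raise --mem within the limits of the node.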

  4. Performance Issues: 

    1. Check whether the scratch space is full – RELION and cryoSPARC Slurm outputs provide this information (a disk-usage command is shown after this list). If scratch is full, please report it to SCU. 

    2. Make sure you are submitting reasonable jobs – micrographs and particles should be appropriately binned. 

    3. Test the problematic job on different partitions and nodes (the sacct example after this list can help compare run times and memory use across nodes). If the problem is node-dependent, please report this to SCU!
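
Two generic commands can help diagnose performance problems; the scratch path below is a placeholder (it is site-specific), and slurmID should be replaced with your job ID. The first shows how full a filesystem is; the second reports elapsed run time, final state, and peak memory use (MaxRSS) for a completed job, which makes it easy to compare the same job across nodes:

df -h /path/to/scratch

sacct -j slurmID --format=JobID,Elapsed,State,MaxRSS,NodeList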

  5. Ask your colleagues!

    1. Having a fellow lab member check whether they are experiencing comparable problems is also helpful, especially with cryoSPARC.

    2. We have an active WCMC Cryo-EM Slack forum where expert cluster users are happy to help – again, many cluster-specific errors have likely been navigated by someone else. There is also a lot of expertise there on software-specific issues. 

    3. If an issue is not due to user error, other colleagues likely have the same problem, so please report it on Slack – identifying whether and how other users are experiencing the same issues helps us converge on the source of the problem and lets SCU fix it much faster. 


  6. If all else fails, contact SCU. 

    1. E-mail scu@med.cornell.edu

  7. When you email SCU, confirm that you have gone through this workflow and include the following information (when applicable): 

    1. The ‘type’ of issue (Execution or Performance).

    2. The Slurm job IDs and the node(s) on which the error occurred. 

    3. The parameters used for specific test jobs.

    4. The absolute path to the Slurm output containing the error message, e.g. /athena/YOURlab/scratch/CWID/…/…/… (a command for printing this path is given after this list).

    5. Whether other people have reported similar issues.

    6. Screenshots, if applicable, as attachments within the email.
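
To get the absolute path of a Slurm output file for the email, one option is to run the following from the job's working directory (this assumes the default slurm-slurmID.out naming – adjust the filename if your submission script writes output elsewhere):

readlink -f slurm-slurmID.out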