Most bioinformatic analyses on large datasets involve at least one step that takes a long time to run. Long-running commands pose special problems. In particular, we would like to:
See progress as the command runs.
Allow the command to run in the background so that we can keep working while it runs.
Allow the command to keep running even if we log out of the server.
Unfortunately, if you simply run a long-running command using the run button in an RMarkdown document, you won’t achieve any of the goals listed above.
Unless your command runs very quickly, you will need to package it up into a slurm script and run it using the slurm queuing system on the RStudio server.
Here’s how to do it.
Let’s assume we want to run the following command using a slurm script:
echo "Hello SLURM"
First, create a file for your script. You can do this within RStudio by choosing File -> New File -> Shell Script. Give your file a name that reflects the purpose of your script. In this case, hello.sh would be appropriate.
Now open your script file using the RStudio code editor (click the file in the Files browser) and enter the following text:
#!/bin/bash
# Time limit for the job, in minutes
#SBATCH --time=60
# Number of tasks (CPUs) and amount of memory for the job
#SBATCH --ntasks=2 --mem=4gb
echo "Hello SLURM"
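Because bash treats the #SBATCH lines as ordinary comments, one way to catch typos before submitting is to run the script directly. The sketch below writes hello.sh from the Terminal (an alternative to using the editor, not a required step) and runs it once with bash. This is only sensible for commands that finish quickly.

```shell
# Write hello.sh from the Terminal instead of the editor
cat > hello.sh <<'EOF'
#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=2 --mem=4gb
echo "Hello SLURM"
EOF

# bash ignores the #SBATCH comment lines, so this runs the script
# straight away without going through the queue
bash hello.sh
```

If the script prints Hello SLURM here, any later failure is more likely to come from the resource requests than from the commands themselves.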
Now you are ready to run your script. Open a Terminal window and enter the following command:
sbatch hello.sh
You should see a response something like this:
Submitted batch job XX
Here XX is a number: your job number. You should also see a file called slurm-XX.out appear in your project (again, XX is your job number).
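Since the output file name is built from the job number, it can be handy to pull the number out of the sbatch response. This is a minimal sketch using a made-up response line (job number 12345 is hypothetical) so that it runs anywhere; on the server you would capture the real line with response=$(sbatch hello.sh).

```shell
# Hypothetical response line; on the server use: response=$(sbatch hello.sh)
response="Submitted batch job 12345"

# Keep only the last word of the response, which is the job number
job_id="${response##* }"

# The output file name follows the slurm-XX.out pattern
out_file="slurm-${job_id}.out"
echo "Job ${job_id} writes its output to ${out_file}"
```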
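You can check the progress of your job in a few ways:
Open the file slurm-XX.out in the RStudio editor to see the output written so far.
Run tail slurm-XX.out in the Terminal to see the last few lines of output.
Run squeue in the Terminal to see the jobs currently in the queue.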
If your job is finished it will disappear from the queue, so one way to definitively check for job completion is to run squeue and check whether your job is there. Note that during busy times there might be several jobs from other users in the queue. You should be able to tell which job is yours because it will be marked with your user name.
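To avoid scanning a busy queue by eye, squeue can be asked for just your own jobs with its -u option. A small sketch follows; the guard is only there so that the snippet does nothing harmful on a machine without slurm, and the variable name me is just for illustration.

```shell
# Your login name, used to filter the queue
me="$(id -un)"

if command -v squeue >/dev/null 2>&1; then
    # List only the jobs that belong to you
    squeue -u "$me"
else
    echo "squeue not found; run this in the Terminal on the RStudio server"
fi
```

If only the header row is printed, none of your jobs are still waiting or running.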
Sometimes your slurm script will not run, or it might crash before it is finished. Some common reasons are:
The resources requested on the following line of your script do not suit the job.
#SBATCH --ntasks=2 --mem=4gb
Here ntasks should be set to the number of CPUs that the job will use. If you are in doubt, leave this value at 2. At some points in the guide you will be instructed to set this to a certain value. mem indicates the amount of memory required. Again, follow the guide here. If you need more than 4gb of memory, the guide will tell you what is needed.
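One way to keep the CPU count consistent is to read it from the SLURM_NTASKS environment variable, which slurm sets inside a running job when --ntasks is given, rather than typing the number twice. This is a sketch that assumes the tool you are running accepts a thread count; the variable name threads is only for illustration.

```shell
#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=2 --mem=4gb

# Inside a job slurm sets SLURM_NTASKS to the --ntasks value;
# fall back to 2 when the script is run outside slurm
threads="${SLURM_NTASKS:-2}"
echo "Using ${threads} CPUs"
```

You would then pass the value of threads to whichever option your tool uses for its CPU or thread count.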
Your script uses a command such as qiime, which is not a normal command but an alias. If you enter qiime within a slurm script you might see an error like this:
slurm_script: line 5: qiime: command not found
This can be fixed by adding the alias to your script. So for qiime commands you need to add the following code to your script before the qiime command:
shopt -s expand_aliases
alias qiime='apptainer run -B /pvol/:/pvol /pvol/data/sif/qiime.sif qiime'
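Putting the pieces together, a complete script for a qiime command might look like the sketch below. The qiime --help call is only a placeholder for whatever qiime command the guide asks you to run, and the apptainer check is an added safeguard (not part of the guide) so that the script fails with a clear message.

```shell
#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=2 --mem=4gb

# Scripts do not expand aliases unless this option is switched on
shopt -s expand_aliases
# Define the alias before the first line that uses qiime;
# the single quotes keep the command as text until qiime is called
alias qiime='apptainer run -B /pvol/:/pvol /pvol/data/sif/qiime.sif qiime'

if command -v apptainer >/dev/null 2>&1; then
    # Placeholder command; replace with the qiime command you need
    qiime --help
else
    echo "apptainer not found; this script is meant for the RStudio server"
fi
```

Note that the #SBATCH lines stay at the top of the script, before any commands, so that slurm still reads them.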