We use Sun Grid Engine here at WTCHG for managing our compute resources. Many of the analyses I’m doing are best run as array jobs, and that generally works very well, but sometimes one or more tasks will fail for one reason or another, and I’ve been casting around for best practice when it comes to (a) verifying which tasks succeeded and which failed, and (b) re-running failed tasks.
I found a nice post by Shiran Pasternak on resubmitting failed SGE array tasks; however, Shiran doesn’t say how he determined which tasks had failed, and the set of tasks to rerun is specified manually. I have thousands of tasks in each array job, so I really need an automated way of determining the success or failure of each task and rerunning those that failed.
I came up with the following pattern.
Say, for example, I want to run samtools flagstat over a set of several hundred BAM files. I create two scripts. The first script – flagstat.sh – just wraps the call to samtools flagstat:
#!/bin/bash
#
# This script generates summary statistics using samtools flagstat for
# a single sample.
#

# debug
set -x

# main executable
SAMTOOLS=/path/to/samtools

# assume first argument is sample ID
SAMPLE=$1

# path to BAM file
BAMFILE=/path/to/${SAMPLE}.bam

# assume second argument is location of output file
OUTFILE=$2

# do the work
$SAMTOOLS flagstat $BAMFILE > $OUTFILE
The second script – flagstat.job.sh – is an SGE job script:
#!/bin/bash
#
# Job script wrapper for flagstat.sh
#

# SGE options
#$ -S /bin/bash
#$ -N pf09_flagstat
#$ -m beas
#$ -M alimanfoo@googlemail.com
#$ -cwd
#$ -l vf=40M
#$ -l h_vmem=100M
#$ -l h_rt=1:59:0
#$ -t 1-428
#$ -o history
#$ -j y

# debug
set -x

# main script
MAIN=./flagstat.sh

# log file
LOG=log

# sample manifest - text file with one sample ID per line
MANIFEST=/path/to/samples.txt

# dereference task ID to sample ID
SAMPLE=`awk "NR==${SGE_TASK_ID}" ${MANIFEST}`

# expected location of output and MD5 verification files
OUTFILE=outputs/${SAMPLE}.flagstat
MD5FILE=${OUTFILE}.md5

# check if MD5 file already exists and matches output file
if [[ -f $OUTFILE && -f $MD5FILE && `md5sum ${OUTFILE} | cut -f1 -d" "` = `cat ${MD5FILE} | cut -f1 -d" "` ]]; then

    # task was previously run successfully, skip this time
    echo -e `date` "\t${JOB_ID}\t${SGE_TASK_ID}\t${SAMPLE}\tSKIP" >> $LOG
    exit 0

else

    # do the main work
    $MAIN $SAMPLE $OUTFILE

    # check exit status
    STATUS=$?
    if [[ $STATUS -eq 0 ]]; then
        # success, write MD5 verification file
        echo `md5sum $OUTFILE` > $MD5FILE
        echo -e `date` "\t${JOB_ID}\t${SGE_TASK_ID}\t${SAMPLE}\tOK" >> $LOG
    else
        echo -e `date` "\t${JOB_ID}\t${SGE_TASK_ID}\t${SAMPLE}\tFAIL\t${STATUS}" >> $LOG
    fi
    exit $STATUS

fi
The main idea in this job script is that, if the main executable completes successfully, a “verification file” is written, containing the MD5 hash of the task’s output file. Before running the main executable, the script checks three things: the output file exists, the verification file exists, and the MD5 hash in the verification file matches the MD5 sum of the output file. If all three hold, the script assumes the task was previously run successfully, and skips the task this time.
The point of all this is that a call to qsub flagstat.job.sh is effectively idempotent. I.e., if some tasks failed in a previous run, I can just call qsub flagstat.job.sh again, and it will automatically run only those tasks that failed previously.
This job script also writes some simple output to a log file, which is just a convenient file for me to scan visually to see whether any tasks failed and the job needs resubmitting – the same information could probably also be obtained via qacct.
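For example, something like this (the job ID here is made up) would pull the per-task exit statuses out of the accounting records:

qacct -j 1234567 | grep -E '^(taskid|exit_status)'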
This works for me, but if you have a more elegant solution I’d love to hear it.
Having used this pattern for a day, I’ve realised I don’t like it, for a couple of reasons. First, I still don’t have any way of verifying that everything has run successfully to completion. Scanning the log file isn’t good enough, because sometimes jobs fail before an error message is written to the log. I could write another script to do the verification, but that seems a bit redundant given that the job scripts are already checking whether a task previously ran successfully. Second, it hammers the queuing system unnecessarily: the whole array job gets resubmitted every time, so on re-runs where only a handful of tasks failed previously, hundreds of tasks are scheduled just to check their MD5 files and exit.
Here’s an alternative pattern, still based on using MD5 verification files to indicate a successfully completed task, but dealing with the two points above.
I have three scripts: mktodo.sh, qsub.sh and main.sh.
The first script – mktodo.sh – runs through the complete manifest of samples, checks to see if the analysis has previously been run successfully (using the MD5 file), and writes a TODO list of samples for which the analysis needs to be run.
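E.g., here’s a minimal sketch – the todo file name is arbitrary, and the paths and outputs/ layout are carried over from the flagstat example above:

#!/bin/bash
#
# Write a TODO list of samples whose analysis has not yet completed
# successfully.

# sample manifest - text file with one sample ID per line
MANIFEST=/path/to/samples.txt

# TODO list to write
TODO=todo

# start with an empty TODO list
> $TODO

# check each sample in the manifest
while read SAMPLE; do
    OUTFILE=outputs/${SAMPLE}.flagstat
    MD5FILE=${OUTFILE}.md5
    # if the output or MD5 file is missing, or the hashes don't match,
    # this sample still needs to be run
    if [[ -f $OUTFILE && -f $MD5FILE && `md5sum ${OUTFILE} | cut -f1 -d" "` = `cat ${MD5FILE} | cut -f1 -d" "` ]]; then
        : # previously run successfully, nothing to do
    else
        echo $SAMPLE >> $TODO
    fi
done < $MANIFEST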
The second script – qsub.sh – submits an array job for only those samples in the TODO list.
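E.g., a sketch assuming main.sh takes the TODO list as its first argument (as in the version below):

#!/bin/bash
#
# Submit an array job for only the samples in the TODO list,
# e.g.: ./qsub.sh todo

# TODO list is passed as the first argument
TODO=$1

# nothing to do? don't bother the queue
if [[ ! -s $TODO ]]; then
    echo "nothing to do"
    exit 0
fi

# one task per sample still to run
N=`wc -l < $TODO`

# submit the array job, passing the TODO list through to main.sh
qsub -t 1-${N} main.sh $TODO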
The third script – main.sh – does the actual work of a task.
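E.g., a sketch along the lines of the flagstat example – note the SGE options now live here, and the task ID is dereferenced against the TODO list rather than the full manifest:

#!/bin/bash
#
# Do the actual work for a single task.
#
# SGE options (resource requests as in the flagstat example)
#$ -S /bin/bash
#$ -cwd
#$ -o history
#$ -j y

# debug
set -x

# main executable
SAMTOOLS=/path/to/samtools

# TODO list is passed through from qsub.sh as the first argument
TODO=$1

# dereference task ID to sample ID via the TODO list
SAMPLE=`awk "NR==${SGE_TASK_ID}" ${TODO}`

# input and output files
BAMFILE=/path/to/${SAMPLE}.bam
OUTFILE=outputs/${SAMPLE}.flagstat
MD5FILE=${OUTFILE}.md5

# do the work
$SAMTOOLS flagstat $BAMFILE > $OUTFILE

# if the task succeeded, write the MD5 verification file
STATUS=$?
if [[ $STATUS -eq 0 ]]; then
    echo `md5sum $OUTFILE` > $MD5FILE
fi
exit $STATUS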
So I can run mktodo.sh and then inspect the todo file to see if there are any samples to run, followed by qsub.sh todo to run the analyses for just those samples.

A similar thing could also be done using a heredoc instead of the qsub.sh script, but I find bash scripts hard enough to read at the best of times, so I decided against writing a bash script to generate a bash script – I know I’d come back to it in a few months’ time and find it hard to decipher.
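For the record, the heredoc version might look something like this – the escaping needed to defer expansion from submission time to run time is exactly the sort of thing I’d find hard to decipher later:

#!/bin/bash

# generate and submit the array job script on the fly; the \$ and \`
# escapes stop expansion happening at submission time
qsub -t 1-`wc -l < todo` <<EOF
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
SAMPLE=\`awk "NR==\${SGE_TASK_ID}" todo\`
# ... do the work for \${SAMPLE} ...
EOF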
I just realised that md5sum has a --check option, so in the mktodo.sh script, instead of doing:
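if [[ -f $OUTFILE && -f $MD5FILE && `md5sum ${OUTFILE} | cut -f1 -d" "` = `cat ${MD5FILE} | cut -f1 -d" "` ]]; then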
…I can just do:
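if md5sum --status --check $MD5FILE; then

(The --status option suppresses md5sum’s per-file output, leaving just the exit code, which is all mktodo.sh needs; a missing or mismatching file comes back non-zero either way.)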
To make this work, I also need to do

md5sum $OUTFILE > $MD5FILE

in the main.sh script, instead of

echo `md5sum $OUTFILE` > $MD5FILE

– not sure why I was doing that in the first place; obviously the echo is unnecessary, and in this case it slightly mangles the output from md5sum (the two-space separator between hash and filename gets collapsed to a single space, which md5sum --check then can’t parse).

Comment:

Thanks for posting this. I’ve had the same problem, and oddly enough, this was the only relevant Google result on the issue. (One of the links you refer to is dead.)
Just thought you’d like to know someone out there appreciates this!