Handling Failures and Rerunning Tasks in Sun Grid Engine Array Jobs

Posted: 11 April 2012 by Alistair Miles in Software

We use Sun Grid Engine here at WTCHG for managing our compute resources. Many of the analyses I’m doing are best run as array jobs, which generally works very well, but sometimes one or more tasks will fail for one reason or another, and I’ve been casting around for best practice when it comes to (a) verifying which tasks succeeded and which failed, and (b) re-running failed tasks.

I found a nice post by Shiran Pasternak on resubmitting failed SGE array tasks. However, Shiran doesn’t say how he determined which tasks had failed, and the set of tasks to rerun is specified manually. I have thousands of tasks in each array job, so I really need an automated way of determining the success or failure of each task and rerunning those that failed.

I came up with the following pattern.

Say, for example, I want to run samtools flagstat over a set of several hundred BAM files. I create two scripts. The first script – flagstat.sh – just wraps the call to samtools flagstat:

#!/bin/bash

#
# This script generates summary statistics using samtools flagstat for
# a single sample.
#

# debug
set -x

# main executable
SAMTOOLS=/path/to/samtools

# assume first argument is sample ID
SAMPLE=$1

# path to BAM file
BAMFILE=/path/to/${SAMPLE}.bam

# assume second argument is location of output file
OUTFILE=$2

# do the work
$SAMTOOLS flagstat $BAMFILE > $OUTFILE

The second script – flagstat.job.sh – is an SGE job script:

#!/bin/bash

# 
# Job script wrapper for flagstat.sh
#

# SGE options
#$ -S /bin/bash
#$ -N pf09_flagstat
#$ -m beas
#$ -M alimanfoo@googlemail.com
#$ -cwd
#$ -l vf=40M
#$ -l h_vmem=100M
#$ -l h_rt=1:59:0
#$ -t 1-428
#$ -o history
#$ -j y

# debug
set -x

# main script
MAIN=./flagstat.sh

# log file
LOG=log

# sample manifest - text file with one sample ID per line
MANIFEST=/path/to/samples.txt

# dereference task ID to sample ID
SAMPLE=`awk "NR==${SGE_TASK_ID}" ${MANIFEST}`

# expected location of output and MD5 verification files
OUTFILE=outputs/${SAMPLE}.flagstat
MD5FILE=${OUTFILE}.md5

# check if MD5 file already exists and matches output file                                                       
if [[ -f $OUTFILE && -f $MD5FILE && `md5sum ${OUTFILE} | cut -f1 -d" "` = `cat ${MD5FILE} | cut -f1 -d" "` ]]; then

    # task was previously run successfully, skip this time
    echo -e `date` "\t${JOB_ID}\t${SGE_TASK_ID}\t${SAMPLE}\tSKIP" >> $LOG
    exit 0

else

    # do the main work
    $MAIN $SAMPLE $OUTFILE

    # check exit status
    STATUS=$?
    if [[ $STATUS -eq 0 ]]; then

        # success, write MD5 verification file
        echo `md5sum $OUTFILE` > $MD5FILE
        echo -e `date` "\t${JOB_ID}\t${SGE_TASK_ID}\t${SAMPLE}\tOK" >> $LOG

    else

        echo -e `date` "\t${JOB_ID}\t${SGE_TASK_ID}\t${SAMPLE}\tFAIL\t${STATUS}" >> $LOG

    fi

    exit $STATUS

fi

The main idea in this job script is that, if the main executable completes successfully, a “verification file” will be written, containing the MD5 hash of the task’s output file. Before running the main executable, the script checks whether the output file already exists, and the verification file already exists, and the MD5 hash it contains matches the MD5 sum of the output file – if so then the script assumes that the task was previously run successfully, and skips the task this time.
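As a minimal sketch, the check described above could be pulled out into a small function. The name task_already_done is hypothetical and not part of the job script; the logic is the same comparison of the stored hash against a freshly computed one.

```shell
#!/bin/bash

# Hypothetical helper (not in the original job script): succeeds only if
# the output file and its MD5 verification file both exist and the stored
# hash matches a freshly computed hash of the output file.
task_already_done() {
    outfile=$1
    md5file=$2

    # both files must exist
    [ -f "$outfile" ] && [ -f "$md5file" ] || return 1

    # the hash is the first space-delimited field in both cases
    expected=$(cut -f1 -d" " "$md5file")
    actual=$(md5sum "$outfile" | cut -f1 -d" ")

    [ "$actual" = "$expected" ]
}
```

In the job script this would be used as: if task_already_done "$OUTFILE" "$MD5FILE"; then log SKIP and exit 0; otherwise run the main work.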

The point of all this is that a call to qsub flagstat.job.sh is effectively idempotent. I.e., if some tasks failed in a previous run, I can just call qsub flagstat.job.sh again, and it will automatically run only those tasks that failed previously.

This job script also writes some simple output to a log file, which is just a convenient file for me to scan visually to see whether any tasks failed and I need to resubmit the job. The same information could probably also be obtained via qacct.
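Instead of scanning by eye, the log can be filtered with awk. This is only a sketch: the function name failed_samples is made up, and it assumes the log lines have exactly the tab-separated layout written by the job script above (date, job ID, task ID, sample ID, status, and an exit code on failures). It reports a sample only if its most recent entry is FAIL, so a sample that failed once and succeeded on a rerun is not listed. A task that died before writing its log line will not show up at all.

```shell
#!/bin/bash

# Hypothetical helper: print the IDs of samples whose latest log entry is
# FAIL. Assumed field layout (tab-separated), as written by the job script:
#   1: date   2: job ID   3: task ID   4: sample ID   5: status   6: exit code
failed_samples() {
    awk -F'\t' '
        # remember the most recent status seen for each sample
        { last[$4] = $5 }
        END {
            for (sample in last)
                if (last[sample] == "FAIL")
                    print sample
        }
    ' "$1" | sort
}
```

Calling failed_samples log prints one sample ID per line, sorted, or nothing if no sample's latest entry is a failure.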

This works for me but if you have a more elegant solution I’d love to hear it.

Comments
  1. Having used this pattern for a day, I’ve realised I don’t like it, for a couple of reasons. First, I still don’t have any way of verifying that everything has run successfully to completion. Scanning the log file isn’t good enough, because sometimes jobs fail before an error message is written to the log. I could write another script to do the verification, but that seems a bit redundant given that the job scripts are also checking if a task previously ran successfully. Second, it hammers the queuing system unnecessarily, especially on re-runs where only a handful of tasks failed previously and need to be re-run.

    Here’s an alternative pattern, still based on using MD5 verification files to indicate a successfully completed task, but dealing with the two points above.

    I have three scripts: mktodo.sh, qsub.sh and main.sh.

    The first script – mktodo.sh – runs through the complete manifest of samples and checks to see if the analysis has been previously run successfully (using the MD5 file), and writes a TODO list of samples for which the analysis needs to be run, e.g.:

    #!/bin/bash
    
    # 
    # Script to construct a TODO list of samples for which this analysis
    # has not been successfully completed.
    #
    
    # bail out on first error
    set -e
    
    # debug
    set -x
    
    # master sample manifest
    MANIFEST=/path/to/samples.txt
    
    # file to output TODO list
    TODO=todo
    [[ -f $TODO ]] && rm -f $TODO
    touch $TODO
    
    # iterate through samples in manifest
    for SAMPLE in `cat $MANIFEST`
    do
    
        # expected location of output and MD5 verification files
        OUTFILE=outputs/${SAMPLE}.flagstat
        MD5FILE=${OUTFILE}.md5
    
        # check if MD5 file already exists and matches output file
        if [[ ! -f $OUTFILE || ! -f $MD5FILE || `md5sum ${OUTFILE} | cut -f1 -d" "` != `cat ${MD5FILE} | cut -f1 -d" "` ]]
        then
    
            echo $SAMPLE >> $TODO
    
        fi
    
    done
    

    The second script – qsub.sh – submits an array job for only those samples in the TODO list, e.g.:

    #!/bin/bash
    
    # 
    # Job submission script. 
    #
    
    # bail out on first error
    set -e
    
    # make sure variables are set
    set -u
    
    # debug
    set -x
    
    # job name
    JOBNAME=pf09_flagstat
    
    # main script
    MAIN=./main.sh
    
    # list of samples to be processed (assume first argument)
    TODO=$1
    
    # are there any samples to do?
    NUMTASKS=`wc -l $TODO | cut -f1 -d" "`
    
    if [[ $NUMTASKS -eq 0 ]]
    then
    
        echo "Nothing to do."
        exit 0
    
    else
    
        echo "Submitting an array job with $NUMTASKS tasks..."
    
        # submit an array job
        qsub -S /bin/bash \
            -N $JOBNAME \
            -m beas \
            -M alimanfoo@googlemail.com \
            -cwd \
            -l vf=40M \
            -l h_vmem=100M \
            -l h_rt=1:59:0 \
            -t 1-$NUMTASKS \
            -o history \
            -j y \
            $MAIN $TODO
    
    fi
    

    The third script – main.sh – does the actual work of a task, e.g.:

    #!/bin/bash
    
    #
    # This script generates summary statistics using samtools flagstat for
    # a single sample. It is assumed this script will be run as part of an
    # SGE array job.
    #
    
    # bail out on first error
    set -e
    
    # make sure variables are set
    set -u
    
    # debug
    set -x
    
    # main executable
    SAMTOOLS=/path/to/samtools
    
    # sample manifest for the job (assume first argument)
    MANIFEST=$1
    
    # dereference task ID to sample ID (Oxford code)
    SAMPLE=`awk "NR==${SGE_TASK_ID}" ${MANIFEST}`
    
    # path to BWA BAM file
    BAMFILE=/path/to/${SAMPLE}.bam
    
    # location of output and MD5 verification files
    OUTFILE=outputs/${SAMPLE}.flagstat
    MD5FILE=${OUTFILE}.md5
    
    # do the work
    $SAMTOOLS flagstat $BAMFILE > $OUTFILE
    
    # write MD5 verification file
    echo `md5sum $OUTFILE` > $MD5FILE
    

    So I can run mktodo.sh and then inspect the todo file to see if there are any samples to run, followed by qsub.sh todo to run the analyses for just those samples.

    A similar thing could also be done using a heredoc instead of the qsub.sh script, but I find bash scripts hard enough to read at the best of times, so I decided against writing a bash script to generate a bash script. I know I’d come back to it in a few months’ time and find it hard to decipher.

  2. I just realised that md5sum has a --check option, so in the mktodo.sh script instead of doing:

        if [[ ! -f $OUTFILE || ! -f $MD5FILE || `md5sum ${OUTFILE} | cut -f1 -d" "` != `cat ${MD5FILE} | cut -f1 -d" "` ]]
        then
            echo $SAMPLE >> $TODO
        fi
    

    …I can just do:

    md5sum --check $MD5FILE || echo $SAMPLE >> $TODO
    

    To make this work, I also need to do md5sum $OUTFILE > $MD5FILE in the main.sh script, instead of echo `md5sum $OUTFILE` > $MD5FILE. I’m not sure why I was doing that in the first place: the echo is unnecessary, and in this case it slightly mangles the output from md5sum, because the unquoted backticks collapse the two-character separator between the hash and the file name into a single space.

  3. Carlo says:

    Thanks for posting this. I’ve had the same problem, and oddly enough, this was the only relevant Google result on the issue. (One of the links you refer to is dead.)

    Just thought you’d like to know someone out there appreciates this!
