This week I’ve been doing quality assurance work on some data we’re about to send back to partners of the P. falciparum Genome Variation project. These data include some SNP lists – files listing positions in the P. falciparum genome believed to be variable from one parasite to another. To make these files useful, it helps to include genome annotations – information about which gene (if any) can be found at each variable position. Constructing these files means joining a list of variable positions with a set of genome annotations, where each annotation has a start and end position on some chromosome. I.e., for each variable position, find all genome annotations overlapping that position.
Because I need to do this lookup once for each of about a million SNPs, I wanted to know what the most efficient algorithm for doing this type of query would be. It turns out that Interval Trees are the way to go (thanks Lee for discovering this). It also turns out that there is an implementation of interval trees tailored for searching genome annotations in a package called bx-python, which is very handy as I’ve been writing my QA scripts in Python.
On my Ubuntu desktop installing bx-python is as easy as sudo easy_install bx-python
. There are also instructions for manually installing bx-python if you don’t have access to easy_install.
Below is a snippet from one of my QA scripts which uses the IntervalTree
class from bx-python and builds a set of interval trees from a GFF3 annotations file.
import csv from bx.intervals.intersection import IntervalTree def index_gff3(gff3_file_path): # dictionary mapping chromosome names to interval trees genome = dict() # parse the annotations file (GFF3) and build the interval trees with open(gff3_file_path, 'r') as annotations_file: reader = csv.reader(annotations_file, delimiter='\t') for row in reader: if len(row) == 9 and not row[0].startswith('##'): seqid = row[0] start = int(row[3]) end = int(row[4]) tree = None # one interval tree per chromosome if seqid in genome: tree = genome[seqid] else: # first time we've encountered this chromosome, create an interval tree tree = IntervalTree() genome[seqid] = tree # index the feature tree.add(start, end, tuple(row)) return genome
I can then use this function to do things like…
genome = index_gff3('/path/to/annotations.gff') annotations = genome['apidb|MAL1'].find(474889, 474889) # find annotations overlapping a single position annotations = genome['apidb|MAL1'].find(470000, 480000) # find annotations overlapping an interval
I discovered that, if you want to find annotations overlapping a single position at the start or end of the annotation, then you need to do: