Posts Tagged ‘falciparum’

Python CSV Validator Library

Posted: 21 July 2011 by Alistair Miles in Uncategorized

As part of ongoing data quality-assurance work for MalariaGEN’s P. falciparum Genome Variation project, I’ve written a small Python library called csvvalidator for validating data in CSV files or similar row-oriented data sources.

The source code for csvvalidator is on GitHub, and you can find csvvalidator on the Python Package Index (so you can run easy_install csvvalidator).

Here’s a simple example:
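The sketch below is not csvvalidator's own API. It is a standard-library illustration of the same idea (header, record length and per-value checks on row-oriented data), assuming a tab-delimited file with a header row. The names validate_rows and one_of, the field names and the problem codes are all made up for this illustration.

```python
import csv
import io


def one_of(*allowed):
    """Return a check that raises ValueError for values outside `allowed`."""
    def check(value):
        if value not in allowed:
            raise ValueError(value)
    return check


def validate_rows(rows, field_names, value_checks):
    """Yield one problem dict per failed check (hypothetical helper)."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None or tuple(header) != tuple(field_names):
        yield {"row": 1, "code": "HEADER", "message": "bad header"}
    # Data rows start at line 2 because line 1 is the header.
    for row_number, row in enumerate(rows, start=2):
        if len(row) != len(field_names):
            yield {"row": row_number, "code": "LENGTH",
                   "message": "unexpected record length"}
            continue
        record = dict(zip(field_names, row))
        for field, check, code, message in value_checks:
            try:
                check(record[field])
            except ValueError:
                yield {"row": row_number, "code": code,
                       "field": field, "message": message}


# Made-up sample data: row 3 has a bad id, row 4 is short, row 5 has a bad gender.
SAMPLE = "study_id\tgender\tage_years\n1\tM\t34\nx\tF\t27\n3\tQ\n4\tQ\t19\n"
FIELDS = ("study_id", "gender", "age_years")
CHECKS = [
    ("study_id", int, "V1", "study id must be an integer"),
    ("gender", one_of("M", "F"), "V2", "invalid gender"),
    ("age_years", int, "V3", "age must be an integer"),
]

reader = csv.reader(io.StringIO(SAMPLE), delimiter="\t")
problems = list(validate_rows(reader, FIELDS, CHECKS))
for problem in problems:
    print(problem)
```

Reporting problems as plain dictionaries keeps the checks separate from how the results are written out, so the same list can be printed, counted or saved to a file.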



This week I’ve been doing quality assurance work on some data we’re about to send back to partners of the P. falciparum Genome Variation project. These data include some SNP lists – files listing positions in the P. falciparum genome believed to be variable from one parasite to another. To make these files useful, it helps to include genome annotations – information about which gene (if any) can be found at each variable position. Constructing these files means joining a list of variable positions with a set of genome annotations, where each annotation has a start and end position on some chromosome. In other words, for each variable position, I need to find all genome annotations that overlap that position.

Because I need to do this lookup once for each of about a million SNPs, I wanted to know which algorithm would be most efficient for this type of query. It turns out that interval trees are the way to go (thanks to Lee for discovering this). It also turns out that a package called bx-python includes an implementation of interval trees tailored for searching genome annotations, which is very handy as I’ve been writing my QA scripts in Python.
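To give a feel for why an interval tree suits this kind of lookup, here is a minimal centered interval tree in pure Python. It is a sketch only, not the bx-python implementation: the IntervalNode class, the closed [start, end] coordinate convention and the gene names are assumptions made for the example.

```python
class IntervalNode:
    """One node of a centered interval tree over closed intervals [start, end]."""

    def __init__(self, intervals):
        # `intervals` is a non-empty list of (start, end, value) tuples.
        endpoints = sorted(p for start, end, _ in intervals for p in (start, end))
        # The center is itself an endpoint, so at least one interval contains it.
        self.center = endpoints[len(endpoints) // 2]
        left, right, here = [], [], []
        for iv in intervals:
            if iv[1] < self.center:
                left.append(iv)       # lies entirely left of the center
            elif iv[0] > self.center:
                right.append(iv)      # lies entirely right of the center
            else:
                here.append(iv)       # contains the center
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def find(self, pos):
        """Return every stored interval that contains `pos`."""
        hits = []
        if pos < self.center:
            # Intervals here end at or after the center, so only starts matter.
            for iv in self.by_start:
                if iv[0] > pos:
                    break
                hits.append(iv)
            if self.left is not None:
                hits.extend(self.left.find(pos))
        elif pos > self.center:
            # Intervals here start at or before the center, so only ends matter.
            for iv in self.by_end:
                if iv[1] < pos:
                    break
                hits.append(iv)
            if self.right is not None:
                hits.extend(self.right.find(pos))
        else:
            hits.extend(self.by_start)
        return hits


# Made-up annotations on a single chromosome.
genes = [(100, 500, "geneA"), (400, 900, "geneB"), (1200, 1500, "geneC")]
tree = IntervalNode(genes)
print(sorted(value for _, _, value in tree.find(450)))
```

Each query walks a single path down the tree and only inspects intervals stored along that path, which is what makes it cheaper than scanning every annotation for each SNP.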

On my Ubuntu desktop, installing bx-python is as easy as running sudo easy_install bx-python. There are also instructions for manually installing bx-python if you don’t have access to easy_install.

Below is a snippet from one of my QA scripts which uses the IntervalTree class from bx-python and builds a set of interval trees from a GFF3 annotations file.
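As a rough stand-in for that step, the sketch below reads gene features from GFF3 text and groups them by chromosome using only the standard library. It is not the QA script itself and does not call bx-python: the function names, the sample records and the choice to keep only "gene" features are assumptions for illustration, and each per-chromosome list is where an interval tree would take the place of the linear scan.

```python
from collections import defaultdict


def read_gff3_intervals(lines, feature_type="gene"):
    """Group (start, end, attributes) tuples by sequence id (hypothetical helper)."""
    by_chrom = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more annotation rows
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        # GFF3 rows have nine tab-separated columns:
        # seqid, source, type, start, end, score, strand, phase, attributes
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        seqid, start, end, attributes = cols[0], int(cols[3]), int(cols[4]), cols[8]
        by_chrom[seqid].append((start, end, attributes))
    return by_chrom


def overlapping(by_chrom, chrom, pos):
    """Linear scan for annotations containing `pos` (1-based, closed intervals)."""
    return [iv for iv in by_chrom.get(chrom, []) if iv[0] <= pos <= iv[1]]


# Made-up records; real annotation files will have different ids and sources.
SAMPLE_GFF3 = """##gff-version 3
chr1\texample\tgene\t100\t500\t.\t+\t.\tID=gene0001
chr1\texample\texon\t100\t200\t.\t+\t.\tID=exon0001
chr1\texample\tgene\t400\t900\t.\t-\t.\tID=gene0002
chr2\texample\tgene\t50\t80\t.\t+\t.\tID=gene0003
"""

annotations = read_gff3_intervals(SAMPLE_GFF3.splitlines())
for start, end, attributes in overlapping(annotations, "chr1", 450):
    print(start, end, attributes)
```

Keeping one collection per chromosome means a SNP is only ever compared against annotations on its own chromosome, whichever lookup structure is used underneath.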