Posts Tagged ‘python’

I just stumbled upon Brad Chapman’s Blue Collar Bioinformatics blog. It looks like a great resource; here are a few tidbits…

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. …

Parallel upload to Amazon S3 with python, boto and multiprocessing – One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores. …

Next generation sequencing information management and analysis system for Galaxy – Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy. …

CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 …

Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo.

Python CSV Validator Library

Posted: 21 July 2011 by Alistair Miles in Uncategorized

As part of ongoing data quality-assurance work for MalariaGEN’s P. falciparum Genome Variation project, I’ve written a small Python library called csvvalidator for validating data in CSV files or similar row-oriented data sources.

The source code for csvvalidator is on GitHub, and you can find csvvalidator on the Python Package Index (so you can do easy_install csvvalidator).

Here’s a simple example:
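A minimal sketch of what this kind of row-oriented validation looks like, written with only the standard library csv module rather than csvvalidator's own API. The function name validate_rows, the field names, and the sample data are all hypothetical and chosen purely for illustration:

```python
import csv
import io


def validate_rows(rows, field_names, value_checks):
    """Yield one problem dict for each failed check.

    rows         -- iterable of lists, e.g. a csv.reader; first row is the header
    field_names  -- expected header, in order
    value_checks -- list of (field, callable, message); the callable should
                    raise ValueError when a value is invalid
    """
    rows = iter(rows)
    header = next(rows, None)
    if header is None or tuple(header) != tuple(field_names):
        yield {"row": 1, "message": "unexpected header"}
        return
    # Data rows start at line 2 because line 1 is the header.
    for row_num, row in enumerate(rows, start=2):
        if len(row) != len(field_names):
            yield {"row": row_num, "message": "wrong number of fields"}
            continue
        record = dict(zip(field_names, row))
        for field, check, message in value_checks:
            try:
                check(record[field])
            except ValueError:
                yield {
                    "row": row_num,
                    "field": field,
                    "value": record[field],
                    "message": message,
                }


# Hypothetical sample data: two of the three data rows contain a bad value.
field_names = ("study_id", "patient_id", "age")
checks = [
    ("study_id", int, "study id must be an integer"),
    ("age", int, "age must be an integer"),
]
data = io.StringIO(
    "study_id,patient_id,age\n"
    "1,100,32\n"
    "foo,101,45\n"
    "2,102,abc\n"
)
problems = list(validate_rows(csv.reader(data), field_names, checks))
```

Each problem records the row number, the offending field and value, and a message, so a report can be written out once the whole file has been scanned instead of stopping at the first error.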


This week I’ve been doing quality assurance work on some data we’re about to send back to partners of the P. falciparum Genome Variation project. These data include some SNP lists – files listing positions in the P. falciparum genome believed to be variable from one parasite to another. To make these files useful, it helps to include genome annotations – information about which gene (if any) can be found at each variable position. Constructing these files means joining a list of variable positions with a set of genome annotations, where each annotation has a start and end position on some chromosome. I.e., for each variable position, find all genome annotations overlapping that position.

Because I need to do this lookup once for each of about a million SNPs, I wanted to know what the most efficient algorithm for doing this type of query would be. It turns out that Interval Trees are the way to go (thanks Lee for discovering this). It also turns out that there is an implementation of interval trees tailored for searching genome annotations in a package called bx-python, which is very handy as I’ve been writing my QA scripts in Python.

On my Ubuntu desktop, installing bx-python is as easy as running sudo easy_install bx-python. There are also instructions for manually installing bx-python if you don’t have access to easy_install.

Below is a snippet from one of my QA scripts which uses the IntervalTree class from bx-python and builds a set of interval trees from a GFF3 annotations file.
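As a self-contained illustration of the same idea, the sketch below builds one interval tree per chromosome from GFF3 lines and looks up the annotations overlapping a position. It uses a small pure-Python centered interval tree as a stand-in, not bx-python's IntervalTree class, and the names IntervalNode, build_trees and annotations_at, along with the sample GFF3 records, are invented for this example:

```python
from collections import defaultdict


class IntervalNode:
    """Centered interval tree over (start, end, value) tuples, ends inclusive."""

    def __init__(self, intervals):
        # Use the median endpoint as the center; at least one interval
        # always contains it, so every recursive call gets a smaller list.
        points = sorted(p for s, e, _ in intervals for p in (s, e))
        self.center = points[len(points) // 2]
        left, right, here = [], [], []
        for iv in intervals:
            if iv[1] < self.center:
                left.append(iv)
            elif iv[0] > self.center:
                right.append(iv)
            else:
                here.append(iv)
        # Intervals spanning the center, sorted two ways for early exit.
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def find(self, pos):
        """Return every interval that contains pos."""
        if pos == self.center:
            return list(self.by_start)
        hits = []
        if pos < self.center:
            for iv in self.by_start:
                if iv[0] > pos:
                    break
                hits.append(iv)
            child = self.left
        else:
            for iv in self.by_end:
                if iv[1] < pos:
                    break
                hits.append(iv)
            child = self.right
        if child is not None:
            hits.extend(child.find(pos))
        return hits


def build_trees(gff_lines):
    """Build {seqid: IntervalNode} from GFF3 lines (9 tab-separated columns)."""
    by_chrom = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, _source, ftype, start, end = cols[:5]
        # GFF3 coordinates are 1-based and inclusive at both ends.
        by_chrom[seqid].append((int(start), int(end), (ftype, cols[8])))
    return {chrom: IntervalNode(ivs) for chrom, ivs in by_chrom.items()}


def annotations_at(trees, chrom, pos):
    """Return the sorted attribute strings of features overlapping chrom:pos."""
    tree = trees.get(chrom)
    if tree is None:
        return []
    return sorted(value[1] for _, _, value in tree.find(pos))


# Made-up annotation records, for illustration only.
gff = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "chr1\texample\tgene\t400\t900\t.\t-\t.\tID=geneB",
    "chr2\texample\tgene\t50\t150\t.\t+\t.\tID=geneC",
]
trees = build_trees(gff)
overlapping = annotations_at(trees, "chr1", 450)
```

Building the trees once and then querying them for each SNP avoids scanning the full annotation list for every position, which is what makes this approach attractive for around a million lookups.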