Archive for July, 2011
Ubuntu Cloud Days
Posted: 26 July 2011 by Robert Hutton in System Administration
Tags: amazon, cloud, clouddays, cloudinit, compute, ec2, ensemble, enterprise, eucalyptus, irc, mongo, mysql, node, nova, openstack, orchestra, ubuntu
As we’re starting to get into cloud technology in a big way here, I decided to take part in Ubuntu’s IRC training days on the subject, Ubuntu Cloud Days. While the material covered was often more low-level than what we are likely to use in the short term, I found the sessions on Ensemble and CloudInit to be particularly useful, as they’ll be directly applicable to our upcoming GWAS work on the EC2 cloud.
Python CSV Validator Library
Posted: 21 July 2011 by Alistair Miles in Uncategorized
Tags: csv, falciparum, python, qa, testing
As part of ongoing data quality-assurance work for MalariaGEN’s P. falciparum Genome Variation project, I’ve written a small Python library called csvvalidator for validating data in CSV files or similar row-oriented data sources.
The source code for csvvalidator is on GitHub, and you can find csvvalidator on the Python package index (so you can do easy_install csvvalidator).
Here’s a simple example:
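To give a flavour of row-oriented validation without relying on the library itself, here is a minimal sketch that uses only Python’s standard csv module. It is not csvvalidator’s API: the function validate_rows, the problem codes, and the field names (sample_id, position, quality) are all hypothetical, chosen purely for illustration.

```python
import csv
import io


def validate_rows(source, field_names, value_checks):
    """Return a list of problems found in a CSV source.

    source       -- any iterable of text lines (e.g. an open file)
    field_names  -- the expected header, in order
    value_checks -- (field, converter, message) triples; a converter that
                    raises ValueError marks the value as invalid
    """
    reader = csv.reader(source)
    problems = []

    # The header is row 1 and must match the expected field names exactly.
    header = next(reader, None)
    if header is None or tuple(header) != tuple(field_names):
        problems.append({"row": 1, "code": "HEADER", "message": "unexpected header"})

    for row_num, row in enumerate(reader, start=2):
        # A row with the wrong number of fields cannot be checked further.
        if len(row) != len(field_names):
            problems.append(
                {"row": row_num, "code": "LENGTH", "message": "unexpected record length"}
            )
            continue
        record = dict(zip(field_names, row))
        for field, converter, message in value_checks:
            try:
                converter(record[field])
            except ValueError:
                problems.append(
                    {"row": row_num, "code": "VALUE", "field": field, "message": message}
                )
    return problems


# Hypothetical data: row 3 has a non-integer position, row 4 is too short.
data = io.StringIO(
    "sample_id,position,quality\n"
    "S1,1204,0.93\n"
    "S2,abc,0.88\n"
    "S3,991\n"
)
fields = ("sample_id", "position", "quality")
checks = [
    ("position", int, "position must be an integer"),
    ("quality", float, "quality must be a number"),
]
problems = validate_rows(data, fields, checks)
for problem in problems:
    print(problem)
```

Collecting problems as plain dictionaries, rather than raising on the first failure, lets a QA script report every issue in a file in a single pass.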
Using Interval Trees to Query Genome Annotations by Position
Posted: 7 July 2011 by Alistair Miles in Uncategorized
Tags: falciparum, genomeannotations, gff, gff3, intervaltree, python, qualityassurance, snps
This week I’ve been doing quality assurance work on some data we’re about to send back to partners of the P. falciparum Genome Variation project. These data include some SNP lists – files listing positions in the P. falciparum genome believed to be variable from one parasite to another. To make these files useful, it helps to include genome annotations – information about which gene (if any) can be found at each variable position. Constructing these files means joining a list of variable positions with a set of genome annotations, where each annotation has a start and end position on some chromosome. I.e., for each variable position, find all genome annotations overlapping that position.
Because I need to do this lookup once for each of about a million SNPs, I wanted to know what the most efficient algorithm for doing this type of query would be. It turns out that Interval Trees are the way to go (thanks Lee for discovering this). It also turns out that there is an implementation of interval trees tailored for searching genome annotations in a package called bx-python, which is very handy as I’ve been writing my QA scripts in Python.
On my Ubuntu desktop, installing bx-python is as easy as sudo easy_install bx-python. There are also instructions for manually installing bx-python if you don’t have access to easy_install.
Below is a snippet from one of my QA scripts which uses the IntervalTree class from bx-python and builds a set of interval trees from a GFF3 annotations file.
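To illustrate the underlying idea independently of bx-python, here is a minimal sketch of a centered interval tree in plain Python, with one tree built per chromosome from GFF3 lines. It does not use or reproduce bx-python’s IntervalTree API: the names IntervalNode, build_trees and annotations_at, and the example annotations (chr1, gene_a and so on), are hypothetical. It assumes the standard nine tab-separated GFF3 columns with 1-based, inclusive start and end coordinates.

```python
from collections import defaultdict


class IntervalNode:
    """Node of a centered interval tree over closed intervals [start, end]."""

    def __init__(self, intervals):
        # Use the median endpoint as the center. Because the center is an
        # endpoint of some interval, at least one interval is stored here,
        # so the recursion always terminates.
        endpoints = sorted(p for start, end, _ in intervals for p in (start, end))
        self.center = endpoints[len(endpoints) // 2]
        left, right, here = [], [], []
        for iv in intervals:
            if iv[1] < self.center:
                left.append(iv)       # lies entirely left of the center
            elif iv[0] > self.center:
                right.append(iv)      # lies entirely right of the center
            else:
                here.append(iv)       # spans the center
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def find(self, pos):
        """Return every stored interval that contains pos."""
        hits, node = [], self
        while node is not None:
            if pos < node.center:
                # Intervals here end at or after the center, so only the
                # start needs checking.
                for iv in node.by_start:
                    if iv[0] > pos:
                        break
                    hits.append(iv)
                node = node.left
            elif pos > node.center:
                # Intervals here start at or before the center, so only the
                # end needs checking.
                for iv in node.by_end:
                    if iv[1] < pos:
                        break
                    hits.append(iv)
                node = node.right
            else:
                hits.extend(node.by_start)
                break
        return hits


def build_trees(gff3_lines):
    """Build one interval tree per sequence ID from GFF3 lines."""
    by_chrom = defaultdict(list)
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break                     # any embedded sequence data follows
        if not line or line.startswith("#"):
            continue                  # skip comments and directives
        cols = line.split("\t")
        start, end = int(cols[3]), int(cols[4])
        by_chrom[cols[0]].append((start, end, (cols[2], cols[8])))
    return {chrom: IntervalNode(ivs) for chrom, ivs in by_chrom.items()}


def annotations_at(trees, chrom, pos):
    """Return all annotations overlapping pos on chrom, sorted by start."""
    tree = trees.get(chrom)
    return sorted(tree.find(pos)) if tree else []


# Hypothetical annotations, for illustration only.
GFF3_EXAMPLE = [
    "##gff-version 3",
    "chr1\texample\tgene\t100\t500\t.\t+\t.\tID=gene_a",
    "chr1\texample\tgene\t400\t900\t.\t-\t.\tID=gene_b",
    "chr2\texample\tgene\t50\t150\t.\t+\t.\tID=gene_c",
]
trees = build_trees(GFF3_EXAMPLE)
print(annotations_at(trees, "chr1", 450))
```

Each lookup walks a single root-to-leaf path and scans only as far as the matching intervals at each node, which is what makes this structure attractive when the same annotation set is queried once per SNP for around a million SNPs.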
Ubuntu on Amazon EC2 from scratch
Posted: 4 July 2011 by Robert Hutton in HOWTOs, System Administration
Tags: amazon, aws, certificate, cloud, credentials, ec2, groups, image, keypair, security, ssh, ubuntu, X.509
Update: I’ve now rolled this blog post into the Ubuntu wiki’s EC2 Starters Guide page. Hopefully this helps out the Ubuntu community!
The informatics team here at MalariaGEN have been working with EC2 since before I joined them. So naturally, it’s a technology I’ve had to get to grips with in the course of doing my job. For me, EC2 had a fairly steep learning curve, and after spending a while trying to learn it by doing, I decided that I would have to set aside some time to understand properly how things worked. As part of that, I decided to document it in a way that I’d not yet seen on the web: logically, comprehensively, explaining all the strange concepts and quirks that were clouding my understanding and stopping me from getting my job done efficiently. (more…)