Archive for May, 2011

Intro to AtomBeat

Posted: 27 May 2011 by Alistair Miles in Uncategorized

This post is a brief introduction to AtomBeat, a piece of software I’ve been coding on and off for the past year or so. AtomBeat is a web application that you can use as a generic content and data repository. It implements the Atom Publishing Protocol, a standard for data-centric web services that is also the basis for things like GData, OData, CMIS and SWORD. AtomBeat also implements a number of more-or-less standard extensions to Atom, including fine-grained access control policies and versioning of data resources. We had several applications with similar underlying needs for managing data and content and for exposing functionality to other applications via web services, so AtomBeat is an attempt to factor those common capabilities out into a generic piece of software.
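To give a flavour of what talking to an Atom Publishing Protocol service looks like, here is a rough Python sketch of the "create a member" step: build an Atom entry and describe the POST that would add it to a collection. It is not taken from AtomBeat. The function names, the collection URL and the entry values are made up for illustration, and nothing is actually sent over the network.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_atom_entry(entry_id, title, updated, author_name, summary):
    """Return a small Atom entry document as bytes."""
    ET.register_namespace("", ATOM_NS)  # serialise Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    author = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author, f"{{{ATOM_NS}}}name").text = author_name
    ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = summary
    return ET.tostring(entry, encoding="utf-8")


def describe_create_request(collection_url, entry_bytes):
    """Describe the HTTP request used to add a new member to a collection."""
    return {
        "method": "POST",
        "url": collection_url,
        "headers": {"Content-Type": "application/atom+xml;type=entry"},
        "body": entry_bytes,
    }


# Hypothetical values, for illustration only.
example_entry = build_atom_entry(
    entry_id="urn:uuid:00000000-0000-0000-0000-000000000001",
    title="Sample record",
    updated="2011-05-27T12:00:00Z",
    author_name="Example Author",
    summary="A placeholder entry.",
)
example_request = describe_create_request(
    "http://example.org/atombeat/content/samples", example_entry
)
```

In the protocol, a successful POST like this is answered with a 201 Created response and a Location header pointing at the new member, which can then be fetched, updated or deleted with GET, PUT and DELETE.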

AtomBeat is coded almost entirely in XQuery – about 14,000 lines of it at last count – which, you might think, makes me totally mad. But AtomBeat uses the eXist XML database as the underlying persistence engine, and because Atom is an XML-based protocol, using XQuery to work with XML both at the protocol level and the persistence level is actually very convenient, once you’ve established a few patterns. I’d hate to think how many lines of code it would have taken if I’d done it all in Java using DOM and SAX APIs.

Anyway, I’ll leave it there for now, but if you’d like to know more check out the AtomBeat wiki and the release notes.


Ubuntu on EC2

Posted: 27 May 2011 by Alistair Miles in Uncategorized

This is just a short post to say, if you’re into Ubuntu, and you’re into Amazon EC2, then Eric Hammond’s alestic.com, Scott Moser’s blog, and the ec2-ubuntu mailing list are indispensable resources.

SNP-o-matic

Posted: 27 May 2011 by Alistair Miles in Uncategorized

One of our main tasks is to develop and run a data processing pipeline that takes in raw sequencing data from many different parasite DNA samples, then finds all the positions in the parasite genome where some of these samples are different from others (“SNP discovery”), then makes a “call” for each sample and for each variable position in the genome (“SNP”) as to which nucleotide (A, C, T or G) is found there (“genotyping”).
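To make the two steps concrete, here is a deliberately naive Python sketch. It is not how our pipeline or SNP-o-matic works. It assumes the reads have already been aligned, so that for each sample and position we simply have a string of the bases observed there. The function names, the majority-vote rule and the `min_depth` threshold are all invented for illustration.

```python
from collections import Counter


def call_genotype(bases, min_depth=3):
    """Genotyping: return the most common base, or None if coverage is too low."""
    if len(bases) < min_depth:
        return None
    return Counter(bases).most_common(1)[0][0]


def discover_snps(pileups, min_depth=3):
    """SNP discovery: keep positions where the called samples disagree.

    pileups maps sample name -> {position: string of observed bases}.
    Returns {position: {sample name: called base or None}}.
    """
    positions = sorted({pos for sample in pileups.values() for pos in sample})
    snps = {}
    for pos in positions:
        calls = {
            name: call_genotype(sample.get(pos, ""), min_depth)
            for name, sample in pileups.items()
        }
        alleles = {base for base in calls.values() if base is not None}
        if len(alleles) > 1:  # at least two samples carry different bases
            snps[pos] = calls
    return snps


# Toy data: position 100 is the same everywhere, position 200 varies.
toy_pileups = {
    "sample1": {100: "AAAA", 200: "CCCC"},
    "sample2": {100: "AAAT", 200: "GGGG"},
    "sample3": {100: "AA", 200: "CCGC"},
}
toy_snps = discover_snps(toy_pileups)
# toy_snps == {200: {"sample1": "C", "sample2": "G", "sample3": "C"}}
```

A majority vote like this ignores sequencing error rates and alignment quality, and it cannot represent a sample that carries more than one allele at a position.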

Once we’ve done that, we can tell you something about how one parasite is different from another, or how one group of parasites is different from another group, which might be interesting if, for example, one or more of those parasites are resistant to a particular drug.

Anyway, this is harder than you might think, and is complicated by all sorts of factors. We have to work with lots and lots of relatively short pieces of DNA. The malaria parasite genome has long stretches of really repetitive sequence. A DNA sample might come from a person who was infected with 2 or more different parasites (a “mixed infection”), so you’re actually looking at 2 or more genomes in one sample. And on top of all that there is a lot of data, so use of memory and compute power needs to be efficient.

This is something we’ve been working on for a couple of years (well, I say “we”, but I take no credit, this is all down to Magnus, Gareth, Dominic and others). Back in 2009 Magnus published an article on something called SNP-o-matic, which is a piece of software he wrote to perform SNP discovery and genotyping for parasite samples, and to deal with some of the complications I’ve mentioned above. SNP-o-matic is still a key part of our pipelines, and Magnus has just finished some work rewriting it, primarily to add some new features for finding different types of genetic variation. But this is not a solved problem, and we know we can do better, so this is something that’s going to keep us busy for a while yet!

This is fairly old news now, but there was an article in Wired magazine last year on MapSeq, which is a web application developed by Jacob, Olivo, Magnus and Dushy for working with genomic variation data from the malaria parasite.

MapSeq does quite a few fierce things, but the feature I like most is its interactive principal components analysis (PCA) plots of genetic variation, which are drawn with the HTML5 canvas and a good dollop of JavaScript (I believe; Jacob, correct me if I’m wrong). These plots show really clearly how parasites from different geographical regions (e.g., South-East Asia vs Africa) are genetically more distinct than parasites from within the same region. The screenshot below shows a PCA plot I just made which segregates samples from Africa (the group on the left) from samples from Thailand and Cambodia (the group on the right).

Principal Components Analysis plot showing genetic differences between malaria parasite populations
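For anyone curious what sits behind a plot like this, here is a rough Python sketch of the idea, not MapSeq's actual code (which, as far as I know, is JavaScript). It assumes each sample has been reduced to a row of 0/1 allele codes, one per SNP, and finds each sample's position along the first principal component using power iteration. The function name and the toy data are made up.

```python
import math
import random


def first_pc_scores(genotypes, iterations=100, seed=0):
    """Project each sample onto the first principal component.

    genotypes is a list of rows, one per sample, each holding numeric
    allele codes (here 0 or 1) for the same set of SNPs.
    """
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    # Centre every SNP column on its mean across samples.
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_snps)]
    centred = [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]

    # Power iteration: repeatedly apply X^T X to a random starting axis.
    rng = random.Random(seed)
    axis = [rng.uniform(-1.0, 1.0) for _ in range(n_snps)]
    for _ in range(iterations):
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centred]
        axis = [
            sum(centred[i][j] * scores[i] for i in range(n_samples))
            for j in range(n_snps)
        ]
        norm = math.sqrt(sum(a * a for a in axis))
        if norm == 0.0:  # no variation at all in the data
            return [0.0] * n_samples
        axis = [a / norm for a in axis]
    return [sum(x * a for x, a in zip(row, axis)) for row in centred]


# Toy data: two made-up groups that differ at most SNPs.
group_a = [[0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
group_b = [[1, 1, 1, 1, 1, 0], [1, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1]]
toy_scores = first_pc_scores(group_a + group_b)
```

With this toy data the three samples in each group end up on the same side of zero, and the two groups on opposite sides, which is the one-dimensional version of the left/right split in the screenshot. The overall sign of the axis is arbitrary, so which group lands on the left is not meaningful.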

This is what you’d expect, because there is much more malaria transmission within a region than between different parts of the world. But the interactive plots really come into their own when you start to zoom in on a particular region. You can start to ask questions like, how different are the Thai and Cambodian populations from each other? Are there distinct sub-populations within Thailand or Cambodia?

Of course, the holy grail would be to add a temporal dimension to this. If we could see how parasite populations are changing genetically over time, and relate that to geographical location, we might be able to see drug resistance genes migrating from Cambodia to Thailand – evolution in action! But I’m getting ahead of myself, one step at a time…

Hello world!

Posted: 26 May 2011 by Alistair Miles in Uncategorized
System.out.println("Welcome to the MalariaGEN Informatics team blog!");

We are a small team of computer scientists, software engineers, web developers and sys/db-admins, working on systems to support the Malaria Genomic Epidemiology Network. If you’d like to know more, read about us.