
I just spent a very pleasant afternoon catching up with colleagues at the Image Bioinformatics Research Group, based in the department of Zoology here in Oxford. Here’s a few tidbits I picked up …

Tanya Gray is working on the MIIDI standard (Minimum Information for an Infectious Disease Investigation) and associated tools. She’s done some very nice work on a MIIDI metadata editor, using eXist and Orbeon Forms, with her own additions to generate XForms from an annotated XML Schema. Tanya’s also working on the DryadUK project, a data repository supporting publication of data associated with journal articles.

Stephen Wan (visiting from CSIRO) has developed a cool extension for Firefox (and now Chrome) called IBES (In-Browser Elaborative Summariser). If you point it at Wikipedia, for each link you hover over it shows a summary of the page at that link, built intelligently from the link’s context. Then if you navigate to the link, it tells you where you came from. Very handy if (like me) each visit to Wikipedia is a rambling journey, and you often forget why you went there in the first place. He’s also done some related work to help navigate citations in scholarly articles, called CSIBS (The Citation-Sensitive In-Browser Summarizer).

Alex Dutton is working on the JISC Open Citations project. He has some nice visualisations of citation networks (although one of the articles in that graph looks like it cites itself – if only that were possible :). The graphs are generated using dot from an RDF representation of metadata from the PubMed Central open-access journal articles. All of the usual dot options are available, so you can play with how the networks get rendered. The whole site is driven by SPARQL, and the bottom of each page shows the SPARQL queries used to generate the page content, so you can see what’s going on under the hood.

Bhavana Ananda is working on the JISC DataFlow project, the DataStage component of which is a follow-on from previous work by Graham Klyne on the Admiral project. I think the philosophy of simple tools to help research groups manage and share their data with each other has a lot of traction, and I think it’s great they’ve got funding to turn the Admiral prototypes into something more.

Graham Klyne is embroiled in the Workflow 4Ever project, and we had a great chat about possible connections with managing our Plasmodium SNP discovery and genotyping pipelines for MalariaGEN. I’m now expecting Graham to solve all my problems.

And David Shotton (group head) is, as always, making it all happen. It was great to raise my head above the trenches for a few hours, I need to do that more often.


I just stumbled upon Brad Chapman’s Blue Collar Bioinformatics blog. It looks like a great resource; here are a few tidbits…

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. …

Parallel upload to Amazon S3 with python, boto and multiprocessing – One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores. …
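The chunk-and-fan-out idea behind that post can be sketched as follows. This is only an illustration of splitting a file over multiple processing cores with the standard library’s multiprocessing module; the S3 call itself is stubbed out (in the real thing it would be made with boto’s multipart upload API), and the function names and part size are invented:

```python
import os
import multiprocessing


def _upload_part(args):
    """Read one chunk of the file and 'upload' it. The S3 call is
    stubbed out here; with boto this is where the multipart upload
    of the chunk would happen."""
    path, part_num, offset, size = args
    with open(path, 'rb') as f:
        f.seek(offset)
        data = f.read(size)
    # ... boto multipart upload of `data` would go here ...
    return part_num, len(data)


def parallel_upload(path, part_size=50 * 1024 * 1024, cores=4):
    """Split the file at `path` into part_size chunks and hand them
    to a pool of worker processes, one part per job."""
    total = os.path.getsize(path)
    jobs = [(path, i, offset, min(part_size, total - offset))
            for i, offset in enumerate(range(0, total, part_size))]
    pool = multiprocessing.Pool(cores)
    try:
        return pool.map(_upload_part, jobs)
    finally:
        pool.close()
        pool.join()
```

Because each part is an independent job, the upload speed scales (up to bandwidth limits) with the number of cores thrown at it, which is the point Brad’s post makes.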

Next generation sequencing information management and analysis system for Galaxy – Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy. …

CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 …

Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo.
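For flavour, the key/value layout being described looks something like the sketch below, with Python’s stdlib dbm module standing in for Berkeley DB (the read ids, positions and function names are invented):

```python
import dbm
import json


def store_alignments(db_path, alignments):
    """Store the aligned genome positions for each read id,
    one key/value pair per read, with values serialised as JSON."""
    with dbm.open(db_path, 'c') as db:
        for read_id, positions in alignments.items():
            db[read_id] = json.dumps(positions)


def lookup_alignments(db_path, read_id):
    """Retrieve the aligned positions for a single read."""
    with dbm.open(db_path, 'r') as db:
        return json.loads(db[read_id])
```

A network-accessible store like Tokyo Tyrant or MongoDB exposes essentially the same get/put operations, with the document stores adding richer queries over the stored values.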

Python CSV Validator Library

Posted: 21 July 2011 by Alistair Miles in Uncategorized

As part of ongoing data quality-assurance work for MalariaGEN’s P. falciparum Genome Variation project, I’ve written a small Python library called csvvalidator for validating data in CSV files or similar row-oriented data sources.

The source code for csvvalidator is on GitHub, and you can find csvvalidator on the Python Package Index (so you can do easy_install csvvalidator).

Here’s a simple example:


This week I’ve been doing quality assurance work on some data we’re about to send back to partners of the P. falciparum Genome Variation project. These data include some SNP lists – files listing positions in the P. falciparum genome believed to be variable from one parasite to another. To make these files useful, it helps to include genome annotations – information about which gene (if any) can be found at each variable position. Constructing these files means joining a list of variable positions with a set of genome annotations, where each annotation has a start and end position on some chromosome. I.e., for each variable position, find all genome annotations overlapping that position.

Because I need to do this lookup once for each of about a million SNPs, I wanted to know what the most efficient algorithm for doing this type of query would be. It turns out that Interval Trees are the way to go (thanks Lee for discovering this). It also turns out that there is an implementation of interval trees tailored for searching genome annotations in a package called bx-python, which is very handy as I’ve been writing my QA scripts in Python.

On my Ubuntu desktop, installing bx-python is as easy as sudo easy_install bx-python. There are also instructions for installing bx-python manually if you don’t have access to easy_install.

Below is a snippet from one of my QA scripts which uses the IntervalTree class from bx-python and builds a set of interval trees from a GFF3 annotations file.


Background: Clinical Data Curation in MalariaGEN and WWARN

In both MalariaGEN‘s Consortial Projects and WWARN we’ve been involved in aggregating clinical data from different studies and research groups, and a big challenge is dealing with heterogeneity in the source data. There is heterogeneity at multiple levels. We see a variety of file formats. Mostly the data are laid out as columnar tables, but we also see some weird and wonderful layouts. Then there is variety in how the tables are designed – some prefer relatively flat tables with one row per patient, others prefer one row per clinical event, observation or visit. And then there is a lot of diversity in which variables (like temperature, parasitaemia, etc.) have been recorded, how the variables have been named, what units have been used, etc. Finally, top that all off with plenty of subtlety in the semantics of the variables and the data (how was the temperature measured?).

The general approach through this morass is to design a standard schema for the data, with a well-defined set of variables. A transformation is then designed for each of the source datasets, mapping the data onto the standard schema.

The problem we have is that designing a transformation for each of the source datasets is a time-consuming task, requiring expertise on the part of the curator in data transformation techniques as well as lots of knowledge about the domain and experience of different ways of representing the data. These skills don’t often come together in one person. We’ve made various attempts at developing software tools that make designing transformations much easier and less technical, but we certainly don’t have it solved.

The other day I realised what now seems blindingly obvious, which is that SQL and relational views provide a declarative language and tool for designing transformations on columnar tables. This is still not the holy grail of a non-programmer’s tool for designing data transformations, but I thought if I could describe some transformation patterns, along with examples in SQL, that would take us a step in the right direction.

Now, rather than start with the easy stuff like converting temperature in Fahrenheit to temperature in Celsius, or multiplying two columns together, I thought I’d start with the harder cases involving transformations on time series data. Below are a couple of patterns with some SQL. This is not exhaustive by any means, but hopefully it’s an interesting start.
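To give a flavour of the kind of transformation involved, here is a minimal sketch (using SQLite from Python, with invented table and variable names) of a view mapping a one-row-per-observation time series onto a one-row-per-patient standard schema, pulling particular time points out into columns:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    -- source layout: one row per observation (patient, study day, temperature)
    CREATE TABLE observations (
        patient_id INTEGER,
        study_day  INTEGER,
        temp_c     REAL
    );
    INSERT INTO observations VALUES
        (1, 0, 39.5), (1, 3, 37.0),
        (2, 0, 38.2), (2, 3, 38.9);

    -- standard schema: one row per patient, with the baseline (day 0)
    -- temperature and the maximum follow-up temperature as columns;
    -- MAX(CASE ...) ignores the NULLs produced for non-matching rows
    CREATE VIEW standard AS
    SELECT patient_id,
           MAX(CASE WHEN study_day = 0 THEN temp_c END) AS baseline_temp_c,
           MAX(CASE WHEN study_day > 0 THEN temp_c END) AS max_followup_temp_c
    FROM observations
    GROUP BY patient_id;
""")
rows = conn.execute("SELECT * FROM standard ORDER BY patient_id").fetchall()
for row in rows:
    print(row)
# prints (1, 39.5, 37.0) then (2, 38.2, 38.9)
```

The nice property is that the transformation lives entirely in the view definition, so the curator’s work is declarative and inspectable, which is the point about SQL and relational views made above.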


In the world of open source software, licenses like GPL, LGPL, MIT, etc., are generally viewed as a good thing, as they allow the authors of the software to place limited restrictions on the re-use of software according to their preference, whilst still being able to publish the source code. Similarly, for other creative works and open access publishing, the Creative Commons licenses are generally viewed as beneficial, because they allow authors to protect the integrity of their work if desired along with their right to attribution, but not otherwise limit access to or re-use of their work.

So what about scientific data? In MalariaGEN, we are developing policies for “community projects” where partners from independent research institutions around the world submit samples for sequencing. Ultimately, we would like to make all of the data derived from sequencing those samples available to the scientific research community, but we would also like to protect our partners’ investment in collecting those samples by ensuring they are attributed when data are re-used. So, I thought, surely the best way to do this is to publish the data under a CC-like license, right?

It turns out this is not the current consensus. Science Commons have published a Protocol for Implementing Open Access Data, which (in section 5) has a good explanation of why using intellectual property rights (i.e., licenses) to enforce norms of attribution or share-alike is a bad idea. So the protocol states that:

[…] to facilitate data integration and open access data sharing, any implementation of this protocol MUST waive all rights necessary for data extraction and re-use […] and MUST NOT apply any obligations on the user of the data or database such as “copyleft” or “share alike”, or even the legal requirement to provide attribution.

This is consistent with policies adopted by major scientific data publishers like the European Nucleotide Archive (ENA), e.g.:

The INSD will not attach statements to records that restrict access to the data, limit the use of the information in these records, or prohibit certain types of publications based on these records. Specifically, no use restrictions or licensing requirements will be included in any sequence data records, and no restrictions or licensing fees will be placed on the redistribution or use of the database by any party.

However, the Science Commons protocol also says that:

Any implementation SHOULD define a non-legally binding set of citation norms in clear, lay-readable language.

I found a short article about how to mount a TrueCrypt volume from a shell script without showing the password in the process list; see also the comments, which provide some alternatives and clarification.

I use…

echo "$password" | truecrypt -t -k "" --protect-hidden=no /path/to/ /media/truecrypt1

The -t option makes TrueCrypt run in text mode so the password can be piped to its terminal prompt, and the other options (-k "" --protect-hidden=no) prevent TrueCrypt from issuing additional prompts, which would otherwise interfere with piping in the password variable.