
The Anopheles gambiae 1000 genomes project is presenting us with some technical challenges, as genetic diversity within the mosquito populations we are studying is extremely high. Although the A. gambiae reference genome (~250Mb) is an order of magnitude smaller than the human genome, we still discover about 100 million SNPs, of which about half pass a reasonably conservative set of filters, which works out to about 1 good SNP every 5 bases or so.

Doing any kind of exploratory analysis of a dataset of ~100 million SNPs genotyped across ~1000 samples is difficult, and working directly from VCF files is impractical, because of the time it takes to parse. Genotype calls can be represented as two-dimensional arrays of numerical data, and there are a number of well-established and emerging software tools and standards for dealing in a generic way with large multi-dimensional arrays, so we’ve been doing some investigation and trying to leverage this work to speed up our analysis workflow.

In particular, the HDF5 format is well supported, and we’ve got a lot of mileage out of it already. I’ve been working on a package called vcfnp which provides support for converting data from a VCF file first into NumPy arrays, and from there to HDF5. You have to make some choices when loading data into an HDF5 file, in particular what type of compression to use, and how to chunk the data. In order to make an informed decision, I did some benchmarking, looking at performance under a number of access scenarios, comparing different compression options and chunk layouts.

The main finding was that a chunk size of around 128 KB, with a fairly narrow chunk width of around 10, provides a good compromise, giving good read performance under both column-wise and row-wise access patterns. Some other compression options are slightly faster, but gzip performs perfectly acceptably and is more widely supported, so we’ll be sticking with it for now. See the notebook linked above for the gory details.
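As a rough sketch of what that chunk layout means in practice (the dataset name, dtype and h5py call below are illustrative, not taken from our actual files):

```python
# Sketch: pick an HDF5 chunk shape totalling ~128 KB with a narrow
# chunk width of 10 columns (samples).

def chunk_shape(itemsize, width=10, target_bytes=128 * 1024):
    """Return a (rows, cols) chunk shape totalling roughly target_bytes."""
    rows = max(1, target_bytes // (width * itemsize))
    return (rows, width)

# For 1-byte genotype codes this gives chunks of 13107 x 10, i.e. ~128 KB:
shape = chunk_shape(itemsize=1)

# With h5py the dataset would then be created along these lines:
#   h5f.create_dataset('genotype', shape=(n_variants, n_samples),
#                      dtype='i1', chunks=shape, compression='gzip')
```

The narrow width keeps column-wise reads (one sample across all variants) from dragging in too much unwanted data, while the ~128 KB chunk size keeps row-wise reads efficient.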


Anopheles gambiae 1000 genomes project

Posted: 22 October 2014 by Alistair Miles in Uncategorized

Back in June we officially launched the Anopheles gambiae 1000 genomes project, which is a consortial project generating and analysing whole genome sequence data on wild-caught mosquitoes of the species Anopheles gambiae and Anopheles coluzzii, the major vectors of Plasmodium falciparum malaria in Africa.

Along with the initial web page, we also made our first data release. The phase 1 preview release contains genotype data on 103 mosquitoes from Uganda, contributed by Martin Donnelly and David Weetman of the Liverpool School of Tropical Medicine. VCF files are available to download from the Ag1000G public FTP site, and there is also an early version of the Panoptes web application which provides an interactive environment for exploring the data.

The consortium is currently working hard on preparing and analysing the full phase 1 dataset, which comprises 765 samples from 8 countries spanning sub-Saharan Africa. We hope to release at least a beta version of these data before the end of the year; I’ll post here when it’s available.

Patterns of resistance by Joel Winston

Posted: 5 September 2013 by Alistair Miles in Uncategorized

A short video on the problem of anti-malarial drug resistance and the role of genome sequencing in parasite surveillance.

Malaria Drug Resistance and Genomics Animation

Posted: 3 July 2013 by Alistair Miles in Uncategorized

Recently Olivo Miotto and members of the MalariaGEN teams at Oxford and Sanger, in collaboration with teams studying malaria at 10 locations in West Africa and Southeast Asia, published a paper on multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. To present the findings at this week’s Royal Society Summer Science Exhibition the MalariaGEN communications team have put together a short animation, enjoy!

I’ve created a liftover chain file to migrate genomic data from the “version 2” 3D7 reference genome to the newer “version 3” reference genome. You can download the chain file at the link below, as well as a binary for the liftOver program compiled for x86_64:

To check it works, download the above and test.bed to a local directory then run:

chmod +x ./liftOver
./liftOver test.bed 2to3.liftOver test.v3.bed test.v3.unmapped

This should create the file test.v3.bed containing:

Pf3D7_07_v3	403620	403621	crt

Note that this expects chromosome names in the input to be like “Pf3D7_01”. If you’re using chromosome names like “MAL1” you’ll need to convert them before applying the liftover to version 3.
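For example, a minimal sketch of that renaming, assuming the MAL numbers map directly onto the Pf3D7 chromosome numbers (the function name is mine):

```python
import re

def mal_to_pf3d7(chrom):
    """Convert 'MAL1'..'MAL14' to 'Pf3D7_01'..'Pf3D7_14'.

    Assumes the MAL number maps directly to the Pf3D7 chromosome
    number; other sequence names are left untouched.
    """
    m = re.match(r'^MAL(\d+)$', chrom)
    if m is None:
        return chrom
    return 'Pf3D7_%02d' % int(m.group(1))
```

Applied to the first column of your BED file before running liftOver, this should get the input into the expected form.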


Load data from a VCF file into numpy arrays

Posted: 22 February 2013 by Alistair Miles in Uncategorized

I’ve recently been doing some analysis of SNPs and indels from the MalariaGEN P. falciparum genetic crosses project, and have found it convenient to load variant call data from VCF files into numpy arrays to compute summary statistics, make plots, etc.

Attempt 1: vcfarray

I initially wrote a small Python library for loading the arrays based on the excellent PyVCF module. This works well but is a little slow; when I profiled it, the VCF parsing was the bottleneck, so I went in search of a C/C++ library I could use from Cython…

Attempt 2: vcfnp

Erik Garrison’s vcflib library provides a nice C++ API for parsing a VCF file, so I had a go at writing a Cython module based on that. Performance is better: I get roughly a 2-4X speed-up over the PyVCF-based implementation, although I was hoping for an order of magnitude … I guess string parsing is just relatively slow, even in C/C++, and we should be using BCF2.

To install and try vcfnp for yourself, do:

pip install vcfnp

See the vcfnp README for some examples of usage.
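To give a flavour of why this is worth doing: once variant fields are in a NumPy structured array, summary statistics become one-liners. The field names below mirror VCF columns but the values are made up, and this sketch skips the parsing step that vcfnp actually does:

```python
import numpy as np

# A structured array standing in for the output of parsing a VCF:
# one record per variant, one field per VCF column of interest.
variants = np.array(
    [('Pf3D7_01', 1234, 45.0, True),
     ('Pf3D7_01', 1240, 12.5, False),
     ('Pf3D7_02', 5678, 99.0, True)],
    dtype=[('CHROM', 'S12'), ('POS', 'i4'), ('QUAL', 'f4'), ('PASS', '?')]
)

# Column-wise access is just array indexing, so summary statistics
# don't require another pass through the file:
n_pass = np.count_nonzero(variants['PASS'])
mean_qual = variants['QUAL'].mean()
```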

Job – Clinical Data Curator

Posted: 12 March 2012 by Alistair Miles in Jobs, Uncategorized

We’re advertising for a clinical data curator, here’s a snippet from the job ad:

Applications are invited for a MalariaGEN Clinical Data Curator to work in a data-sharing community developing new tools to control malaria by integrating epidemiology with genome science.

You will be a member of the MalariaGEN resource centre and will focus on MalariaGEN consortial data. This data relates to three of our human consortial projects: Genetic determinants of resistance to malaria; Genetic determinants of the immune response to malaria; and Human genome variation in malaria-endemic regions.

…the actual work involves curating large amounts of clinical data relating to cases of severe malaria from Africa. The data originate from different countries, studies and research groups, so they can be quite heterogeneous, and need to be carefully managed, standardised and quality controlled before they can be aggregated and analysed.

Dealing with clinical research data on this scale has been an ongoing challenge for our team for many years, and remains a critical part of realising MalariaGEN‘s studies of human resistance to malaria. If you enjoy dealing with real-world data management problems, have an interest in the life sciences and/or public health, and enjoy working with a diverse community of people from different parts of the world, your application would be very welcome.

Further details are available at the link below:

I just spent a very pleasant afternoon catching up with colleagues at the Image Bioinformatics Research Group, based in the department of Zoology here in Oxford. Here’s a few tidbits I picked up …

Tanya Gray is working on the MIIDI standard (Minimum Information for an Infectious Disease Investigation) and associated tools. She’s done some very nice work on a MIIDI metadata editor, using eXist and Orbeon Forms, with her own additions to generate XForms from an annotated XML Schema. Tanya’s also working on the DryadUK project, a data repository supporting publication of data associated with journal articles.

Stephen Wan (visiting from CSIRO) has developed a cool extension for Firefox (and now Chrome) called IBES (In-Browser Elaborative Summariser). If you point it at Wikipedia, for each link you hover over it shows a summary of the page at that link, built intelligently from the link’s context. Then if you navigate to the link, it tells you where you came from. Very handy if, like me, your every visit to Wikipedia is a rambling journey and you often forget why you went there in the first place. He’s also done some related work to help navigate citations in scholarly articles, called CSIBS (the Citation-Sensitive In-Browser Summarizer).

Alex Dutton is working on the JISC Open Citations project. He has some nice visualisations of citation networks (although one of the articles in that graph looks like it cites itself – if only that were possible :). The graphs are generated using dot from an RDF representation of metadata for the PubMedCentral open-access journal articles. All of the usual dot options are available, so you can play with how the networks get rendered. The whole site is driven by SPARQL, and the bottom of each page shows the SPARQL queries used to generate the page content, so you can see what’s going on under the hood.

Bhavana Ananda is working on the JISC DataFlow project, the DataStage component of which is a follow-on from previous work by Graham Klyne on the Admiral project. I think the philosophy of simple tools to help research groups manage and share their data with each other has a lot of traction, and I think it’s great they’ve got funding to turn the Admiral prototypes into something more.

Graham Klyne is embroiled in the Workflow 4Ever project, and we had a great chat about possible connections with managing our Plasmodium SNP discovery and genotyping pipelines for MalariaGEN. I’m now expecting Graham to solve all my problems.

And David Shotton (group head) is, as always, making it all happen. It was great to raise my head above the trenches for a few hours, I need to do that more often.

I just stumbled upon Brad Chapman’s Blue Collar Bioinformatics blog, it looks like a great resource, here’s a few tidbits…

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. …

Parallel upload to Amazon S3 with python, boto and multiprocessing – One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores. …

Next generation sequencing information management and analysis system for Galaxy – Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy. …

CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 …

Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo.

Python CSV Validator Library

Posted: 21 July 2011 by Alistair Miles in Uncategorized

As part of ongoing data quality-assurance work for MalariaGEN’s P. falciparum Genome Variation project, I’ve written a small Python library called csvvalidator for validating data in CSV files or similar row-oriented data sources.

The source code for csvvalidator is on GitHub, and you can find csvvalidator on the Python Package Index (so you can do easy_install csvvalidator).

Here’s a simple example:
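A minimal sketch of the kind of validation involved, using only the stdlib csv module rather than csvvalidator’s own API (the field names and rules here are invented):

```python
import csv
import io

# Invented sample data: one well-formed row and two problem rows.
data = io.StringIO("study_id,age_years\nS1,34\nS2,-1\nS3,abc\n")

def check_age(value):
    """Age must be an integer in a plausible range."""
    try:
        return 0 <= int(value) <= 120
    except ValueError:
        return False

# Collect (line number, field, offending value) for every bad cell,
# which is the shape of report csvvalidator automates for you.
problems = []
reader = csv.reader(data)
header = next(reader)
for i, row in enumerate(reader, start=2):  # 1-based, counting the header
    record = dict(zip(header, row))
    if not check_age(record['age_years']):
        problems.append((i, 'age_years', record['age_years']))
```

csvvalidator wraps this pattern up so you declare field names and value checks once and get back a list of problems, each with a code, message and row number.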