The Anopheles gambiae 1000 genomes project is presenting us with some technical challenges, as genetic diversity within the mosquito populations we are studying is extremely high. Although the A. gambiae reference genome (~250Mb) is an order of magnitude smaller than the human genome, we still discover about 100 million SNPs, of which about half pass a reasonably conservative set of filters, which works out to about 1 good SNP every 5 bases or so.

Doing any kind of exploratory analysis of a dataset of ~100 million SNPs genotyped across ~1000 samples is difficult, and working directly from VCF files is impractical, because of the time it takes to parse. Genotype calls can be represented as two-dimensional arrays of numerical data, and there are a number of well-established and emerging software tools and standards for dealing in a generic way with large multi-dimensional arrays, so we’ve been doing some investigation and trying to leverage this work to speed up our analysis workflow.

In particular, the HDF5 format is well supported, and we’ve got a lot of mileage out of it already. I’ve been working on a package called vcfnp which provides support for converting data from a VCF file first into NumPy arrays, and from there to HDF5. You have to make some choices when loading data into an HDF5 file, in particular what type of compression to use, and how to chunk the data. In order to make an informed decision, I did some benchmarking, looking at performance under a number of access scenarios, comparing different compression options and chunk layouts.

The main finding was that using a chunk size of around 128kb, and a fairly narrow chunk width of around 10, provides a very good compromise solution, with good read performance under both column-wise and row-wise access patterns. While other compression options are available and are slightly faster, gzip is very acceptable, and is more widely supported, so we’ll be sticking with that for now. See the notebook linked above for the gory details.


Anopheles gambiae 1000 genomes project

Posted: 22 October 2014 by Alistair Miles in Uncategorized

Back in June we officially launched the Anopheles gambiae 1000 genomes project, which is a consortial project generating and analysing whole genome sequence data on wild-caught mosquitoes of the species Anopheles gambiae and Anopheles coluzzii, the major vectors of Plasmodium falciparum malaria in Africa.

Along with the initial web page, we also made our first data release. The phase 1 preview release contains genotype data on 103 mosquitoes from Uganda, contributed by Martin Donnelly and David Weetman of the Liverpool School of Tropical Medicine. VCF files are available to download from the Ag1000G public FTP site, and there is also an early version of the Panoptes web application which provides an interactive environment for exploring the data.

The consortium is currently working hard on preparing and analysing the full phase 1 dataset, which comprises 765 samples from 8 countries spanning sub-Saharan Africa. We hope to release at least a beta version of these data before the end of the year, I’ll post here when it’s available.

Bioinformatics jobs

Posted: 31 October 2013 by Alistair Miles in Jobs

Join the MalariaGEN team! We’re currently recruiting bioinformatics positions, see the MalariaGEN jobs web page for further details and how to apply. The closing date for applications is 4 November.

We’re primarily looking for bioinformaticians to join the methods development team, which works on evaluating methodologies for processing next-generation sequence data and analysing genetic variation. We are currently working with deep sequence data for approaching 3,000 Plasmodium samples and over 1,000 Anopheles samples, and a human resequencing project is just getting underway. So we are up to our eyeballs in data, and need people who have a keen eye for sifting the signal from the noise.

If you have any questions about the roles, please feel free to contact me.

A short video on the problem of anti-malarial drug resistance and the role of genome sequencing in parasite surveillance.

Video  —  Posted: 5 September 2013 by Alistair Miles in Uncategorized

Recently Olivo Miotto and members of the MalariaGEN teams at Oxford and Sanger, in collaboration with teams studying malaria at 10 locations in West Africa and Southeast Asia, published a paper on multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. To present the findings at this week’s Royal Society Summer Science Exhibition the MalariaGEN communications team have put together a short animation, enjoy!

Video  —  Posted: 3 July 2013 by Alistair Miles in Uncategorized

I’ve created a liftover chain file to migrate genomic data from the “version 2” 3D7 reference genome to the newer “version 3” reference genome. You can download the chain file at the link below, as well as a binary for the liftOver program compiled for x86_64:

To check it works, download the above and test.bed to a local directory then run:

chmod +x ./liftOver
./liftOver test.bed 2to3.liftOver test.v3.bed test.v3.unmapped

This should create the file test.v3.bed containing:

Pf3D7_07_v3	403620	403621	crt

Note that this expects chromosome names in the input to be like “Pf3D7_01”. If you’re using chromosome names like “MAL1” you’ll need to convert those first prior to applying the liftover to version 3.

Read the rest of this entry »

Load data from a VCF file into numpy arrays

Posted: 22 February 2013 by Alistair Miles in Uncategorized

I’ve recently been doing some analysis of SNPs and indels from the MalariaGEN P. falciparum genetic crosses project, and have found it convenient to load variant call data from VCF files into numpy arrays to compute summary statistics, make plots, etc.

Attempt 1: vcfarray

I initially wrote a small Python library for loading the arrays based on the excellent PyVCF module. This works well but is a little slow, and when I profiled it it was the VCF parsing that was the bottleneck, so I went in search of a C/C++ library I could use from Cython…

Attempt 2: vcfnp

Erik Garrison’s vcflib library provides a nice C++ API for parsing a VCF file, so I had a go at writing a Cython module based on that. Performance is better, I get roughly 2-4X speed-up over the PyVCF-based implementation, although I was hoping for an order of magnitude … I guess it’s just the case that string parsing is relatively slow, even in C/C++, and we should be using BCF2.

To install and try vcfnp for yourself, do:

pip install vcfnp

See the vcfnp README for some examples of usage.

Visualisation software round-up

A common need that we have is to directly view, or interpretively visualise information (both numeric and categoric) that is attached to a particular point on genomic sequence, often in relation to some attributes of that sequence. The number of file formats and tools that have been written for doing this surprised me when I first looked. This post is the first step in looking at what is out there. For this purpose I’m limiting myself to looking at tools that read the popular VCF format (Not to be confused with v-Card). This will only scratch the surface – a quick look at this list shows there is more bioinformatics software than you can shake a double helix at.

IGV – ‘Integrative Genomics Veiwer’

IGV is a java app, that loads a veritable kitchen sink of formats. It is integrated with the Java Web Start system that allows launching of a Java app with ‘one-click’ from your web browser. Files can be loaded from disk or over http/ftp/DAS or from a curated set that the app has metadata for. The state of the app on startup can be specified by command line args or XML config. Combined with a php script that cooks up custom Web Start files one can actually link to a specific view on a specific dataset (if that data is public). This gives web-app style linking, albeit with a bit of a wait and no access control.

I’m only interested at the moment in using IGV to look at SNP data from VCF files – it does much more than this, for example reads from BAM files. Before loading you need to pre-process the VCF to create an index using igvtools which is accessed from the ‘File’ menu in IGV. Indexing our VCF files originally failed – IGV complained that they did not comply with the the VCF4.0 spec as they had whitespace in the INFO field. I confirmed this with VCF tools – in fact the error message from IGV was more instructive as it had the line number of the problem. This is defiantly the fault of our systems and something I hope we can eradicate through better automated testing and persuading people that sticking to standards is in their best interest. For now I just fixed this by truncating the file before the problem.

Firstly one picks the reference genome onto which the VCF file will be mapped. IGV comes with quite a selection available in its curated set, or one can be loaded. The VCF will need the same chromosome names or it will not map. For example I had to pick an old Plasmodium reference as our VCF had the old ‘MAL1’ style chromosome names. IGV is very flexible in what it will load – one could add extra columns to the VCF and have them displayed along side.

IGV InterfaceOnce loaded one is presented with a layout with base as the X-axis and sample as the Y, you can drag around and use the arrow keys to move left right or use the stylised scroll bar at the top. I couldn’t use the mouse wheel to zoom 😦 but you can use Ctrl-+.  The app keeps a record of your locations which you can navigate with the forward and back buttons. You can skip to a point or gene label using the search box at the top – this auto-completes and will give you a list to pick from if it gets more than one match. I managed to make the app hang by searching for a single letter though.

IGV Pop-upHovering over any point gives information about that point in a window that disappears as soon as you move away – making you feel like you’re playing some kind of steady hand game. This also means that you can’t cut and paste that info out of it. There is a toolbar button that replaces the pop-up with a separate window with text, but I couldn’t copy out of that either. Right-clicking brings up a context menu that lets you sort by the selection or change how it displayed, for example switching between colouring for allele or genotype. As far as I can see the display is always relative to the reference genome. Although you can mark regions of interest you can’t pick a set of SNP positions and then just view those without the intervening bases, or order them by any criteria but genomic position. Above the individual samples is a summary section which for each position shows a small bar which is coloured in proportion the samples’ genotype distribution.

The code is on github (yay!) and appears to be under active development. In summary IGV is a flexible tool for viewing data, but does not offer any tools specifically for exploring variation through SNPs as in our use case.

VARiation Browser

VARB is a C++/QT app that only views VCF files. It is distributed as a binary but with shared linking to QT so I had to ‘apt-get libqt’ before it would start. The source is distributed as a zip file so I can’t tell if it is under active development or submit changes as anything but a patchfile. VARB loads requires three files, a reference in FASTA format, an annotation file in GFF format and finally the VCF. I used the FASTA from here and the GFF from the VARB example files. In loading our malformed VCF VARB also failed but did not provide any clue beyond saying that the file was malformed.

VARB offers the same kind of navigation as IGV, again no mouse-wheel and strangely zooming is relative to the left edge. SNPs can disappear and re-appear as one zooms as the rasterisation algorithm doesn’t cope with sparse SNPs on zoomed out regions. The controls and drawing appear to run in the same thread which makes navigation hard. There is an annotation search, but with no complete. The selection tool was much more useful however with the details coming up in the sidebar and easily copied as clicking makes the details stick in the window until cleared.

As well as the information from the VCF VARB adds some analytical output at the bottom of the window. This is fixed to the GC density, Relative variant density, Fst and Tajima’s D, these are updated as one changes the quality, depth and SNP type filters on the left. The windows used for calculating these are fixed and zoom independent. Samples can be grouped, and this grouping is used for the Fst calculation – although I’m not sure how it works out Fst for more than one group. As in IGV there is no way to view the SNPs or samples in any way but sequence order and with separation. The colours can be re-assigned – I found that setting the reference allele colour to white let me see the variation much more clearly. With a few tweaks VARB could be a very use-able SNP browser.


BAMSeek isn’t so much a visualisation tool as it is a file inspection tool. It is distributed as a JAR file with source on Google Code. It supports quite a few formats and is primarily designed for loading large files as it indexes, and then pages, the file as needed.  Anyone who has used a normal text editor will know the pain of large files (I have found Sublime Text handles them well though after a slightly long loading).  BAMSeek successfully loaded our off-spec VCF file – probably as does not fully parse it in order to display its textual content. The VCF file is simply displayed in a table with the header in a separate section. The paging is done by having actual pages that you flip through with a control on the bottom. The line numbers on the left are relative to the page – which is a little frustrating as to get the actual line number you have to do ((page-1)*(rows_per_page)+line_no) in your head. Hovering over a cell gives you the information formatted vertically. There’s not much more to it than that!

Next time we’ll look at some web-based apps that do a similar job.

These days, virtualisation is all the rage. The various competing virtualisation products have reached a level of maturity where they can be reliably used for server consolidation. VirtualBox is one of the easiest to use, most featureful programs available in this space and with the ability to run on many different OSes on hardware with or without VM extensions, it is also one of the most popular. However, there is one wrinkle when it comes to using it for server consolidation: the proprietary RDP/USB2 extension pack.

The conventional wisdom when running a headless server with VirtualBox is that you need to install this proprietary extension pack from Oracle. This is fine until you want to use the server in production: as the PUEL only covers you for personal use and evaluation, you must purchase licenses. You can either pay £34 per user or £670 per “socket” (which has quite a convoluted definition). This gets you USB2 and RDP support.

However, there is another way, at least when it comes to RDP support. Read the rest of this entry »

Update 2012-04-25: mysql workbench has now appeared in the universe package archive. You should be able to install it with a simple:

sudo apt-get install mysql-workbench

Read on if you still want to compile from source.

Right now (2012-04-04), Ubuntu 12.04 hasn’t been released yet, and so there is no binary package from Oracle of MySQL Workbench for Precise. I managed to get the MySQL Workbench binaries for Oneiric to run, by manually installing libzip1_0.9.3-1_amd64.deb from Oneiric, but this wasn’t stable (crashed as soon as I tried to run a SQL Query).

So I decided to build from source. Here’s how I did it Read the rest of this entry »