Author Archive

The Anopheles gambiae 1000 genomes project is presenting us with some technical challenges, as genetic diversity within the mosquito populations we are studying is extremely high. Although the A. gambiae reference genome (~250Mb) is an order of magnitude smaller than the human genome, we still discover about 100 million SNPs, of which about half pass a reasonably conservative set of filters. That works out to roughly one good SNP every 5 bases.
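As a quick sanity check on that density figure, the arithmetic can be spelled out in a few lines of Python. This uses only the rounded numbers quoted above, so the result is a back-of-envelope estimate and not a measured value.

```python
# Back-of-envelope check of the SNP density, using the rounded figures from the text.
genome_size_bp = 250_000_000    # ~250 Mb reference genome
snps_discovered = 100_000_000   # ~100 million SNPs discovered
pass_fraction = 0.5             # "about half" pass the filters

snps_passing = snps_discovered * pass_fraction
bases_per_snp = genome_size_bp / snps_passing

print(f"{snps_passing:.0f} passing SNPs, about one every {bases_per_snp:.0f} bases")
```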

Doing any kind of exploratory analysis of a dataset of ~100 million SNPs genotyped across ~1000 samples is difficult, and working directly from VCF files is impractical because of the time it takes to parse them. Genotype calls can be represented as two-dimensional arrays of numerical data, and there are a number of well-established and emerging software tools and standards for dealing generically with large multi-dimensional arrays. We’ve been investigating these and trying to leverage this work to speed up our analysis workflow.
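To make the "genotype calls as a two-dimensional array" idea concrete, here is a minimal NumPy sketch with toy data. The layout (variants as rows, samples as columns) and the encoding (count of non-reference alleles, with -1 for a missing call) are assumptions made for illustration, not necessarily the representation used in the project.

```python
import numpy as np

# Hypothetical toy data: 4 variants (rows) x 3 samples (columns).
# Assumed encoding: number of non-reference alleles (0, 1 or 2); -1 means missing.
genotypes = np.array(
    [[0, 1, 2],
     [0, 0, 1],
     [-1, 2, 2],
     [1, 1, 0]],
    dtype=np.int8,
)

# Boolean mask of calls that are not missing.
called = genotypes >= 0

# Row-wise summary: non-reference allele count per variant, ignoring missing calls.
alt_count_per_variant = np.where(called, genotypes, 0).sum(axis=1)

# Column-wise summary: number of missing calls per sample.
missing_per_sample = (~called).sum(axis=0)

print(alt_count_per_variant.tolist())
print(missing_per_sample.tolist())
```

The two summaries correspond to the row-wise and column-wise access patterns that matter when choosing how to store such an array on disk.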

In particular, the HDF5 format is well supported, and we’ve got a lot of mileage out of it already. I’ve been working on a package called vcfnp which provides support for converting data from a VCF file first into NumPy arrays, and from there to HDF5. You have to make some choices when loading data into an HDF5 file, in particular what type of compression to use, and how to chunk the data. In order to make an informed decision, I did some benchmarking, looking at performance under a number of access scenarios, comparing different compression options and chunk layouts.

The main finding was that using a chunk size of around 128kb, and a fairly narrow chunk width of around 10, provides a very good compromise solution, with good read performance under both column-wise and row-wise access patterns. While other compression options are available and are slightly faster, gzip is very acceptable, and is more widely supported, so we’ll be sticking with that for now. See the notebook linked above for the gory details.
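As a rough illustration of what those numbers mean for a chunk layout, the sketch below converts a target chunk size and chunk width into a (rows, columns) chunk shape. It rests on several assumptions: "128kb" is read as 128 KiB, the "width" is the number of columns (samples) per chunk, and each genotype call is stored as a single byte. The helper name `chunk_shape` is made up for this example.

```python
def chunk_shape(chunk_bytes, chunk_width, itemsize):
    """Return (rows, cols) for a chunk holding roughly chunk_bytes bytes.

    chunk_width is the number of columns per chunk; itemsize is bytes per element.
    """
    rows = max(1, chunk_bytes // (chunk_width * itemsize))
    return rows, chunk_width


# Assumptions: 128 KiB chunks, 10 columns per chunk, one byte per call (e.g. int8).
shape = chunk_shape(chunk_bytes=128 * 1024, chunk_width=10, itemsize=1)
print(shape)
```

A tuple like this could then be passed as the `chunks` argument when creating a dataset with h5py, together with `compression='gzip'`. The benchmarking notebook remains the reference for the settings actually tested.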


Anopheles gambiae 1000 genomes project

Posted: 22 October 2014 by Alistair Miles in Uncategorized

Back in June we officially launched the Anopheles gambiae 1000 genomes project, which is a consortial project generating and analysing whole genome sequence data on wild-caught mosquitoes of the species Anopheles gambiae and Anopheles coluzzii, the major vectors of Plasmodium falciparum malaria in Africa.

Along with the initial web page, we also made our first data release. The phase 1 preview release contains genotype data on 103 mosquitoes from Uganda, contributed by Martin Donnelly and David Weetman of the Liverpool School of Tropical Medicine. VCF files are available to download from the Ag1000G public FTP site, and there is also an early version of the Panoptes web application which provides an interactive environment for exploring the data.

The consortium is currently working hard on preparing and analysing the full phase 1 dataset, which comprises 765 samples from 8 countries spanning sub-Saharan Africa. We hope to release at least a beta version of these data before the end of the year; I’ll post here when it’s available.

Bioinformatics jobs

Posted: 31 October 2013 by Alistair Miles in Jobs

Join the MalariaGEN team! We’re currently recruiting for bioinformatics positions; see the MalariaGEN jobs web page for further details and how to apply. The closing date for applications is 4 November.

We’re primarily looking for bioinformaticians to join the methods development team, which works on evaluating methodologies for processing next-generation sequence data and analysing genetic variation. We are currently working with deep sequence data for nearly 3,000 Plasmodium samples and over 1,000 Anopheles samples, and a human resequencing project is just getting underway. So we are up to our eyeballs in data, and need people who have a keen eye for sifting the signal from the noise.

If you have any questions about the roles, please feel free to contact me.

Patterns of resistance by Joel Winston

Posted: 5 September 2013 by Alistair Miles in Uncategorized

A short video on the problem of anti-malarial drug resistance and the role of genome sequencing in parasite surveillance.

Malaria Drug Resistance and Genomics Animation

Posted: 3 July 2013 by Alistair Miles in Uncategorized

Recently Olivo Miotto and members of the MalariaGEN teams at Oxford and Sanger, in collaboration with teams studying malaria at 10 locations in West Africa and Southeast Asia, published a paper on multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. To present the findings at this week’s Royal Society Summer Science Exhibition, the MalariaGEN communications team have put together a short animation. Enjoy!

I’ve created a liftover chain file to migrate genomic data from the “version 2” 3D7 reference genome to the newer “version 3” reference genome. You can download the chain file at the link below, as well as a binary for the liftOver program compiled for x86_64:

To check it works, download the above and test.bed to a local directory then run:

chmod +x ./liftOver
./liftOver test.bed 2to3.liftOver test.v3.bed test.v3.unmapped

This should create the file test.v3.bed containing:

Pf3D7_07_v3	403620	403621	crt

Note that this expects chromosome names in the input to be like “Pf3D7_01”. If you’re using chromosome names like “MAL1”, you’ll need to convert those first, before applying the liftover to version 3.
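If your BED file does use “MAL”-style names, a small script can rewrite the first column before running liftOver. The sketch below assumes a simple mapping in which “MALn” becomes “Pf3D7_” followed by the zero-padded chromosome number (so “MAL1” becomes “Pf3D7_01”). That mapping, the function names and the example coordinates are all illustrative assumptions, so check them against your own data before relying on it.

```python
import re


def mal_to_pf3d7(name):
    """Map e.g. 'MAL7' to 'Pf3D7_07'; return any other name unchanged.

    The MALn -> Pf3D7_nn mapping is an assumption made for this example.
    """
    match = re.fullmatch(r"MAL(\d+)", name)
    if match is None:
        return name
    return f"Pf3D7_{int(match.group(1)):02d}"


def convert_bed_line(line):
    """Rewrite the chromosome name in the first column of a tab-delimited BED line."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = mal_to_pf3d7(fields[0])
    return "\t".join(fields)


# Hypothetical BED record; the coordinates are made up.
print(convert_bed_line("MAL7\t100\t101\tfeature"))
```

Names that are already in the “Pf3D7_01” style pass through untouched, so the conversion is safe to apply to a file that mixes both styles.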


Load data from a VCF file into numpy arrays

Posted: 22 February 2013 by Alistair Miles in Uncategorized

I’ve recently been doing some analysis of SNPs and indels from the MalariaGEN P. falciparum genetic crosses project, and have found it convenient to load variant call data from VCF files into numpy arrays to compute summary statistics, make plots, etc.

Attempt 1: vcfarray

I initially wrote a small Python library for loading the arrays based on the excellent PyVCF module. This works well but is a little slow, and profiling showed that the VCF parsing was the bottleneck, so I went in search of a C/C++ library I could use from Cython…

Attempt 2: vcfnp

Erik Garrison’s vcflib library provides a nice C++ API for parsing a VCF file, so I had a go at writing a Cython module based on that. Performance is better: I get roughly a 2-4X speed-up over the PyVCF-based implementation, although I was hoping for an order of magnitude … I guess it’s just the case that string parsing is relatively slow, even in C/C++, and we should be using BCF2.

To install and try vcfnp for yourself, do:

pip install vcfnp

See the vcfnp README for some examples of usage.

We use Sun Grid Engine here at WTCHG for managing our compute resources. Many of the analyses I’m doing are best run as array jobs, which generally works very well, but sometimes one or more tasks will fail for one reason or another, and I’ve been casting around for best practice when it comes to (a) verifying which tasks succeeded and which failed, and (b) re-running failed tasks.

I found a nice post by Shiran Pasternak on resubmitting failed SGE array tasks. However, Shiran doesn’t say how he determined which tasks had failed, and the set of tasks to rerun is specified manually. I have thousands of tasks in each array job, so I really need an automated way of determining the success or failure of each task and rerunning those that failed.

I came up with the following pattern.


Jobs – Bioinformatics

Posted: 23 March 2012 by Alistair Miles in Jobs

We’re advertising bioinformatics jobs at both Oxford and Sanger (near Cambridge); see the following links for job descriptions and information on how to apply:

Here’s an excerpt from the job description:

Overview of role

All MalariaGEN projects working on parasite and vector biology depend on next-generation sequencing. Over 2,000 samples of parasite DNA have been sequenced, and at least 10,000 samples will have been sequenced by 2015. Genome sequencing has been carried out on approximately 200 Anopheles samples to date, and the aim is to sequence approximately 2,500 individuals over the next 4 years. Most parasite samples have been extracted directly from infected blood samples, and so present additional complexities such as small quantities of DNA and mixed infection.

Raw next-generation sequence data is the beginning of a complex and intellectually demanding analysis process. The primary goal is to discover robust evidence for genetic variation. However, building from raw sequence data to robust variation data is and will continue to be one of the most significant challenges facing the malaria research community over coming years. Working to iteratively improve the quality of our genetic variation data and reach deeper into the Plasmodium and Anopheles genomes is the main focus of the MalariaGEN Bioinformatician roles.

This is an extremely fast-paced area of current research and development, and new methods and tools are emerging from many leading research groups and projects, many of whom we have close contacts with. However, we have to strike a balance between looking to the future, and delivering data to MalariaGEN partners that might not be perfect or complete but which nevertheless provides a highly valuable research tool for a range of studies, such as genotype-phenotype association studies, and studies of parasite and vector population structure and dynamics.

To achieve this balance between methods development on the one hand, and production of data on the other, our bioinformatics programme is organised around two working groups. The methods development group is focused on the development, exploration and thorough evaluation of new methods, including methods for sequence alignment, variation calling and genotyping, working closely with statisticians. The production group is focused on establishing tightly specified data analysis pipelines and using them to produce high quality variation data in a reproducible way. Both working groups work to a quarterly data release cycle, where the methods development looks ahead to the next release and determines the best available methods, which are then adopted and implemented by production.

While this role may focus more on methods development or production at different times, we encourage participation in both working groups, as there are important insights that can only be gained by working across both.

Jobs – Scientific Software Engineering

Posted: 23 March 2012 by Alistair Miles in Jobs

We’re advertising for software engineers. The job title says “scientific”, but no previous experience of scientific programming is required; applications are very welcome from anyone with a strong software engineering background and an interest in the life sciences and/or public health.

We’re advertising at both Sanger (near Cambridge) and Oxford; see the following links for job descriptions and information on how to apply:

Here’s a snippet from the job ad:

MalariaGEN aims to produce global data on natural genetic variation in parasite, mosquito and human populations, and to deliver these data via the MalariaGEN website, alongside web tools which add value by enabling people to explore, understand and analyse the data. Some of these web and data products are intended for unrestricted use by the malaria research and public health community, to inform future research directions and malaria control policy. Other web and data products are being developed for private use by researchers contributing to MalariaGEN community projects, and provide a key incentive to participation in MalariaGEN, e.g., secure web tools providing access to fine-grained genetic data on individual samples.

We have a unique opportunity for Software Engineers to take a key role in the development and implementation of software projects relating to MalariaGEN web and data products.

The job description has a bit more background:

MalariaGEN continues to present many challenges that require development of new software applications. These include:

  • Web applications to present and visualise complex data

  • Software for data analysis and analysis pipelines, typically compute-intensive involving terabytes of data

  • Laboratory information management systems (LIMS) to keep track of samples, data and high-throughput experiments

  • Business and collaboration systems to administrate and coordinate a complex global research network, and to enable partners from different institutions to share information and effectively work together

The Web continues to be our primary platform for delivering software applications, and we have specialist expertise in Web application development and Web standards within the team. However we also develop other types of application as the problem requires.

Members of the software engineering team are equally capable of working across all stages of the software project life cycle, from requirements analysis and design through to implementation and testing, and we support the development of skills and experience across these different areas.

All software we develop is or will be released under an open source license. We also make use of existing open source software where possible and actively contribute to a number of open source projects. An interest in open source software and previous experience of participation in open source projects is an advantage.

We are working in a fast-moving area of scientific research, and we are constantly having to innovate. However, we also have a strong focus on the scientific robustness of the products delivered by MalariaGEN, and value a dedication to quality and sound engineering practices.