Archive for February, 2013

I’ve created a liftover chain file to migrate genomic data from the “version 2” 3D7 reference genome to the newer “version 3” reference genome. You can download the chain file at the link below, as well as a binary for the liftOver program compiled for x86_64:

To check it works, download the above and test.bed to a local directory then run:

chmod +x ./liftOver
./liftOver test.bed 2to3.liftOver test.v3.bed test.v3.unmapped

This should create the file test.v3.bed containing:

Pf3D7_07_v3	403620	403621	crt

Note that this expects chromosome names in the input to be like “Pf3D7_01”. If you’re using chromosome names like “MAL1” you’ll need to convert those first prior to applying the liftover to version 3.

(more…)

Advertisements

Load data from a VCF file into numpy arrays

Posted: 22 February 2013 by Alistair Miles in Uncategorized

I’ve recently been doing some analysis of SNPs and indels from the MalariaGEN P. falciparum genetic crosses project, and have found it convenient to load variant call data from VCF files into numpy arrays to compute summary statistics, make plots, etc.

Attempt 1: vcfarray

I initially wrote a small Python library for loading the arrays based on the excellent PyVCF module. This works well but is a little slow, and when I profiled it it was the VCF parsing that was the bottleneck, so I went in search of a C/C++ library I could use from Cython…

Attempt 2: vcfnp

Erik Garrison’s vcflib library provides a nice C++ API for parsing a VCF file, so I had a go at writing a Cython module based on that. Performance is better, I get roughly 2-4X speed-up over the PyVCF-based implementation, although I was hoping for an order of magnitude … I guess it’s just the case that string parsing is relatively slow, even in C/C++, and we should be using BCF2.

To install and try vcfnp for yourself, do:

pip install vcfnp

See the vcfnp README for some examples of usage.