Benchmarking HDF5 for storing and accessing large arrays of genotype data

Posted: 22 October 2014 by Alistair Miles in Uncategorized

The Anopheles gambiae 1000 genomes project is presenting us with some technical challenges, as genetic diversity within the mosquito populations we are studying is extremely high. Although the A. gambiae reference genome (~250Mb) is an order of magnitude smaller than the human genome, we still discover about 100 million SNPs, of which about half pass a reasonably conservative set of filters, which works out to about 1 good SNP every 5 bases or so.

Doing any kind of exploratory analysis of a dataset of ~100 million SNPs genotyped across ~1000 samples is difficult, and working directly from VCF files is impractical, because of the time it takes to parse. Genotype calls can be represented as two-dimensional arrays of numerical data, and there are a number of well-established and emerging software tools and standards for dealing in a generic way with large multi-dimensional arrays, so we’ve been doing some investigation and trying to leverage this work to speed up our analysis workflow.

In particular, the HDF5 format is well supported, and we’ve got a lot of mileage out of it already. I’ve been working on a package called vcfnp which provides support for converting data from a VCF file first into NumPy arrays, and from there to HDF5. You have to make some choices when loading data into an HDF5 file, in particular what type of compression to use, and how to chunk the data. In order to make an informed decision, I did some benchmarking, looking at performance under a number of access scenarios, comparing different compression options and chunk layouts.

The main finding was that using a chunk size of around 128kb, and a fairly narrow chunk width of around 10, provides a very good compromise solution, with good read performance under both column-wise and row-wise access patterns. While other compression options are available and are slightly faster, gzip is very acceptable, and is more widely supported, so we’ll be sticking with that for now. See the notebook linked above for the gory details.

Advertisement
Comments
  1. roven rommel fuentes says:

    SNP Seek database for 3000 Rice Genomes uses HDF5 as storage for filtered SNPs 20M x 3k samples. It’s interesting that you have the benchmark for genotype data retrieval. I’m interested on the data model you used and your experience on using HDF5 for large dataset. BTW, we also are developing tool to load full VCF file in HDF5.
    tnx

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s