We use Sun Grid Engine here at WTCHG for managing our compute resources. Many of the analyses I’m doing are best run as array jobs, which generally works very well, but sometimes one or more tasks will fail for one reason or another, and I’ve been casting around for best practice when it comes to (a) verifying which tasks succeeded and which failed, and (b) re-running failed tasks.

I found a nice post by Shiran Pasternak on resubmitting failed SGE array tasks. However, Shiran doesn’t say how he determined which tasks had failed, and the set of tasks to rerun is specified manually. I have thousands of tasks in each array job, so I really need an automated way of determining the success or failure of each task and rerunning those that failed.
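As a rough illustration of what automating this could look like (not necessarily the pattern described in the full entry), suppose each task creates an empty marker file when it finishes successfully. The marker naming scheme task.<id>.done and the helper names below are assumptions made purely for this sketch.

```python
import os
import re

# Hypothetical convention: a task creates "task.<id>.done" when it succeeds.
MARKER_PATTERN = re.compile(r"^task\.(\d+)\.done$")


def find_failed_tasks(marker_dir, n_tasks):
    """Return the sorted task IDs in 1..n_tasks with no success marker in marker_dir."""
    succeeded = set()
    for name in os.listdir(marker_dir):
        match = MARKER_PATTERN.match(name)
        if match:
            succeeded.add(int(match.group(1)))
    return [task_id for task_id in range(1, n_tasks + 1) if task_id not in succeeded]


def to_ranges(task_ids):
    """Collapse sorted task IDs into 'start-end' strings covering consecutive runs."""
    ranges = []
    start = prev = None
    for task_id in task_ids:
        if start is None:
            # First ID opens the first run.
            start = prev = task_id
        elif task_id == prev + 1:
            # Still consecutive, so extend the current run.
            prev = task_id
        else:
            # Gap found: close the current run and open a new one.
            ranges.append(f"{start}-{prev}")
            start = prev = task_id
    if start is not None:
        ranges.append(f"{start}-{prev}")
    return ranges
```

Each resulting string is in the n-m form taken by the -t option of qsub, so the failed tasks could in principle be resubmitted one range at a time. A marker file only shows that a task reached its final step; checking exit codes or output contents would be a stricter test.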

I came up with the following pattern.

Read the rest of this entry »

Jobs – Bioinformatics

Posted: 23 March 2012 by Alistair Miles in Jobs

We’re advertising bioinformatics jobs at both Oxford and Sanger (near Cambridge). See the following links for job descriptions and information on how to apply:

Here’s an excerpt from the job description:

Overview of role

All MalariaGEN projects working on parasite and vector biology depend on next-generation sequencing. Over 2,000 samples of parasite DNA have been sequenced, and at least 10,000 samples will have been sequenced by 2015. Genome sequencing has been carried out on approximately 200 Anopheles samples to date, and the aim is to sequence approximately 2,500 individuals over the next 4 years. Most parasite samples have been extracted directly from infected blood samples, and so present additional complexities such as small quantities of DNA and mixed infection.

Raw next-generation sequence data is the beginning of a complex and intellectually demanding analysis process. The primary goal is to discover robust evidence for genetic variation. However, building from raw sequence data to robust variation data is and will continue to be one of the most significant challenges facing the malaria research community over coming years. Working to iteratively improve the quality of our genetic variation data and reach deeper into the Plasmodium and Anopheles genomes is the main focus of the MalariaGEN Bioinformatician roles.

This is an extremely fast-paced area of current research and development, and new methods and tools are emerging from many leading research groups and projects, many of whom we have close contacts with. However, we have to strike a balance between looking to the future, and delivering data to MalariaGEN partners that might not be perfect or complete but which nevertheless provides a highly valuable research tool for a range of studies, such as genotype-phenotype association studies, and studies of parasite and vector population structure and dynamics.

To achieve this balance between methods development on the one hand, and production of data on the other, our bioinformatics programme is organised around two working groups. The methods development group is focused on the development, exploration and thorough evaluation of new methods, including methods for sequence alignment, variation calling and genotyping, working closely with statisticians. The production group is focused on establishing tightly specified data analysis pipelines and using them to produce high quality variation data in a reproducible way. Both working groups work to a quarterly data release cycle, where the methods development looks ahead to the next release and determines the best available methods, which are then adopted and implemented by production.

While this role may focus more on methods development or production at different times, we encourage participation in both working groups, as there are important insights that can only be gained by working across both.

Jobs – Scientific Software Engineering

Posted: 23 March 2012 by Alistair Miles in Jobs

We’re advertising for software engineers. The job title says “scientific”, but no previous experience of scientific programming is required; applications are very welcome from anyone with a strong software engineering background and an interest in the life sciences and/or public health.

We’re advertising at both Sanger (near Cambridge) and Oxford. See the following links for job descriptions and information on how to apply:

Here’s a snippet from the job ad:

MalariaGEN aims to produce global data on natural genetic variation in parasite, mosquito and human populations, and to deliver these data via the MalariaGEN website, alongside web tools which add value by enabling people to explore, understand and analyse the data. Some of these web and data products are intended for unrestricted use by the malaria research and public health community, to inform future research directions and malaria control policy. Other web and data products are being developed for private use by researchers contributing to MalariaGEN community projects, and provide a key incentive to participation in MalariaGEN, e.g., secure web tools providing access to fine-grained genetic data on individual samples.

We have a unique opportunity for Software Engineers to take a key role in the development and implementation of software projects relating to MalariaGEN web and data products.

The job description has a bit more background:

MalariaGEN continues to present many challenges that require development of new software applications. These include:

  • Web applications to present and visualise complex data

  • Software for data analysis and analysis pipelines, typically compute-intensive involving terabytes of data

  • Laboratory information management systems (LIMS) to keep track of samples, data and high-throughput experiments

  • Business and collaboration systems to administrate and coordinate a complex global research network, and to enable partners from different institutions to share information and effectively work together

The Web continues to be our primary platform for delivering software applications, and we have specialist expertise in Web application development and Web standards within the team. However we also develop other types of application as the problem requires.

Members of the software engineering team are equally capable of working across all stages of the software project life cycle, from requirements analysis and design through to implementation and testing, and we support the development of skills and experience across these different areas.

All software we develop is or will be released under an open source license. We also make use of existing open source software where possible and actively contribute to a number of open source projects. An interest in open source software and previous experience of participation in open source projects is an advantage.

We are working in a fast-moving area of scientific research, and we are constantly having to innovate. However, we also have a strong focus on the scientific robustness of the products delivered by MalariaGEN, and value a dedication to quality and sound engineering practices.

 

Job – Clinical Data Curator

Posted: 12 March 2012 by Alistair Miles in Jobs, Uncategorized

We’re advertising for a clinical data curator. Here’s a snippet from the job ad:

Applications are invited for a MalariaGEN Clinical Data Curator to work in a data-sharing community developing new tools to control malaria by integrating epidemiology with genome science.

You will be a member of the MalariaGEN resource centre and will focus on MalariaGEN consortial data. This data relates to three of our human consortial projects: Genetic determinants of resistance to malaria; Genetic determinants of the immune response to malaria; and Human genome variation in malaria-endemic regions.

…the actual work involves curating large amounts of clinical data relating to cases of severe malaria from Africa. The data originate from different countries, studies, research groups, … basically the data can be quite heterogeneous, and needs to be carefully managed, standardised and quality controlled to enable the data to be aggregated then analysed.

Dealing with clinical research data on this scale has been an ongoing challenge for our team for many years, and remains a critical part of realising MalariaGEN’s studies of human resistance to malaria. If you enjoy dealing with real-world data management problems, have an interest in the life sciences and/or public health, and enjoy working with a diverse community of people from different parts of the world, your application would be very welcome.

Further details are available at the link below:

Job – Scientific Product Manager

Posted: 12 March 2012 by Alistair Miles in Jobs

Just a brief post to say that we’re advertising for a Scientific Product Manager. This may not be obvious at a glance, but this job is primarily about management of web and data products. Previous experience in science is desirable but not necessary; applications are very welcome from anyone with a passion for developing and delivering high quality web and data products. Here’s a snippet from the job ad:

MalariaGEN aims to produce global data on natural genetic variation in parasite, mosquito and human populations, and to deliver these data via the www.malariagen.net website, alongside web tools which add value by enabling people to explore, understand and analyse the data. Some of these web and data products are intended for unrestricted use by the malaria research and public health community, to inform future research directions and malaria control policy. Other web and data products are being developed for private use by researchers contributing to MalariaGEN community projects, and provide a key incentive to participation in MalariaGEN, e.g., secure web tools providing access to fine-grained genetic data on individual samples.

We have a unique opportunity for a Product Manager to take responsibility for the development and delivery of MalariaGEN web and data products relating to genome sequencing, genotyping and population genetic data from Plasmodium, Anopheles and human populations.

The job is being advertised both at Sanger and Oxford because you could be based at either location. Here are the job ads in full:

Here at MalariaGEN, we use MySQL extensively, and there are myriad nice GUI tools for accessing it from our Ubuntu desktops. However, we also use Microsoft SQL Server for some of our particularly large laboratory data, and we wanted to access MS SQL Server databases on Ubuntu (11.04 Natty Narwhal) with a GUI, preferably with open source software.

Here is how to set up one such tool (SQuirreL SQL). Note that we will install the application system-wide; it is also possible to install it in your home directory, and to create the custom launcher in .local/share/applications if you like. We’re focussed on the install process on Ubuntu 11.04 with Unity, but these instructions should work on other modern Linux distros without too much modification. Read the rest of this entry »

I just spent a very pleasant afternoon catching up with colleagues at the Image Bioinformatics Research Group, based in the Department of Zoology here in Oxford. Here are a few tidbits I picked up …

Tanya Gray is working on the MIIDI standard (Minimum Information for an Infectious Disease Investigation) and associated tools. She’s done some very nice work on a MIIDI metadata editor, using eXist and Orbeon forms, with her own additions to generate XForms from an annotated XML Schema. Tanya’s also working on the DryadUK project, which is a data repository supporting publication of data associated with journal articles.

Stephen Wan (visiting from CSIRO) has developed a cool extension for Firefox (and now Chrome) called IBES (In-Browser Elaborative Summariser). If you point it at Wikipedia, for each link you hover over it shows a summary of the page at that link, built intelligently from the link’s context. Then if you navigate to the link, it tells you where you came from. Very handy if (like me) each visit to Wikipedia is a rambling journey, and you often forget why you went there in the first place. He’s also done some related work to help navigate citations in scholarly articles, called CSIBS (The Citation-Sensitive In-Browser Summarizer).

Alex Dutton is working on the JISC Open Citations project. He has some nice visualisations of citation networks (although one of the articles in that graph looks like it cites itself – if only that were possible :). The graphs are generated using dot from an RDF representation of metadata from the PubMedCentral Open-Access journal articles. All of the usual dot options are available, so you can play with how the networks get rendered. The whole site is driven by SPARQL, and the bottom of each page shows the SPARQL queries used to generate the page content, so you can see what’s going on under the hood.

Bhavana Ananda is working on the JISC DataFlow project, the DataStage component of which is a follow-on from previous work by Graham Klyne on the Admiral project. I think the philosophy of simple tools to help research groups manage and share their data with each other has a lot of traction, and I think it’s great they’ve got funding to turn the Admiral prototypes into something more.

Graham Klyne is embroiled in the Workflow 4Ever project, and we had a great chat about possible connections with managing our Plasmodium SNP discovery and genotyping pipelines for MalariaGEN. I’m now expecting Graham to solve all my problems.

And David Shotton (group head) is, as always, making it all happen. It was great to raise my head above the trenches for a few hours; I need to do that more often.

I just stumbled upon Brad Chapman’s Blue Collar Bioinformatics blog. It looks like a great resource; here are a few tidbits…

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. …

Parallel upload to Amazon S3 with python, boto and multiprocessing – One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores. …

Next generation sequencing information management and analysis system for Galaxy – Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy. …

CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at cloudbiolinux.org; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 …

Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo.

As we’re starting to get into cloud technology in a big way here, I decided to take part in Ubuntu’s IRC training days on the subject, Ubuntu Cloud Days. While the material covered was often more low-level than what we are likely to use in the short term, I found the sessions on Ensemble and CloudInit to be particularly useful, as they’ll be directly applicable to our upcoming GWAS work on the EC2 cloud.

Read the rest of this entry »

Python CSV Validator Library

Posted: 21 July 2011 by Alistair Miles in Uncategorized

As part of ongoing data quality-assurance work for MalariaGEN’s P. falciparum Genome Variation project, I’ve written a small Python library called csvvalidator for validating data in CSV files or similar row-oriented data sources.

The source code for csvvalidator is on GitHub, and you can find csvvalidator on the Python Package Index (so you can do easy_install csvvalidator).
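To give a flavour of what row-oriented validation involves, the following is a minimal standard-library sketch. It deliberately does not use csvvalidator’s API: the function name validate_rows, the problem codes and the layout of the problem dictionaries are all invented for illustration.

```python
import csv
import io


def validate_rows(source, expected_header, value_checks):
    """Yield a dict describing each problem found in a CSV-like text source.

    value_checks maps a field name to a callable that raises ValueError
    when a value is unacceptable (for example, int or float).
    """
    reader = csv.reader(source)
    header = next(reader, None)
    if header != list(expected_header):
        # Without a trustworthy header the remaining checks make no sense.
        yield {"row": 1, "code": "BAD_HEADER", "detail": header}
        return
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            yield {"row": row_number, "code": "BAD_LENGTH", "detail": row}
            continue
        record = dict(zip(header, row))
        for field, check in value_checks.items():
            try:
                check(record[field])
            except ValueError as error:
                yield {
                    "row": row_number,
                    "code": "BAD_VALUE",
                    "field": field,
                    "detail": str(error),
                }


# Example input: one good row, one bad value, one short row.
example = io.StringIO("study_id,age\n1,34\nx,20\n3\n")
for problem in validate_rows(example, ["study_id", "age"], {"study_id": int, "age": int}):
    print(problem["row"], problem["code"])
```

Reporting problems as plain dictionaries, one per offending row, keeps the checker independent of how the results are later summarised or written out.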

Here’s a simple example:

Read the rest of this entry »