Archive for August, 2011

Here at MalariaGEN, we use MySQL extensively, and there are myriad nice GUI tools for accessing it from our Ubuntu desktops. However, we also use Microsoft SQL Server for some of our particularly large laboratory data, and we wanted to access MS SQL Server databases on Ubuntu (11.04 Natty Narwhal) with a GUI, preferably with open source software.

Here is how to set up one such tool (SQuirrel SQL). Note that we will install the application system wide; it is also possible to install it in your home directory, and to create the custom launcher in .local/share/applications if you like. We’re focussed on the install process on Ubuntu 11.04 with Unity, but these instructions should work on other modern linux distros without too much modification. (more…)

dataMerger: Bringing Data Together

Posted: 5 August 2011 by Lee Hart in Uncategorized

It makes life easier when you have all of your data in one place, but it also helps if you can get all of your data into one simple file, ready for analysis. We already have a good set of tools in our web-based CDMS app Topheno, which helps us gather and transform clinical data into a standard set of fields. Over time, multiple standardised data files are produced for different studies, as well as for the same study. However, some curation is usually required to consolidate these clinical data into one resource, so that the amalgamated data can be compared with genotyping results. To help with this amalgamation process, I built a web application called dataMerger, which can import CSV data files and generate one flat file containing all of these data combined into one table. One complication is that the data from separate sources can often disagree, so a significant part of the dataMerger application is geared up to help us identify and resolve any data conflicts.

As an example of what this application can do, imagine that we have two files containing some overlapping data for a set of individuals.

File 1:
ID Name Location DOB Telephone
101 Billy London 1st April 1964 020 7123 1234
102 Bob Paris 2nd June 1978 01 23 45 67 89
103 Sally New York 3rd August 1939  
104 Jane Rome 4th May 1946 06 1234 1234


File 2:
ID Name Location Email Telephone
103 Sally Oxford 01865 123456
104 Jane Oxford 01865 123456
105 Pete Bamako 223 12345678
106 Fred Bangkok 02-1234567


The dataMerger application allows us to easily combine these data into one file, resolving any conflicts and incompleteness along the way. The output contains data from both sources, depending on decisions made by the user. The data provenance is also recorded.

ID Name Location DOB Email Telephone
101 Billy London 1st April 1964   020 7123 1234
102 Bob Paris 2nd June 1978   01 23 45 67 89
103 Sally Oxford 3rd August 1939 01865 123456
104 Jane Oxford 4th May 1946 01865 123456
105 Pete Bamako 223 12345678
106 Fred Bangkok 02-1234567


In this example, you can see that the location for Sally is “Oxford” in the final output, despite it being “New York” in the first source file. This is the result of an explicit decision made by the user to prefer the value from the second source file, wherever such conflicts occur. Similarly, the telephone number for Sally has been taken from second source file, where no value exists in the first source file. All of the decisions for conflict resolutions are configurable and recorded alongside the output.

The source code for this application is open and freely available from GitHub:

This tool uses Java servlets, JSP, Maven, MySQL, JavaScript (JQuery, JSON, AJAX), CSS and XHTML. Maven was used as a convenience, but is not essential. This application also makes use of Andrew Valums’ rather nifty JQuery plug-in for uploading files, which is also freely available and open source (GPL),

The software development itself covered a wide range of topics, including:

  • requirements gathering and issue tracking;
  • technology choice, architecture choice, development approaches;
  • web application security, user-access schemes, resource sharing;
  • off-the-shelf versus tailor-made, open source integrations;
  • user interface design, REST, user experience, workflow;
  • import / export of data file formats, cross-platform compatibility;
  • dynamic database structures, data storage efficiency;
  • strategies for handling data conflicts, nulls and missingness;
  • database query performance and benchmarking procedural algorithms;
  • balancing scalability with urgency and purpose-built engineering;
  • balancing portability with close-coupling and interoperability; and
  • software versioning, data provenance, deployment strategies.


I had a lot of fun working on this project and I learnt a lot along the way too. 🙂

I just spent a very pleasant afternoon catching up with colleagues at the Image Bioinformatics Research Group, based in the department of Zoology here in Oxford. Here’s a few tidbits I picked up …

Tanya Gray is working on the MIIDI standard (Minimum Information for an Infections Disease Investigation) and associate tools. She’s done some very nice work on a MIIDI metadata editor, using eXist and Orbeon forms, with her own additions to generate XForms from an annotated XML Schema. Tanya’s also working on the DryadUK project, which is a data repository supporting publication of data associated with journal articles.

Stephen Wan (visiting from CSIRO) has developed a cool extension for Firefox (and now Chrome) called IBES (In-Browser Elaborative Summariser). If you point it at Wikipedia, for each link you hover over it shows a summary of the page at that link, built intelligently from the link’s context. Then if you navigate to the link, it tells you where you came from. Very handy if (like me) each visit to Wikipedia is a rambling journey, and you often forget why you went there in the first place. He’s also done some related work to help navigate citations in scholarly articles, called CSIBS (The Citation-Sensitive In-Browser Summarizer).

Alex Dutton is working on the JISC Open Citations project. He has some nice visualisations of citation networks (although one of the articles in that graph looks like it cites itself – if only that were possible :). The graphs are generated using dot from RDF representation of metadata from the PubMedCentral Open-Access journal articles. All of the usual dot options are available, so you can play with how the networks get rendered. The whole site is driven by SPARQL, and the bottom of each page shows the SPARQL queries used to generate the page content, so you can see what’s going on under the hood.

Bhavana Ananda is working on the JISC DataFlow project, the DataStage component of which is a follow-on from previous work by Graham Klyne on the Admiral project. I think the philosophy of simple tools to help research groups manage and share their data with each other has a lot of traction, and I think it’s great they’ve got funding to turn the Admiral prototypes into something more.

Graham Klyne is embroiled in the Workflow 4Ever project, and we had a great chat about possible connections with managing our Plasmodium SNP discovery and genotyping pipelines for MalariaGEN. I’m now expecting Graham to solve all my problems.

And David Shotton (group head) is, as always, making it all happen. It was great to raise my head above the trenches for a few hours, I need to do that more often.

I just stumbled upon Brad Chapman’s Blue Collar Bioinformatics blog, it looks like a great resource, here’s a few tidbits…

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. …

Parallel upload to Amazon S3 with python, boto and multiprocessing – One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores. …

Next generation sequencing information management and analysis system for Galaxy – Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy. …

CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 …

Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo.