dataMerger: Bringing Data Together

Posted: 5 August 2011 by Lee Hart in Uncategorized

It makes life easier when you have all of your data in one place, and easier still if you can get all of your data into one simple file, ready for analysis. We already have a good set of tools in our web-based CDMS app Topheno, which helps us gather and transform clinical data into a standard set of fields. Over time, multiple standardised data files are produced, both for different studies and for the same study. Some curation is usually required to consolidate these clinical data into one resource, so that the amalgamated data can be compared with genotyping results. To help with this amalgamation process, I built a web application called dataMerger, which can import CSV data files and generate one flat file containing all of these data combined into one table. One complication is that data from separate sources often disagree, so a significant part of the dataMerger application is designed to help us identify and resolve data conflicts.

As an example of what this application can do, imagine that we have two files containing some overlapping data for a set of individuals.

File 1:
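| ID  | Name  | Location | DOB             | Telephone      |
|-----|-------|----------|-----------------|----------------|
| 101 | Billy | London   | 1st April 1964  | 020 7123 1234  |
| 102 | Bob   | Paris    | 2nd June 1978   | 01 23 45 67 89 |
| 103 | Sally | New York | 3rd August 1939 |                |
| 104 | Jane  | Rome     | 4th May 1946    | 06 1234 1234   |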

File 2:
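| ID  | Name  | Location | Email             | Telephone    |
|-----|-------|----------|-------------------|--------------|
| 103 | Sally | Oxford   | sally@example.org | 01865 123456 |
| 104 | Jane  | Oxford   | jane@example.org  | 01865 123456 |
| 105 | Pete  | Bamako   | pete@example.org  | 223 12345678 |
| 106 | Fred  | Bangkok  | fred@example.org  | 02-1234567   |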

The dataMerger application allows us to easily combine these data into one file, resolving any conflicts and incompleteness along the way. The output contains data from both sources, depending on decisions made by the user. The data provenance is also recorded.

Output:
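| ID  | Name  | Location | DOB             | Email             | Telephone      |
|-----|-------|----------|-----------------|-------------------|----------------|
| 101 | Billy | London   | 1st April 1964  |                   | 020 7123 1234  |
| 102 | Bob   | Paris    | 2nd June 1978   |                   | 01 23 45 67 89 |
| 103 | Sally | Oxford   | 3rd August 1939 | sally@example.org | 01865 123456   |
| 104 | Jane  | Oxford   | 4th May 1946    | jane@example.org  | 01865 123456   |
| 105 | Pete  | Bamako   |                 | pete@example.org  | 223 12345678   |
| 106 | Fred  | Bangkok  |                 | fred@example.org  | 02-1234567     |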

In this example, the location for Sally is “Oxford” in the final output, despite being “New York” in the first source file. This is the result of an explicit decision by the user to prefer the value from the second source file wherever such conflicts occur. Sally’s telephone number has also been taken from the second source file, in this case because no value exists in the first source file. All conflict-resolution decisions are configurable and recorded alongside the output.
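
A minimal Java sketch of this kind of rule-based resolution might look like the following. It is not taken from the dataMerger code base: the class and method names (`MergeSketch`, `mergeRow`, `row`), the `Preference` setting and the provenance labels (`file1`, `file2`, `both`, `none`) are all hypothetical, and the sketch assumes that rows have already been matched by ID and that a blank value counts as missing.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these names are illustrative and do not come from dataMerger.
final class MergeSketch {

    /** Which source wins when both files hold different, non-empty values. */
    enum Preference { FIRST, SECOND }

    /** A merged value plus its provenance ("file1", "file2", "both" or "none"). */
    static final class Cell {
        final String value;
        final String source;

        Cell(String value, String source) {
            this.value = value;
            this.source = source;
        }

        @Override
        public String toString() {
            return value + " [" + source + "]";
        }
    }

    /** Builds an ordered row from alternating field names and values. */
    static Map<String, String> row(String... namesAndValues) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i + 1 < namesAndValues.length; i += 2) {
            row.put(namesAndValues[i], namesAndValues[i + 1]);
        }
        return row;
    }

    /** Assumption: null and blank strings are both treated as missing. */
    static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }

    /** Merges two rows that describe the same individual (same ID). */
    static Map<String, Cell> mergeRow(Map<String, String> first,
                                      Map<String, String> second,
                                      Preference preference) {
        // Union of the field names, first file's fields first.
        Set<String> fields = new LinkedHashSet<>(first.keySet());
        fields.addAll(second.keySet());

        Map<String, Cell> merged = new LinkedHashMap<>();
        for (String field : fields) {
            String a = first.get(field);
            String b = second.get(field);
            Cell cell;
            if (isMissing(a) && isMissing(b)) {
                cell = new Cell("", "none");      // missing in both sources
            } else if (isMissing(a)) {
                cell = new Cell(b, "file2");      // gap filled from the second file
            } else if (isMissing(b)) {
                cell = new Cell(a, "file1");      // gap filled from the first file
            } else if (a.equals(b)) {
                cell = new Cell(a, "both");       // sources agree
            } else if (preference == Preference.SECOND) {
                cell = new Cell(b, "file2");      // conflict: prefer the second file
            } else {
                cell = new Cell(a, "file1");      // conflict: prefer the first file
            }
            merged.put(field, cell);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> sally1 = row("ID", "103", "Name", "Sally",
                "Location", "New York", "DOB", "3rd August 1939", "Telephone", "");
        Map<String, String> sally2 = row("ID", "103", "Name", "Sally",
                "Location", "Oxford", "Email", "sally@example.org",
                "Telephone", "01865 123456");

        Map<String, Cell> merged = mergeRow(sally1, sally2, Preference.SECOND);
        for (Map.Entry<String, Cell> entry : merged.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Run against Sally’s two rows with the second file preferred, this resolves the location to “Oxford”, fills the telephone number from the second file, and carries the date of birth over from the first. Each value is tagged with the source it came from, which is one simple way of keeping provenance next to the merged data.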

The source code for this application is open and freely available from GitHub: https://github.com/leehart/dataMerger

This tool uses Java servlets, JSP, Maven, MySQL, JavaScript (jQuery, JSON, AJAX), CSS and XHTML. Maven was used as a convenience, but is not essential. This application also makes use of Andrew Valums’ rather nifty jQuery plug-in for uploading files, which is also freely available and open source (GPL): https://github.com/Valums-File-Uploader/file-uploader

The software development itself covered a wide range of topics, including:

  • requirements gathering and issue tracking;
  • technology choice, architecture choice, development approaches;
  • web application security, user-access schemes, resource sharing;
  • off-the-shelf versus tailor-made, open source integrations;
  • user interface design, REST, user experience, workflow;
  • import / export of data file formats, cross-platform compatibility;
  • dynamic database structures, data storage efficiency;
  • strategies for handling data conflicts, nulls and missingness;
  • database query performance and benchmarking procedural algorithms;
  • balancing scalability with urgency and purpose-built engineering;
  • balancing portability with close-coupling and interoperability; and
  • software versioning, data provenance, deployment strategies.

 

I had a lot of fun working on this project and I learnt a lot along the way too. 🙂
