Background: Clinical Data Curation in MalariaGEN and WWARN
In both MalariaGEN‘s Consortial Projects and WWARN we’ve been involved in aggregating clinical data from different studies and research groups, and a big challenge is dealing with heterogeneity in the source data. There is heterogeneity at multiple levels. We see a variety of file formats. Mostly the data are laid out as columnar tables, but we also see some weird and wonderful layouts. Then there is variety in how the tables are designed – some prefer relatively flat tables with one row per patient, others prefer one row per clinical event, observation or visit. And then there is a lot of diversity in which variables (like temperature, parasitaemia, etc.) have been recorded, how the variables have been named, what units have been used, etc. Finally, top that all off with plenty of subtlety in the semantics of the variables and the data (how was the temperature measured?).
The general approach through this morass is to design a standard schema for the data, with a well-defined set of variables. A transformation is then designed for each of the source datasets, mapping the data onto the standard schema.
The problem we have is that designing a transformation for each of the source datasets is a time-consuming task, requiring expertise on the part of the curator in data transformation techniques as well as lots of knowledge about the domain and experience of different ways of representing the data. These skills don’t often come together in one person. We’ve made various attempts at developing software tools that make designing transformations much easier and less technical, but we certainly don’t have it solved.
The other day I realised what now seems blindingly obvious, which is that SQL and relational views provide a declarative language and tool for designing transformations on columnar tables. This is still not the holy grail of a non-programmer’s tool for designing data transformations, but I thought if I could describe some transformation patterns, along with examples in SQL, that would take us a step in the right direction.
Now, rather than start with the easy stuff like converting temperature in Fahrenheit to temperature in Celsius, or multiplying two columns together, I thought I’d start with the harder cases involving transformations on time series data. Below are a couple of patterns with some SQL, this is not exhaustive by any means, but hopefully an interesting start.