Archive for June, 2011

Background: Clinical Data Curation in MalariaGEN and WWARN

In both MalariaGEN’s Consortial Projects and WWARN we’ve been involved in aggregating clinical data from different studies and research groups, and a big challenge is dealing with heterogeneity in the source data. There is heterogeneity at multiple levels. We see a variety of file formats. Mostly the data are laid out as columnar tables, but we also see some weird and wonderful layouts. Then there is variety in how the tables are designed – some prefer relatively flat tables with one row per patient, others prefer one row per clinical event, observation or visit. And then there is a lot of diversity in which variables (like temperature, parasitaemia, etc.) have been recorded, how the variables have been named, what units have been used, etc. Finally, top that all off with plenty of subtlety in the semantics of the variables and the data (how was the temperature measured?).

The general approach through this morass is to design a standard schema for the data, with a well-defined set of variables. A transformation is then designed for each of the source datasets, mapping the data onto the standard schema.

The problem we have is that designing a transformation for each of the source datasets is a time-consuming task, requiring expertise on the part of the curator in data transformation techniques as well as lots of knowledge about the domain and experience of different ways of representing the data. These skills don’t often come together in one person. We’ve made various attempts at developing software tools that make designing transformations much easier and less technical, but we certainly don’t have it solved.

The other day I realised what now seems blindingly obvious, which is that SQL and relational views provide a declarative language and tool for designing transformations on columnar tables. This is still not the holy grail of a non-programmer’s tool for designing data transformations, but I thought if I could describe some transformation patterns, along with examples in SQL, that would take us a step in the right direction.

Now, rather than start with the easy stuff like converting temperature in Fahrenheit to temperature in Celsius, or multiplying two columns together, I thought I’d start with the harder cases involving transformations on time series data. Below are a couple of patterns with some SQL. This is not exhaustive by any means, but hopefully an interesting start.
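To give a flavour of the idea, here is a minimal sketch (not one of the original patterns from the post) of a relational view acting as a transformation from a one-row-per-visit source table onto a one-row-per-patient standard schema. It uses SQLite via Python's standard `sqlite3` module so it can be run as-is. The table name `source_visits`, the view name `standard_patients`, all column names and the sample values are made up for illustration.

```python
import sqlite3

# All table, view and column names below are hypothetical, as is the data.
SCHEMA_AND_DATA = """
CREATE TABLE source_visits (
    patient_id       TEXT,
    visit_day        INTEGER,  -- days since enrolment
    temp_f           REAL,     -- temperature recorded in Fahrenheit
    parasites_per_ul INTEGER   -- parasitaemia
);

INSERT INTO source_visits VALUES
    ('P1', 0, 102.2, 48000),
    ('P1', 7,  98.6,     0),
    ('P2', 0, 100.4, 12000);

-- The view is the transformation: one row per patient, with the
-- time series pivoted into one column per follow-up day.
CREATE VIEW standard_patients AS
SELECT
    patient_id,
    MAX(CASE WHEN visit_day = 0
             THEN ROUND((temp_f - 32) * 5.0 / 9.0, 1) END) AS temp_c_day0,
    MAX(CASE WHEN visit_day = 0 THEN parasites_per_ul END) AS parasitaemia_day0,
    MAX(CASE WHEN visit_day = 7 THEN parasites_per_ul END) AS parasitaemia_day7
FROM source_visits
GROUP BY patient_id;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_AND_DATA)

# Querying the view returns the data in the standard layout.
rows = conn.execute(
    "SELECT * FROM standard_patients ORDER BY patient_id"
).fetchall()
for row in rows:
    print(row)
```

A patient with no day 7 visit simply gets a NULL (`None` in Python) in the day 7 column, which is one reason the `MAX(CASE ...)` form is convenient for this kind of pivot.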



In the world of open source software, licenses like GPL, LGPL, MIT, etc., are generally viewed as a good thing, as they allow the authors of the software to place limited restrictions on the re-use of software according to their preference, whilst still being able to publish the source code. Similarly, for other creative works and open access publishing, the Creative Commons licenses are generally viewed as beneficial, because they allow authors to protect the integrity of their work if desired along with their right to attribution, but not otherwise limit access to or re-use of their work.

So what about scientific data? In MalariaGEN, we are developing policies for “community projects” where partners from independent research institutions around the world submit samples for sequencing. Ultimately, we would like to make all of the data derived from sequencing those samples available to the scientific research community, but we would also like to protect our partners’ investment in collecting those samples by ensuring they are attributed when data are re-used. So, I thought, surely the best way to do this is to publish the data under a CC-like license, right?

It turns out this is not the current consensus. Science Commons have published a Protocol for Implementing Open Access Data, which (in section 5) has a good explanation of why using intellectual property rights (i.e., licenses) to enforce norms of attribution or share-alike is a bad idea. So the protocol states that:

[…] to facilitate data integration and open access data sharing, any implementation of this protocol MUST waive all rights necessary for data extraction and re-use […] and MUST NOT apply any obligations on the user of the data or database such as “copyleft” or “share alike”, or even the legal requirement to provide attribution.

This is consistent with policies adopted by major scientific data publishers like the European Nucleotide Archive (ENA), e.g.:

The INSD will not attach statements to records that restrict access to the data, limit the use of the information in these records, or prohibit certain types of publications based on these records. Specifically, no use restrictions or licensing requirements will be included in any sequence data records, and no restrictions or licensing fees will be placed on the redistribution or use of the database by any party.

However, the Science Commons protocol also says that:

Any implementation SHOULD define a non-legally binding set of citation norms in clear, lay-readable language.

I found a short article about how to mount a TrueCrypt volume from a shell script without showing the password in the process list. See also the comments, which provide some alternatives and clarification.

I use…

echo "$password" | truecrypt -t -k "" --protect-hidden=no /path/to/ /media/truecrypt1

The -t option makes TrueCrypt work in text mode so you can pipe the password to a terminal prompt, and the other options (-k "" --protect-hidden=no) prevent TrueCrypt giving additional prompts which would otherwise confuse the piping of the password variable.
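The same approach can be sketched in Python, feeding the password to the process on standard input so it never appears in the argument list. This is only a sketch: it assumes a `truecrypt` binary on the PATH that accepts the password on stdin in text mode as described above, and it uses only the options shown in the command above. The function names and the paths in the usage note are hypothetical.

```python
import subprocess


def build_mount_command(volume, mount_point):
    """Build the argument list; the password is deliberately not part of it."""
    return ["truecrypt", "-t", "-k", "", "--protect-hidden=no",
            volume, mount_point]


def mount_volume(volume, mount_point, password):
    """Run truecrypt, writing the password to its stdin instead of argv."""
    command = build_mount_command(volume, mount_point)
    # input= sends the text to the child's stdin; check=True raises
    # CalledProcessError if truecrypt exits with a non-zero status.
    return subprocess.run(command, input=password + "\n",
                          text=True, check=True)
```

A call might look like `mount_volume("/some/volume.tc", "/media/truecrypt1", getpass.getpass())`, where `getpass` is the standard library module for prompting without echo.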

I have a confession to make: I really like Ubuntu’s design, its look and feel, and its colour scheme. And it seems to get better with each release. The new Natty theme is really beautiful, and the dark window decorations of the Ambiance theme are great (especially now that they’ve chased down the odd dark text on dark background problems).

Having said that, there’s one thing that I really don’t like: the way that it’s virtually impossible to tell what you have selected in windows other than the one that happens to be focused. This is because selected elements in unfocused windows get completely desaturated, like this:

Which items are selected? It's anyone's guess!


Last year I did a user management and single sign-on (SSO) implementation for the Worldwide Anti-malarial Resistance Network (WWARN). After much clattering about with various pieces of software, I settled on the following.

Drupal has built-in features for user registration and management, and we had already decided to use Drupal as the CMS for static content, so it made sense to try and re-use these capabilities. Alongside Drupal we deployed CAS, which acts as an SSO authentication server. Drupal and CAS can be made to talk to each other using the Drupal/CAS module, so we installed and configured this. This changes the behaviour of the normal login and logout links in Drupal, so you’re redirected to the CAS login/logout screens instead, and handles ticket validation after you’ve successfully logged in to CAS and been redirected back to Drupal. Rather than deploy a separate user directory or database, we used the Drupal database, i.e., we configured CAS to query the Drupal database directly via JDBC when checking user login credentials.
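To make the redirect and ticket validation steps a little more concrete, here is a rough sketch of what a CAS client does under the CAS 2.0 protocol. This is not the Drupal/CAS module's code; the host names and service URL are made-up placeholders, and the helper function names are my own. The `/login` and `/serviceValidate` endpoints and the XML namespace are those defined by the CAS protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical deployment details, for illustration only.
CAS_BASE = "https://cas.example.org/cas"
SERVICE = "https://www.example.org/drupal/cas"

# XML namespace used in CAS 2.0 validation responses.
CAS_NS = {"cas": "http://www.yale.edu/tp/cas"}


def login_url(cas_base, service):
    """URL the browser is redirected to when the user clicks 'log in'."""
    return f"{cas_base}/login?{urlencode({'service': service})}"


def validation_url(cas_base, service, ticket):
    """URL the client application calls server-side to validate the
    ticket that CAS appended to the service URL after login."""
    query = urlencode({"service": service, "ticket": ticket})
    return f"{cas_base}/serviceValidate?{query}"


def parse_validation_response(xml_text):
    """Return the authenticated username, or None if validation failed."""
    root = ET.fromstring(xml_text)
    user = root.find("cas:authenticationSuccess/cas:user", CAS_NS)
    if user is None or not user.text:
        return None
    return user.text.strip()


SAMPLE_SUCCESS = """
<cas:serviceResponse xmlns:cas="http://www.yale.edu/tp/cas">
  <cas:authenticationSuccess>
    <cas:user>alice</cas:user>
  </cas:authenticationSuccess>
</cas:serviceResponse>
"""

print(login_url(CAS_BASE, SERVICE))
print(validation_url(CAS_BASE, SERVICE, "ST-1-abc"))
print(parse_validation_response(SAMPLE_SUCCESS))
```

The key point is that the application never sees the user's password: it only ever receives a one-time ticket, which it exchanges with CAS for a username.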

Any Java web applications that we wanted to integrate with the CAS SSO service were integrated using the Spring Security CAS implementation. Any other web applications could be integrated using mod_auth_cas if running as a CGI-style application under Apache, or, if running behind Apache as a reverse proxy, by using the pre-authentication pattern.
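For the reverse proxy case, the pre-authentication pattern amounts to the backend application trusting a username that the proxy has already established. A minimal WSGI sketch of the idea follows. The header name `X-Forwarded-User` (seen by WSGI as `HTTP_X_FORWARDED_USER`) is an assumption chosen for illustration, not something mod_auth_cas is known to set by default, and the function names are hypothetical.

```python
def preauthenticated_user(environ, key="HTTP_X_FORWARDED_USER"):
    """Return the username supplied by the trusted proxy, or None.

    The header name is hypothetical; use whatever your proxy is
    configured to send.
    """
    user = environ.get(key, "").strip()
    return user or None


def app(environ, start_response):
    """Tiny WSGI application that relies on the proxy for authentication."""
    user = preauthenticated_user(environ)
    if user is None:
        start_response("401 Unauthorized", [("Content-Type", "text/plain")])
        return [b"No pre-authenticated user\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [f"Hello, {user}\n".encode("utf-8")]
```

This is only safe if the application is reachable solely through the proxy, and the proxy strips any copy of the header sent by the client before adding its own.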

The only coding needed to make this all happen was the cosmetic work needed to make the CAS login and logout screens look like part of the same website, which was fairly straightforward.

The diagram below is a bit rough around the edges, but hopefully it gives an outline of how this is all set up.

Note that there are no restrictions on where any of these components are hosted: CAS, Drupal, and the webapps could all be hosted on different servers on different networks, or all on the same computer. It doesn’t matter, as long as they can talk to each other.

For WWARN we also used Drupal to define and assign roles to users, which were then used by other applications to implement authorisation policies, but this isn’t necessary to achieve SSO authentication. To use Drupal for role management the other applications also had to query the Drupal database. However, I believe this could also be achieved via SAML attribute release, which would remove the need for extra JDBC communication.

I’ve had Sun GridEngine running on our cluster of 12-core HP blades from its earliest days. What has not been working is the inter-host communication (the ability of the system to schedule and distribute jobs across the nodes). I therefore set out to fix this situation. It turns out that the problems that prevented this from working are mainly caused by quirks in the way that the Debian (and by inheritance, Ubuntu) packaging was done. (more…)


Posted: 1 June 2011 by magnusmanske in Software

Previously, this blog has talked about genotyping, SNP-o-matic, and how to find genetic variants in our DNA samples. But what data are these variant “calls” actually based on? And what does this data look like, in its raw, “un-called” form? Does variant discovery rely exclusively on fancy software, or can the good ol’ Mark I eyeball still uncover hidden genetic features? (more…)