Visualisation software round-up

A common need that we have is to directly view, or interpretively visualise information (both numeric and categoric) that is attached to a particular point on genomic sequence, often in relation to some attributes of that sequence. The number of file formats and tools that have been written for doing this surprised me when I first looked. This post is the first step in looking at what is out there. For this purpose I’m limiting myself to looking at tools that read the popular VCF format (Not to be confused with v-Card). This will only scratch the surface – a quick look at this list shows there is more bioinformatics software than you can shake a double helix at.

IGV – ‘Integrative Genomics Veiwer’

IGV is a java app, that loads a veritable kitchen sink of formats. It is integrated with the Java Web Start system that allows launching of a Java app with ‘one-click’ from your web browser. Files can be loaded from disk or over http/ftp/DAS or from a curated set that the app has metadata for. The state of the app on startup can be specified by command line args or XML config. Combined with a php script that cooks up custom Web Start files one can actually link to a specific view on a specific dataset (if that data is public). This gives web-app style linking, albeit with a bit of a wait and no access control.

I’m only interested at the moment in using IGV to look at SNP data from VCF files – it does much more than this, for example reads from BAM files. Before loading you need to pre-process the VCF to create an index using igvtools which is accessed from the ‘File’ menu in IGV. Indexing our VCF files originally failed – IGV complained that they did not comply with the the VCF4.0 spec as they had whitespace in the INFO field. I confirmed this with VCF tools – in fact the error message from IGV was more instructive as it had the line number of the problem. This is defiantly the fault of our systems and something I hope we can eradicate through better automated testing and persuading people that sticking to standards is in their best interest. For now I just fixed this by truncating the file before the problem.

Firstly one picks the reference genome onto which the VCF file will be mapped. IGV comes with quite a selection available in its curated set, or one can be loaded. The VCF will need the same chromosome names or it will not map. For example I had to pick an old Plasmodium reference as our VCF had the old ‘MAL1’ style chromosome names. IGV is very flexible in what it will load – one could add extra columns to the VCF and have them displayed along side.

IGV InterfaceOnce loaded one is presented with a layout with base as the X-axis and sample as the Y, you can drag around and use the arrow keys to move left right or use the stylised scroll bar at the top. I couldn’t use the mouse wheel to zoom 😦 but you can use Ctrl-+.  The app keeps a record of your locations which you can navigate with the forward and back buttons. You can skip to a point or gene label using the search box at the top – this auto-completes and will give you a list to pick from if it gets more than one match. I managed to make the app hang by searching for a single letter though.

IGV Pop-upHovering over any point gives information about that point in a window that disappears as soon as you move away – making you feel like you’re playing some kind of steady hand game. This also means that you can’t cut and paste that info out of it. There is a toolbar button that replaces the pop-up with a separate window with text, but I couldn’t copy out of that either. Right-clicking brings up a context menu that lets you sort by the selection or change how it displayed, for example switching between colouring for allele or genotype. As far as I can see the display is always relative to the reference genome. Although you can mark regions of interest you can’t pick a set of SNP positions and then just view those without the intervening bases, or order them by any criteria but genomic position. Above the individual samples is a summary section which for each position shows a small bar which is coloured in proportion the samples’ genotype distribution.

The code is on github (yay!) and appears to be under active development. In summary IGV is a flexible tool for viewing data, but does not offer any tools specifically for exploring variation through SNPs as in our use case.

VARiation Browser

VARB is a C++/QT app that only views VCF files. It is distributed as a binary but with shared linking to QT so I had to ‘apt-get libqt’ before it would start. The source is distributed as a zip file so I can’t tell if it is under active development or submit changes as anything but a patchfile. VARB loads requires three files, a reference in FASTA format, an annotation file in GFF format and finally the VCF. I used the FASTA from here and the GFF from the VARB example files. In loading our malformed VCF VARB also failed but did not provide any clue beyond saying that the file was malformed.

VARB offers the same kind of navigation as IGV, again no mouse-wheel and strangely zooming is relative to the left edge. SNPs can disappear and re-appear as one zooms as the rasterisation algorithm doesn’t cope with sparse SNPs on zoomed out regions. The controls and drawing appear to run in the same thread which makes navigation hard. There is an annotation search, but with no complete. The selection tool was much more useful however with the details coming up in the sidebar and easily copied as clicking makes the details stick in the window until cleared.

As well as the information from the VCF VARB adds some analytical output at the bottom of the window. This is fixed to the GC density, Relative variant density, Fst and Tajima’s D, these are updated as one changes the quality, depth and SNP type filters on the left. The windows used for calculating these are fixed and zoom independent. Samples can be grouped, and this grouping is used for the Fst calculation – although I’m not sure how it works out Fst for more than one group. As in IGV there is no way to view the SNPs or samples in any way but sequence order and with separation. The colours can be re-assigned – I found that setting the reference allele colour to white let me see the variation much more clearly. With a few tweaks VARB could be a very use-able SNP browser.

BAMSeek

BAMSeek isn’t so much a visualisation tool as it is a file inspection tool. It is distributed as a JAR file with source on Google Code. It supports quite a few formats and is primarily designed for loading large files as it indexes, and then pages, the file as needed.  Anyone who has used a normal text editor will know the pain of large files (I have found Sublime Text handles them well though after a slightly long loading).  BAMSeek successfully loaded our off-spec VCF file – probably as does not fully parse it in order to display its textual content. The VCF file is simply displayed in a table with the header in a separate section. The paging is done by having actual pages that you flip through with a control on the bottom. The line numbers on the left are relative to the page – which is a little frustrating as to get the actual line number you have to do ((page-1)*(rows_per_page)+line_no) in your head. Hovering over a cell gives you the information formatted vertically. There’s not much more to it than that!

Next time we’ll look at some web-based apps that do a similar job.

Advertisements

These days, virtualisation is all the rage. The various competing virtualisation products have reached a level of maturity where they can be reliably used for server consolidation. VirtualBox is one of the easiest to use, most featureful programs available in this space and with the ability to run on many different OSes on hardware with or without VM extensions, it is also one of the most popular. However, there is one wrinkle when it comes to using it for server consolidation: the proprietary RDP/USB2 extension pack.

The conventional wisdom when running a headless server with VirtualBox is that you need to install this proprietary extension pack from Oracle. This is fine until you want to use the server in production: as the PUEL only covers you for personal use and evaluation, you must purchase licenses. You can either pay £34 per user or £670 per “socket” (which has quite a convoluted definition). This gets you USB2 and RDP support.

However, there is another way, at least when it comes to RDP support. Read the rest of this entry »

Update 2012-04-25: mysql workbench has now appeared in the universe package archive. You should be able to install it with a simple:

sudo apt-get install mysql-workbench

Read on if you still want to compile from source.

Right now (2012-04-04), Ubuntu 12.04 hasn’t been released yet, and so there is no binary package from Oracle of MySQL Workbench for Precise. I managed to get the MySQL Workbench binaries for Oneiric to run, by manually installing libzip1_0.9.3-1_amd64.deb from Oneiric, but this wasn’t stable (crashed as soon as I tried to run a SQL Query).

So I decided to build from source. Here’s how I did it Read the rest of this entry »

We use Sun Grid Engine here at WTCHG for managing our compute resources. Many of the analyses I’m doing are best run as array jobs, which generally works very well, but sometimes one or more tasks will fail for one reason or another, and I’ve been casting around for best practice when it comes to (a) verifying which tasks succeeded and which failed, and (b) re-running failed tasks.

I found a nice post by Shiran Pasternak on resubmitting failed SGE array tasks, however Shiran doesn’t say how he determined which tasks had failed, and the set of tasks to rerun is specified manually. I have thousands of tasks in each array job, and so I really need an automated way of determining the success/failure of each task and rerunning those that failed.

I came up with the following pattern.

Read the rest of this entry »

Jobs – Bioinformatics

Posted: 23 March 2012 by Alistair Miles in Jobs
Tags:

We’re advertising bioinformatics jobs at both Oxford and Sanger (near Cambridge), see the following links for job descriptions and information on how to apply:

Here’s an excerpt from the job description:

Overview of role

All MalariaGEN projects working on parasite and vector biology depend on next-generation sequencing. Over 2,000 samples of parasite DNA have been sequenced, and at least 10,000 samples will have been sequenced by 2015. Genome sequencing has been carried out on approximately 200 Anopheles samples to date, and the aim is to sequence approximately 2,500 individuals over the next 4 years. Most parasite samples have been extracted directly from infected blood samples, and so present additional complexities such as small quantities of DNA and mixed infection.

Raw next-generation sequence data is the beginning of a complex and intellectual demanding analysis process. The primary goal is to discover robust evidence for genetic variation. However, building from raw sequence data to robust variation data is and will continue to be one of the most significant challenges facing the malaria research community over coming years. Working to iteratively improve the quality of our genetic variation data and reach deeper into the Plasmodium and Anopheles genomes is the main focus of the MalariaGEN Bioinformatician roles.

This is an extremely fast-paced area of current research and development, and new methods and tools are emerging from many leading research groups and projects, many of whom we have close contacts with. However, we have to strike a balance between looking to the future, and delivering data to MalariaGEN partners that might not be perfect or complete but which nevertheless provides a highly valuable research tool for a range of studies, such as genotype-phenotype association studies, and studies of parasite and vector population structure and dynamics.

To achieve this balance between methods development on the one hand, and production of data on the other, our bioinformatics programme is organised around two working groups. The methods development group is focused on the development, exploration and thorough evaluation of new methods, including methods for sequence alignment, variation calling and genotyping, working closely with statisticians. The production group is focused on establishing tightly specified data analysis pipelines and using them to produce high quality variation data in a reproducible way. Both working groups work to a quarterly data release cycle, where the methods development looks ahead to the next release and determines the best available methods, which are then adopted and implemented by production.

While this role may focus more on methods development or production at different times, we encourage participation in both working groups, as there are important insights that can only be gained by working across both.

Jobs – Scientific Software Engineering

Posted: 23 March 2012 by Alistair Miles in Jobs
Tags:

We’re advertising for software engineers, the job title says “scientific” but no previous experience of scientific programming is required, applications are very welcome from anyone with a strong software engineering background and an interest in the life sciences and/or public health.

We’re advertising at both Sanger (near Cambridge) and Oxford, see the following links for job descriptions and information on how to apply:

Here’s a snippet from the job ad:

MalariaGEN aims to produce global data on natural genetic variation in parasite, mosquito and human populations, and to deliver these data via the MalariaGEN website, alongside web tools which add value by enabling people to explore, understand and analyse the data. Some of these web and data products are intended for unrestricted use by the malaria research and public health community, to inform future research directions and malaria control policy. Other web and data products are being developed for private use by researchers contributing to MalariaGEN community projects, and provide a key incentive to participation in MalariaGEN, e.g., secure web tools providing access to fine-grained genetic data on individual samples.

We have a unique opportunity for Software Engineers to take a key role in the development and implementation of software projects relating to MalariaGEN web and data products.

The job description has a bit more background:

MalariaGEN continues to present many challenges that require development of new software applications. These include:

  • Web applications to present and visualise complex data

  • Software for data analysis and analysis pipelines, typically compute-intensive involving terabytes of data

  • Laboratory information management systems (LIMS) to keep track of samples, data and high-throughput experiments

  • Business and collaboration systems to administrate and coordinate a complex global research network, and to enable partners from different institutions to share information and effectively work together

The Web continues to be our primary platform for delivering software applications, and we have specialist expertise in Web application development and Web standards within the team. However we also develop other types of application as the problem requires.

Members of the software engineering team are equally capable of working across all stages of the software project life cycle, from requirements analysis and design through to implementation and testing, and we support the development of skills and experience across these different areas.

All software we develop is or will be released under an open source license. We also make use of existing open source software where possible and actively contribute to a number of open source projects. An interest in open source software and previous experience of participation in open source projects is an advantage.

We are working in a fast-moving area of scientific research, and we are constantly having to innovate. However, we also have a strong focus on the scientific robustness of the products delivered by MalariaGEN, and value a dedication to quality and sound engineering practices.

 

Job – Clinical Data Curator

Posted: 12 March 2012 by Alistair Miles in Jobs, Uncategorized
Tags:

We’re advertising for a clinical data curator, here’s a snippet from the job ad:

Applications are invited for a MalariaGEN Clinical Data Curator to work in a data-sharing community developing new tools to control malaria by integrating epidemiology with genome science.

You will be a member of the MalariaGEN resource centre and will focus on MalariaGEN consortial data. This data relates to three of our human consortial projects; Genetic determinants of resistance to malaria; Genetic determinants of the immune response to malaria and Human genome variation in malaria-endemic regions.

…the actual work involves curating large amounts of clinical data relating to cases of severe malaria from Africa. The data originate from different countries, studies, research groups, … basically the data can be quite heterogeneous, and needs to be carefully managed, standardised and quality controlled to enable the data to be aggregated then analysed.

Dealing with clinical research data on this scale has been an ongoing challenge for our team for many years, and remains a critical part of realising MalariaGEN‘s studies of human resistance to malaria. If you enjoy dealing with real-world data management problems, have an interest in the life sciences and/or public health, and enjoy working with a diverse community of people from different parts of the world, your application would be very welcome.

Further details are available at the link below: