Posts Tagged ‘pipelines’

I just spent a very pleasant afternoon catching up with colleagues at the Image Bioinformatics Research Group, based in the department of Zoology here in Oxford. Here’s a few tidbits I picked up …

Tanya Gray is working on the MIIDI standard (Minimum Information for an Infections Disease Investigation) and associate tools. She’s done some very nice work on a MIIDI metadata editor, using eXist and Orbeon forms, with her own additions to generate XForms from an annotated XML Schema. Tanya’s also working on the DryadUK project, which is a data repository supporting publication of data associated with journal articles.

Stephen Wan (visiting from CSIRO) has developed a cool extension for Firefox (and now Chrome) called IBES (In-Browser Elaborative Summariser). If you point it at Wikipedia, for each link you hover over it shows a summary of the page at that link, built intelligently from the link’s context. Then if you navigate to the link, it tells you where you came from. Very handy if (like me) each visit to Wikipedia is a rambling journey, and you often forget why you went there in the first place. He’s also done some related work to help navigate citations in scholarly articles, called CSIBS (The Citation-Sensitive In-Browser Summarizer).

Alex Dutton is working on the JISC Open Citations project. He has some nice visualisations of citation networks (although one of the articles in that graph looks like it cites itself – if only that were possible :). The graphs are generated using dot from RDF representation of metadata from the PubMedCentral Open-Access journal articles. All of the usual dot options are available, so you can play with how the networks get rendered. The whole site is driven by SPARQL, and the bottom of each page shows the SPARQL queries used to generate the page content, so you can see what’s going on under the hood.

Bhavana Ananda is working on the JISC DataFlow project, the DataStage component of which is a follow-on from previous work by Graham Klyne on the Admiral project. I think the philosophy of simple tools to help research groups manage and share their data with each other has a lot of traction, and I think it’s great they’ve got funding to turn the Admiral prototypes into something more.

Graham Klyne is embroiled in the Workflow 4Ever project, and we had a great chat about possible connections with managing our Plasmodium SNP discovery and genotyping pipelines for MalariaGEN. I’m now expecting Graham to solve all my problems.

And David Shotton (group head) is, as always, making it all happen. It was great to raise my head above the trenches for a few hours, I need to do that more often.

Advertisements

I just stumbled upon Brad Chapman’s Blue Collar Bioinformatics blog, it looks like a great resource, here’s a few tidbits…

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. …

Parallel upload to Amazon S3 with python, boto and multiprocessing – One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores. …

Next generation sequencing information management and analysis system for Galaxy – Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy. …

CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at cloudbiolinux.org; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 …

Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo.