Posts Tagged ‘aws’

I just stumbled upon Brad Chapman’s Blue Collar Bioinformatics blog. It looks like a great resource; here are a few tidbits…

Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog – Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions. …

Parallel upload to Amazon S3 with python, boto and multiprocessing – One challenge with moving analysis pipelines to cloud resources like Amazon EC2 is figuring out the logistics of transferring files. Biological data is big; with the rapid adoption of new machines like the HiSeq and decreasing sequencing costs, the data transfer question isn’t going away soon. The use of Amazon in bioinformatics was brought up during a recent discussion on the BioStar question answer site. Deepak’s answer highlighted the role of parallelizing uploads and downloads to ease this transfer burden. Here I describe a method to improve upload speed by splitting over multiple processing cores. …
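To make the splitting idea concrete, here is a minimal sketch of my own, not the code from Brad's post. It cuts a file into byte ranges and hands each range to a worker process with Python's multiprocessing module. The names chunk_ranges, upload_part and parallel_upload are hypothetical, and upload_part only reads its slice of the file; a real version would send those bytes to S3 at that point (for example through boto's multipart upload support), which is left out here.

```python
import multiprocessing
import os


def chunk_ranges(total_size, chunk_size):
    """Return (part_number, start, end) byte ranges covering the whole file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(i + 1, start, min(start + chunk_size, total_size))
            for i, start in enumerate(range(0, total_size, chunk_size))]


def upload_part(job):
    """Hypothetical stand-in for a per-part upload: reads one slice of the file."""
    path, part_number, start, end = job
    with open(path, "rb") as handle:
        handle.seek(start)
        data = handle.read(end - start)
    # Assumption: a real implementation would transfer `data` to S3 here.
    return part_number, len(data)


def parallel_upload(path, chunk_size, processes=4):
    """Process every part of `path` in a pool of worker processes."""
    total_size = os.path.getsize(path)
    jobs = [(path, number, start, end)
            for number, start, end in chunk_ranges(total_size, chunk_size)]
    with multiprocessing.Pool(processes) as pool:
        # Sort by part number so results are in file order.
        return sorted(pool.map(upload_part, jobs))
```

Calling parallel_upload("reads.fastq", 50 * 1024 * 1024) from under an if __name__ == "__main__": guard would return one (part_number, bytes_read) pair per 50 MB slice; the file name and chunk size are placeholders.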

Next generation sequencing information management and analysis system for Galaxy – Next generation sequencing technologies like Illumina, SOLiD and 454 have provided core facilities with the ability to produce large amounts of sequence data. Along with this increased output comes the challenge of managing requests and samples, tracking sequencing runs, and automating downstream analyses. Our group at Massachusetts General Hospital approached these challenges by developing a sample submission and tracking interface on top of the web-based Galaxy data integration platform. It provides a front end for biologists to enter their sample details and monitor the status of a project. For lab technicians doing the sample preparation and sequencing work, the system tracks sample states via a set of progressive queues providing data entry points at each step of the process. On the back end, an automated analysis pipeline processes data as it arrives off the sequencer, uploading the results back into Galaxy. …

CloudBioLinux: progress on bioinformatics cloud images and data – My last post introduced a framework for building bioinformatics cloud images, which makes it easy to do biological computing work using Amazon EC2 and other on-demand computing providers. Since that initial announcement we’ve had amazing interest from the community and made great progress with: A permanent web site at cloudbiolinux.org; Additional software and genomic data; New user documentation; A community coding session: Codefest 2010 …

Evaluating key-value and document stores for short read data – Designing responsive web interfaces for analyzing short read data requires techniques to rapidly retrieve and display all details associated with a read. My own work on this has been relying heavily on Berkeley DB key/value databases. For example, an analysis will have key/value stores relating the read to aligned positions in the genome, counts of reads found in a sequencing run, and other associated metadata. A recent post by Pierre on storing SNPs in CouchDB encouraged me to evaluate my choice of Berkeley DB for storage. My goals were to move to a network accessible store, and to potentially incorporate the advanced query features associated with document oriented databases. … I decided to do an in-depth evaluation of three stores: Tokyo Cabinet, and its network server Tokyo Tyrant, using the pytyrant library; CouchDB, using the couchdb-python library; MongoDB, using pymongo.

Update: I’ve now rolled this blog post into the Ubuntu wiki’s EC2 Starters Guide page. Hopefully this helps out the Ubuntu community!

The informatics team here at MalariaGEN have been working with EC2 since before I joined them. So naturally, it’s one technology with which I’ve had to come to grips in the course of doing my job. For me, EC2 had a fairly steep learning curve, and after spending a while trying to learn it through doing, I decided that I would just have to spend some time getting properly to grips with how things worked. As part of that I decided to document it in a way that I’d not yet seen on the web: logically, comprehensively, explaining all the strange concepts and quirks that were clouding my understanding and stopping me from getting my job done efficiently. (more…)