BioHPC Next Generation Sequencing Support
We are currently extending BioHPC to support the analysis of
next generation sequencing results. BioHPC now features web and web service interfaces
to the following analysis applications:
FASTX, SamTools, Bowtie, TopHat, Cufflinks, BWA, and RNASeq,
a new RNA-Seq analysis pipeline developed at Cornell. All these applications interact with
a new Next-Gen data management module designed for storage and management of various
(usually large) data files involved in the analysis. The module automatically captures
the Illumina sequencing results as well as files produced by analysis
applications and makes them available for further processing at the BioHPC site without
the need for back-and-forth file transfer between our servers and the users' client machines.
More specifically, the data management module consists of several components:
Run Manager: connects to the sequencing facility and automatically detects finished sequencing runs for which base calling has been completed. It then configures the run in the BioHPC database and sends an invitation to the facility manager to approve the results for distribution to users. Once approved, the results (read files) are asynchronously transferred to the BioHPC file server and catalogued there for further use. When the transfer is complete, all users assigned to the distributed lanes are automatically notified by an e-mail message containing download links.
Lane Browser: allows users to browse their sequencing read files (Illumina lanes) catalogued at BioHPC. The browser displays lane annotation information and allows the file owner to grant additional users access to a file. Read files obtained outside of the Cornell sequencing facility can also be uploaded and catalogued at BioHPC.
File Manager: allows users to upload and manage various files needed in downstream data analysis, such as reference genome files and annotation files. Files may be assigned categories and descriptions, and shared between several users.
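The Run Manager workflow described above can be read as a simple linear state machine. The following is a minimal sketch of that reading in Python, not the actual BioHPC implementation: the state names, the `SequencingRun` class, and its fields are all hypothetical and chosen only to mirror the order of events in the description.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List


class RunState(Enum):
    """Hypothetical lifecycle states, in the order given in the text."""
    DETECTED = auto()           # finished run found, base calling completed
    AWAITING_APPROVAL = auto()  # run configured, invitation sent to facility manager
    TRANSFERRING = auto()       # approved, read files being copied asynchronously
    CATALOGUED = auto()         # read files stored and registered on the file server
    USERS_NOTIFIED = auto()     # e-mail with download links sent to lane users


# Allowed transitions (assumption: the workflow is strictly linear).
NEXT_STATE = {
    RunState.DETECTED: RunState.AWAITING_APPROVAL,
    RunState.AWAITING_APPROVAL: RunState.TRANSFERRING,
    RunState.TRANSFERRING: RunState.CATALOGUED,
    RunState.CATALOGUED: RunState.USERS_NOTIFIED,
}


@dataclass
class SequencingRun:
    run_id: str
    lane_users: Dict[int, List[str]]  # lane number -> users assigned to that lane
    state: RunState = RunState.DETECTED
    history: List[RunState] = field(default_factory=list)

    def advance(self) -> RunState:
        """Move the run to the next lifecycle state, or fail if it is finished."""
        if self.state not in NEXT_STATE:
            raise ValueError(f"run {self.run_id} is already in its final state")
        self.history.append(self.state)
        self.state = NEXT_STATE[self.state]
        return self.state

    def recipients(self) -> List[str]:
        """Every user assigned to at least one lane, without duplicates."""
        return sorted({u for users in self.lane_users.values() for u in users})
```

In this sketch a user assigned to several lanes appears only once in `recipients()`, which matches the idea of one notification per user; whether the real system sends one message per user or one per lane is not stated in the description.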
Besides the data management module,
BioHPC features a Pipeline Manager (currently in beta), which
allows users to streamline their calculations by connecting multiple Next-Gen applications into
analysis pipelines. Each pipeline step is individually configurable using
the web interface page of the corresponding application, with input files selected either from
among the files registered in the data management module or from files expected as output of
previous pipeline steps. The pipeline steps are submitted to our clusters as regular BioHPC jobs
so that standard BioHPC mechanisms can be used for job control and result retrieval.
Users set up and control pipelines through a dedicated
web interface; a web service layer serving the same purpose is also planned.
The web service interface will allow pipelines
to be controlled from any client application, such as the MBF platform,
Illumina Genome Studio, or the Trident scientific workflow workbench.
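To illustrate how pipeline steps can take their inputs either from registered files or from the outputs of earlier steps, here is a minimal Python sketch that derives a valid submission order from those dependencies. It is an assumption about how such a manager could work, not a description of the BioHPC Pipeline Manager itself: the `Step` class, the `submission_order` function, and all file names are hypothetical. The ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Step:
    name: str                 # hypothetical step label
    application: str          # e.g. "TopHat" or "Cufflinks"
    inputs: Tuple[str, ...]   # files the step reads
    outputs: Tuple[str, ...]  # files the step is expected to produce


def submission_order(steps: Iterable[Step], registered: Set[str]) -> List[str]:
    """Return step names in an order that respects file dependencies.

    Each input must be either an output of another step or a file already
    registered in the data management module; otherwise ValueError is raised.
    """
    steps = list(steps)
    # Map each anticipated output file to the step that produces it.
    producer: Dict[str, str] = {out: s.name for s in steps for out in s.outputs}

    graph: Dict[str, Set[str]] = {}
    for s in steps:
        deps: Set[str] = set()
        for f in s.inputs:
            if f in producer:
                deps.add(producer[f])      # wait for the producing step
            elif f not in registered:
                raise ValueError(f"step {s.name!r}: unknown input file {f!r}")
        graph[s.name] = deps

    # static_order() raises graphlib.CycleError if the steps depend on each other.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical two-step pipeline: align reads, then assemble transcripts.
example_steps = [
    Step("quantify", "Cufflinks", ("aligned.bam", "genes.gtf"), ("transcripts.gtf",)),
    Step("align", "TopHat", ("lane1.fastq", "genome.fa"), ("aligned.bam",)),
]
example_registered = {"lane1.fastq", "genome.fa", "genes.gtf"}
print(submission_order(example_steps, example_registered))
```

In the example, `quantify` is listed first but depends on `aligned.bam`, so the sketch places `align` ahead of it. Each name in the resulting order would then correspond to one regular cluster job.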
The new module is currently geared to handle mainly Illumina sequencing results, but extensions are possible.
Below are screenshots showing some aspects of
the next generation sequencing support module.