A high-throughput distributed DNA sequence analysis and database system

Authors

Inman, JT; Flores, HR; May, GD; Weller, JW; Bell, CJ

Citation

J.T. Inman et al., A high-throughput distributed DNA sequence analysis and database system, IBM SYST J, 40(2), 2001, pp. 464-486

Citations number

Subject categories

Computer Science & Engineering

Journal title

IBM SYSTEMS JOURNAL

ISSN journal

0018-8670

Volume

40

Issue

2

Year of publication

2001

Pages

464 - 486

Database

ISI

SICI code

0018-8670(2001)40:2<464:AHDDSA>2.0.ZU;2-W

Abstract

The National Center for Genome Resources (NCGR) has developed a high-throughput DNA (deoxyribonucleic acid) sequence analysis pipeline, which allows researchers at remote sites to submit biological sequence information for rapid analysis, the results of which can be queried through a Web interface. Behind the browser interface is a relational database used to manage both the raw data and the results of the different analyses performed, and a server, which performs those analyses. The system allows multiple contributors to submit data and also allows the data to be marked as "private" or as available to the general public. The CPU-intensive part of the processing is done on a 40-processor domain of a Sun Enterprise 10000 computer, which is represented by a distributed system of software objects, implemented in CORBA (TM) (Common Object Request Broker Architecture (TM)). In this paper we discuss the architecture of the pipeline, the database support, types of DNA sequence analysis used, the distributed analysis system, and the capabilities of the Web interface. As a case study, we present data from an ongoing collaborative project in which expressed sequence tags (ESTs) from Medicago truncatula are being processed. M. truncatula is a plant that is used as a research model for crops in the legume family, an economically important group of food and forage plants.