Interpolated Markov Models for DNA Sequence Analysis

Arthur L. Delcher
Professor of Computer Science
Loyola College in Maryland

Research Scientist
Celera Genomics

Steven L. Salzberg
Director of Bioinformatics
The Institute for Genomic Research

Research Professor, Computer Science Department
Johns Hopkins University

Contact Information

Arthur L. Delcher
4501 N. Charles Street
Baltimore, MD 21210-2699
Phone: (410) 617-2740
Fax: (410) 617-2157
Email: Delcher@cs.loyola.edu
URL: http://www.cs.loyola.edu/~delcher

Steven L. Salzberg
9712 Medical Center Drive
Rockville, MD 20850
Phone: (301) 315-2537
Fax: (301) 838-0208
Email: salzberg@tigr.org
URL: http://www.cs.jhu.edu/~salzberg

WWW PAGE

http://www.tigr.org/softlab/glimmer/glimmer.html

List of Supported Students and Staff

Adam Phillippy (aphillip@cs.loyola.edu), undergraduate Computer Science major, Class of 2002

Project Award Information

Keywords

bioinformatics, interpolated Markov models, gene identification, probabilistic modeling, data mining

Project Summary

The large amounts of biological sequence data now being generated by numerous genome sequencing projects demand new analysis tools. The goal of this project is to develop and extend our Glimmer gene-identification system and to advance the state of the art for the underlying Interpolated Markov Model (IMM) technology, which has many potential uses in computational biology. The principal foci of the proposed research include the following:

Publications and Products

The following publication describes our latest improvements to the Glimmer gene-identification system:

Improved Microbial Gene Identification with Glimmer, A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. Nucleic Acids Research, 27:23 (1999), 4636-4641.

The Glimmer system is freely available to the academic community at http://www.tigr.org/softlab/glimmer/glimmer.html

Project Impact

Goals, Objectives, and Targeted Activities

The immediate goal of the project is to improve the performance of the Glimmer system in microbial gene identification. Specifically, we will work to refine our latest version of IMMs, which we call the Interpolated Context Model, and to develop more accurate methods for identifying the gene signals that determine the exact start and stop of genes.

Project References

Microbial gene identification using interpolated Markov models, S. Salzberg, A. Delcher, S. Kasif, and O. White. Nucleic Acids Research, 26:2 (1998), 544-548.

Improved Microbial Gene Identification with Glimmer, A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. Nucleic Acids Research, 27:23 (1999), 4636-4641.

Interpolated Markov models for eukaryotic gene finding, S.L. Salzberg, M. Pertea, A.L. Delcher, M.J. Gardner, and H. Tettelin. Genomics, 59:1 (1999), 24-31.

Area Background

The quantity of DNA sequence data is exploding, and with this growth in information comes a parallel growth in the demand for computational methods for sequence analysis. Our project is interdisciplinary: it combines the latest techniques in data modeling and data mining with knowledge of molecular biology to find genes and other scientifically significant patterns in DNA sequence databases. One of the first and most important steps in the analysis of any genome is the identification of all its genes. Computational methods have already become the most important tool available for this annotation task, and with the rapid scaling up in sequencing efforts, they will become even more critical.

The most reliable way to identify a gene in a new genome is to find a similar sequence from another organism. This can be done today very effectively using programs such as BLAST and FASTA to search all the entries in GenBank. However, many of the genes in new genomes (typically 30-40%) have no significant homology to known genes. For these genes, we must rely on computational methods that score candidate coding regions. Essentially, we construct a probabilistic model of known genes in an organism and then use this model to score other regions of the genome to identify genes. We have developed a system, Glimmer, that uses a technique called interpolated Markov models (IMMs) to score regions in microbial sequences. IMMs are in principle and in practice more powerful than Markov chains, as has been demonstrated in the speech and language research community.
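To make the idea of scoring a region with an interpolated model concrete, the following Python sketch trains context counts on example sequences and blends the predictions of contexts of increasing length. It is an illustration only, not Glimmer's implementation: the function names (train_counts, imm_prob, log_score), the add-one smoothing at order 0, and the simple count-based weight with its threshold parameter are all assumptions made for this example. The weighting scheme described in the Glimmer papers is different and is not reproduced here, and the sketch ignores reading frames entirely.

```python
from collections import defaultdict
import math

ALPHABET = "ACGT"


def train_counts(seqs, max_order):
    """Count each base after every context of length 0..max_order.

    Returns a list where counts[k][context][base] is the number of times
    `base` followed the k-base string `context` in the training sequences.
    """
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for s in seqs:
        for i, base in enumerate(s):
            for k in range(min(i, max_order) + 1):
                counts[k][s[i - k:i]][base] += 1
    return counts


def imm_prob(counts, context, base, threshold=40):
    """Interpolated probability of `base` given the preceding `context`.

    Starts from a smoothed order-0 estimate and blends in each longer
    context. The weight lam = min(1, n / threshold), where n is how often
    the context was seen, is a hypothetical choice for this sketch: rarely
    seen contexts contribute little, well-sampled ones dominate.
    """
    max_order = len(counts) - 1
    zero = counts[0].get("", {})
    total = sum(zero.values())
    p = (zero.get(base, 0) + 1) / (total + len(ALPHABET))  # add-one smoothing
    for k in range(1, min(len(context), max_order) + 1):
        seen = counts[k].get(context[-k:])
        if not seen:
            break  # context never observed; keep the shorter-context estimate
        n = sum(seen.values())
        lam = min(1.0, n / threshold)
        p = lam * seen.get(base, 0) / n + (1.0 - lam) * p
    return p


def log_score(counts, seq):
    """Log-probability of `seq` under the interpolated model."""
    max_order = len(counts) - 1
    return sum(
        math.log(imm_prob(counts, seq[max(0, i - max_order):i], seq[i]))
        for i in range(len(seq))
    )


# Toy usage: one model trained on made-up "coding" sequence and one on
# made-up "background" sequence; a region is called coding-like when the
# coding model gives it the higher log-probability.
coding_model = train_counts(["ATGGCGGCGGCGTAA" * 3], max_order=3)
background_model = train_counts(["ATATATATATATTTAA" * 3], max_order=3)
region = "GCGGCG"
looks_coding = log_score(coding_model, region) > log_score(background_model, region)
```

A fixed-order Markov chain would use only the longest context and would have to fall back on sparse counts when that context is rare. The interpolation lets the model use long contexts where the training data supports them and shorter ones elsewhere, which is the sense in which IMMs generalize Markov chains.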

Area References

Computational Methods in Molecular Biology, edited by Steven Salzberg, David Searls, and Simon Kasif, Elsevier, 1998.

Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS 1997.