Interpolated Markov Models for DNA Sequence Analysis
Arthur L. Delcher
Professor of Computer Science
Loyola College in Maryland
Research Scientist
Celera Genomics
Steven L. Salzberg
Director of Bioinformatics
The Institute for Genomics Research
Research Professor, Computer Science Department
Johns Hopkins University
Contact Information
Arthur L. Delcher
4501 N. Charles Street
Baltimore, MD 21210-2699
Phone: (410) 617-2740
Fax: (410) 617-2157
Email: Delcher@cs.loyola.edu
URL: http://www.cs.loyola.edu/~delcher
Steven L. Salzberg
9712 Medical Center Drive
Rockville, MD 20850
Phone: (301) 315-2537
Fax: (301) 838-0208
Email: salzberg@tigr.org
URL: http://www.cs.jhu.edu/~salzberg
WWW PAGE
http://www.tigr.org/softlab/glimmer/glimmer.html
List of Supported Students and Staff
Adam Phillippy (aphillip@cs.loyola.edu), undergraduate Computer Science major, Class of 2002
Project Award Information
Keywords
bioinformatics, interpolated Markov models, gene identification, probabilistic modeling, data mining
Project Summary
The large amounts of biological sequence data now being generated by numerous genome sequencing projects demand new analysis tools. The goal of this project is to develop and extend our Glimmer gene-identification system and to advance the state of the art for the underlying Interpolated Markov Model (IMM) technology, which has many potential uses in computational biology. The principal foci of the proposed research include the following:
Publications and Products
The following publication describes our latest improvements to the Glimmer gene-identification system:
Improved Microbial Gene Identification with Glimmer, A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. Nucleic Acids Research, 27:23 (1999), 4636-4641.
The Glimmer system is freely available to the academic community at http://www.tigr.org/softlab/glimmer/glimmer.html
Project Impact
Goals, Objectives, and Targeted Activities
The immediate goal of the project is to improve the performance of the Glimmer system in microbial gene identification. Specifically, we will work to refine our latest version of IMMs, which we call the Interpolated Context Model, and to develop better methods for identifying the gene signals that determine the exact start and stop positions of genes.
Project References
Microbial gene identification using interpolated Markov models, S. Salzberg, A. Delcher, S. Kasif, and O. White. Nucleic Acids Research, 26:2 (1998), 544-548.
Improved Microbial Gene Identification with Glimmer, A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. Nucleic Acids Research, 27:23 (1999), 4636-4641.
Interpolated Markov models for eukaryotic gene finding, S.L. Salzberg, M. Pertea, A.L. Delcher, M.J. Gardner, and H. Tettelin. Genomics, 59:1 (1999), 24-31.
Area Background
The quantity of DNA sequence data is exploding, and with this growth in information comes a parallel growth in the demand for computational methods for sequence analysis. Our project is interdisciplinary work that combines the latest techniques in data modeling and data mining with knowledge of molecular biology to find genes and other scientifically significant patterns in DNA sequence databases. One of the first and most important steps in the analysis of any genome is the identification of all its genes. Computational methods have already become the most important tool available for this annotation task, and with the rapid scaling up in sequencing efforts, they will become even more critical.
The most reliable way to identify a gene in a new genome is to find a similar sequence from another organism. This can be done today very effectively using programs such as BLAST and FASTA to search all the entries in GenBank. However, many of the genes in new genomes (typically 30-40%) have no significant homology to known genes. To find these genes, we must rely on computational methods that score candidate coding regions directly. Essentially, we construct a probabilistic model of known genes in an organism and then use this model to score other regions of the genome; regions that score well are predicted to be genes. We have developed a system, Glimmer, that uses a technique called interpolated Markov models (IMMs) to score regions in microbial sequences. IMMs are in principle and in practice more powerful than fixed-order Markov chains, as has been demonstrated in the speech and language research community.
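The idea of training a model on known genes and then scoring other regions can be illustrated with a small sketch. The code below is not Glimmer's implementation; it is a simplified interpolated Markov model written for illustration only. The class name SimpleIMM, the parameters max_order and confidence, and the count-based weighting rule (weight = count / (count + confidence)) are all assumptions made for this example. It also assumes sequences contain only the letters A, C, G, and T.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class SimpleIMM:
    """Toy interpolated Markov model (illustrative; not Glimmer's algorithm)."""

    def __init__(self, max_order=3, confidence=10.0):
        self.max_order = max_order      # longest context length used
        self.confidence = confidence    # hypothetical smoothing constant
        # counts[k][context][base] = times `base` followed a length-k context
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        """Collect next-base counts for every context of length 0..max_order."""
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """Blend estimates from short to long contexts.

        Starts from a uniform distribution; each longer context is trusted
        in proportion to how often it was seen in training.
        """
        p = 1.0 / len(BASES)
        longest = min(len(context), self.max_order)
        for k in range(longest + 1):
            ctx = context[len(context) - k:]
            table = self.counts[k].get(ctx)
            if table is None:           # context never seen: keep shorter estimate
                continue
            total = sum(table.values())
            weight = total / (total + self.confidence)
            p = weight * (table.get(base, 0) / total) + (1.0 - weight) * p
        return p

    def log_score(self, seq):
        """Log-probability of `seq` under the model."""
        score = 0.0
        for i, base in enumerate(seq):
            context = seq[max(0, i - self.max_order):i]
            score += math.log(self.prob(context, base))
        return score


if __name__ == "__main__":
    # Made-up training data standing in for known coding sequences.
    model = SimpleIMM(max_order=2)
    model.train(["ATGATGATGATG"])
    uniform = lambda s: len(s) * math.log(0.25)
    for region in ["ATGATG", "CCCCCC"]:
        # Positive log-odds means the region looks more like the training data.
        print(region, model.log_score(region) - uniform(region))
```

In this sketch a region is compared against a uniform background model; a practical gene finder would compare against a model trained on noncoding sequence and would handle reading frames, which the example leaves out.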
Area References
Computational Methods in Molecular Biology, edited by Steven Salzberg, David Searls, and Simon Kasif, Elsevier, 1998.
Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS, 1997.