A Proposed Paradigm for Expressed Sequence Tags Data Format – An Application of Hidden Markov Models

Yen-I Chiang; Guan-I Wu

doi:10.6180/jase.2009.12.3.10

Computer Science and Information Engineering

Yen-I Chiang¹,² and Guan-I Wu¹

¹Department of Information Management, Chang Gung University, Taoyuan, Taiwan, R.O.C.
²Bioinformatics Center, Chang Gung University, Taoyuan, Taiwan, R.O.C.

Received: February 25, 2008
Accepted: April 25, 2009
Publication Date: September 1, 2009

ABSTRACT

In the post-Human Genome Project era, research emphasis has shifted from mapping the human genome to discovering correlations between genetic markers and clinical phenotypes, and finding effective treatments for disease has become a crucial and attainable goal. Expressed Sequence Tag (EST) data played an important role in the completion of human genome sequencing and are widely used for gene discovery, polymorphism analysis, expression studies, and gene prediction. However, owing to the chemical properties and manufacturing processes involved, EST data may contain errors, which can mislead bioinformatics researchers who attempt to use EST libraries to identify Single Nucleotide Polymorphisms (SNPs). This study therefore proposes a paradigm for the EST data format with which users can better address this issue and identify SNPs correctly.

Keywords: Expressed Sequence Tags, Hidden Markov Models, Base-Calling, Electropherogram
