As sequenced genomes become larger and sequencing becomes faster, there is a need for tools that analyze sequences at whole-genome scale. Traditional methods do not apply to sets of whole genome sequences, since an individual genome ranges from several million to hundreds of billions of base pairs. To manipulate such large sequence data effectively, an indexed data structure for external memory is necessary. This thesis presents a benchmark work: an educational tool named AutoCluster for the analysis and visualization of whole genome sequences using the k-mer technique and the CGR view.
The work consists of two parts: the data analysis subsystem and the visualization subsystem. The data analysis subsystem supports operations such as pattern matching, k-occurrence, and k-mer analysis. The visualization subsystem helps biologists and bioinformaticians understand whole genome structure and features through a sequence viewer, an annotation viewer, a CGR (Chaos Game Representation) viewer, and a k-mer viewer. The system also provides a user-friendly programming interface for batch processing and for extension to a user's specific purpose. An educational tool of this kind can help identify conserved genes or sequences through analysis of common k-mers and annotations. We analyze the common k-mers of the Archaea, virus, and bacteria reference genomes published by NCBI. Finally, a review of some of the previous work listed indicates that many common k-mers occur in conserved regions such as CDS, rRNA, and tRNA. In this study we focus only on DNA sequences and genomic strands.
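The common k-mer analysis described above can be illustrated with a minimal in-memory sketch. It is not the AutoCluster implementation, which relies on an indexed external-memory data structure; the function names `kmer_counts` and `common_kmers` are hypothetical and chosen only for this example, and the sketch assumes sequences short enough to fit in memory.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq.

    Windows containing symbols other than A, C, G, T (for example N)
    are skipped. Hypothetical helper, for illustration only.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= DNA_ALPHABET:
            counts[kmer] += 1
    return counts


def common_kmers(seqs, k):
    """Return the set of k-mers that occur in every sequence of seqs."""
    kmer_sets = [set(kmer_counts(s, k)) for s in seqs]
    return set.intersection(*kmer_sets) if kmer_sets else set()


if __name__ == "__main__":
    # Two toy sequences standing in for genomes.
    print(sorted(common_kmers(["ACGTAC", "TTACGT"], 3)))
```

In a whole-genome setting, the shared k-mers found this way would then be compared against annotation records to see whether they fall in regions such as CDS, rRNA, or tRNA.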
This feature of CGR has led to tools for visual data analytics, and AutoCluster is one of them. The CGR is a representation of all possible sequences of any length in a continuous space. It can be considered a generalization of a Markov model HF and D. [2021], a stochastic method for randomly changing systems which assumes that future states depend only on the current state and not on the states that preceded it. Such models show all the possible states, as well as the transitions, the transition rates, and the probabilities between them.
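To make the idea of mapping sequences into a continuous space concrete, the following is a minimal sketch of the standard chaos game iteration: starting from the centre of the unit square, each nucleotide moves the current point halfway toward that nucleotide's corner. The corner assignment (A bottom-left, C top-left, G top-right, T bottom-right) is one common convention and is an assumption here, as is the function name `cgr_points`; neither is taken from AutoCluster.

```python
# Corner assignment is an assumption (a commonly used convention),
# not necessarily the one used by AutoCluster.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq):
    """Return the list of CGR coordinates for a DNA sequence.

    Each point is the midpoint between the previous point and the
    corner of the current nucleotide. Symbols other than A, C, G, T
    are skipped. Hypothetical helper, for illustration only.
    """
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACG"):
        print(point)
```

Because every step halves the distance to a corner, sequences that end in the same k nucleotides land in the same sub-square of side 2^-k, which is why point density in a CGR image reflects k-mer frequency.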