Abstract
In genome analysis, common approach to all word methods is use of long words to improve precision in biological findings. However, arbitrary increment in word length cannot always be fruitful, rather causing increase in space-time complexity. We observe that instead of mere increase in length, integration of word intervals along with order and frequency of their occurrence have great impact in extracting sequence information with much smaller word length and devise a method, Dinucleotide Interval Patterns (DIP), for entropy retrieval from ordered sets of dinucleotide intervals. Experiments on natural sequences of Flaviviridae virus with length 9 to 12 kbp establish that only word size of 2bp is capable of deriving precise taxonomic classification of the virus. This is in sharp contrast to standard word-based methods requiring a minimum of 6bp word size to achieve nearly 30% Topological Similarity in comparison to 60% score by DIP with only 2bp.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
In recent years, word-frequency methods become standard tools for analysis and comparison of DNA sequences. Principal advantage of these methods over alignment, viz. NCBI-BLAST family of software [1], is better scaling of computation times with sequence length. In the studies of clustering of taxonomic units, subtyping and likes, standard word frequency methods [2,3,4] have delivered a reliable means to analyze DNA sequences. But their performance falls through in several cases, viz., determination of genomic distance between two DNA sequences, identification of conserved regions like short DNA motifs [5]. Major cause behind the difference is lying with how the method views the sequence. Word-frequency methods completely neglect positional information and interpret a sequence as a bag of words. Limitation is thus non-capturing of entropy variation over biological changes incurred in sequence. In contrast, alignment-based methods consider exact position and quality of similarity of every part of the sequence that reflect compositional structure with higher precision. However, scalability of alignment-based methods is an open challenge. Hence, initiatives on word-based method with new techniques are in the literature for scalable sequence analysis. Efforts are on to improve precision of outcome of word methods through inclusion of more descriptive information about compositional structure of DNA. Composition vector [6] is a way to enrich feature frequency profiles by subtracting the random background from frequency count of each word using a Markov model of order (k-2). that diminish the influence of random neutral mutations. An alternative to word frequency statistics is accounting probabilistic appearance of common words between two sequences, popularly known as D2 statistics. The statistical dependence between adjacent occurrences of common words among sequences is taken care of using a technique called spaced-word [7]. However, use of larger word size in all these methods generates huge number of possible words that significantly increase space time complexity. Hence, reduction in word size without sacrificing the precision is a need. Instead of continually increasing the length of words, the present work focuses on the patterns of their successive appearance over the sequence. It incorporates intervals and orders of their occurrence into the method. This result in extraction of word entropy to such an extent that optimum level of entropy is achieved with only dinucleotide words for taxonomic classification of Flaviviridae Virus (FV).
2 Method
With an alphabet \(\varSigma \) = {A, C, G, T} and L \(\in \) N, let S is a biological sequence of length L over \(\varSigma ^L\). A subsequence of length 2 over the sequence S is designated as a dinucleotide. With four nucleotide symbols in \(\varSigma \), the set of all possible dinucleotides are {AA,AC,AG,AT,CA,CC,CG,CT,TA,TC,TG,TT}. Let the \(j^{th}\)(1\(\le \)j\(\le \)16)dinucleotide \(w^j\) occurs at locations \(l^j_i\), i=1(1)m, m being the maximum number of occurrence of the dinucleotide over the sequence S. We define the set of intervals between each successive locations of the \(j^{th}\) dinucleotide, \(w^j\), as \(D^j=\{d_1^j, d_2^j,\cdots d_m^j\}\): where
To characterize the order of occurrence of intervals of the \(j^{th}\) dinucleotide \(w^j\), we impose the concept of totally ordered set \(T^j\) over the set \(D^j\). Mathematically, a total order set is a set with a relation on the set (called the total order) that satisfies the conditions for partial order with an additional condition called comparability condition. A relation \(\le \) is a total order on a set T if the following properties hold, Reflexivity: a le a for all a \(\in \) T, Antisymmetric: a \(\le \) b and b \(\le \) a implies a=b, Transitivity: a \(\le \) b and b \(\le \) c implies a \(\le \) c, Comparability (trichotomy law): For any a, b \(\in \) T either a \(\le \) b or b \(\le \) a hold. To apply the relation \(\le \) we define the elements of the totally order set \(T^j\) over the set \(D^j\) of intervals of the dinucleotide \(w^j\) as
The totally order set of intervals \(T^j\), of \(j^{th}\) dinucleotide \(w^j\), is independent of other dinucleotides. Given the set \(T^j\) we can obtain information about the occurrence frequency along with distribution pattern and occurrence order of the dinucleotide \(w^j\) over the sequence S. The entropy of a totally order set of intervals can reflect the importance of position (both location and its order of occurrence) of a dinucleotide over the sequence. We construct a discrete probability mass function \(P^j = \{p_1^j, p_2^j, \cdots , p_m^j\}\) where \(p_i^j = t_i^j / \sum _{i=1}^{m} t_i^j\). The Shannon’s entropy \(h^j\) is thus can be defined
Entropies of all the dinucleotides form the representation vector of the given sequence S as in feature space of dinucleotide distribution as H=(\(h^1,h^2,\cdots , h^r\)), r=16. The representation vector H can be used as the transformed presentation for the given sequence S in feature space on which we can apply a distance measures to compute similarity/dissimilarity with other sequences.
2.1 Distance Measure between sequences in feature space
To measure the distance between two biological sequences in feature space (\(H^i\) and \(H^j\)) we use Euclidean distance between given pair of sequences, as follows
3 Experiment
3.1 Vector Identification
Vector identification is primary issue to control epidemic situation owing to viral infection. For instance, infection of Dengue, a species of Flaviviridae, could lead to a disastrous state, had the mosquito Aedes aegypti not been identified as its carrier. Flaviviridae is a family of fast evolving RNA viruses that can infect mammals, birds and invertebrates (ticks and mosquitoes) organisms. For these viruses, transmission mode (vectors) plays a vital role in their life cycle in order to cause infection to respective hosts and thus identification of respective vectors can lead to framing strategies for controlling the virus.
Biological segregation of Flaviviridae into three genera, viz., Flavivirus, Hepacivirus and Pestivirus has carried out on criteria like host-range and outcome of infection. Genus Flavivirus are transmitted to their respective hosts only by different arthropod vectors like mosquitoes, ticks or insects via blood sucking. The second genus, Hepacivirus, infect only mammals mainly transmitted by blood contact, while the genus Pestivirus infects several non-human mammals via oral-fecal or respiratory routes. Another ecological lineage of viruses, NKV (non-known vector), for which no arthropod is yet known as vector, cause infection similar to Flavivirus. Thus, identifying families of these viruses through study of phylogeny will help in identifying possible vectors for unknown cases.
Molecular phylogeny is a technique to group genetically similar sequences. Life cycle of virus is interesting and a motivation for phylogeny study. Interaction between FVs and their vectors is a continuous process involving both immune system of vectors and escape mechanism of viruses. Viruses mimic vector-specific genomic pattern in course of the interaction, subjected to vector-induced pressures [8]. Thus, common patterns, in virus genomes captured by molecular phylogeny, are indication of vectorspecificity in Flaviviridae.
Flavivirus phylogeny
Dataset: Test datasets are compiled with 34 Flaviviridae virus genomes (Table 1) collected from the freely available repository of National Center of Biotechnology Information (https://www.ncbi.nlm.nih.gov/). These virus genomes vary from 9 to 12 kbp in length. Sequences with label D1-D12 are mosquito-borne, D13-D15 are insectborne, D16-D21 are tick-borne, D22-D28 are NKV, D29-D31 are Hepaciviruses and D32-D34 are Pestiviruses.
We apply the proposed method (DIP) to devise genomic distance among the viruses enlisted in Table 1. It transforms each genome sequence into feature space through entropy extraction of 16 dinucleotide (2bp) patterns. DIP next computes the distance matrix, elements of which are simply pair wise distances between all possible genome pairs in feature space. Phylogeny tree (Figure 1) that reflects the taxonomic classification of the enlisted Flaviviridae viruses is computed using the hierarchical clustering algorithms UPGMA.
Cladogram tree of FV (Table 1) using 2bp
Clear segregation of FVs into three known genera, Hepacivirus (D29-S31), Pestivirus (D32-S34) and Flavivirus (D1-S28) based on genomic distances as captured by DIP is observed. Further, insect-borne (D13-S15), tick-borne (D16-S21) and mosquitoborne (D1-S12) viruses are alienated into subgroups under Flaviviruses group. Such distinct groups in the phylogeny patterns drawn by DIP indicates that viruses with common vectors are near classified together, reflecting their vector-specificity. DIP thus offers a way for vector identification under health hazard situation. It is interesting to note in Fig. 1 that viruses GBV-B (D29) and GBV-C (D31) are grouped into the same class with Hepacivirus. This is identical with earlier report [9] from polyprotein-based study. These are accepted candidate of Hepacivirus by the International Committee on Taxonomy of Viruses (ICTV). Moreover, NKV viruses, D22, D23 and D27 are paraphyletic with mosquito-borne viruses, whereas others NKV viruses, D24, D25, D26 and D28, are in a separate subgroup that is paraphyletic with Pestiviruses, indicating blood sucking non-mosquito pests as their possible vectors.
3.2 Performance and scalability
The phylogeny tree of Flaviviridae (Table 1) by Clustal Omega (www.ebi.ac.uk/Tools/msa/clustalo) is taken as reference for comparison with two alignment-free methods, composition vector (tlife.fudan.edu.cn/cvtree) and spaced word (http://spaced.gobics.de/). Topological Similarity (TS) computed following the method in (www.mas.ncl.ac.Uk/ntmwn/compare2trees). Results are reported in Table 2.
Figure 2 presents a comparative study on performance and scalability, showing excellence of DIP in addition to biological consistence of knowledge retrieved. Highest TS score obtained by DIP proves that appropriate statistical feature is more powerful than mere length of word in extracting information from natural sequence.
4 Conclusion
We present a concept of inducting interval and order of occurrence of words with frequency to capture minute changes in nucleotide patterns and develop a method for entropy computation. From patterns of dinucleotides only, the method builds taxonomic classification for Flaviviridae virus with higher precision. Advantages of introducing the concept are twofold: accuracy in result and economy in computation as longer words are not required.
Viruses with common vectors are genetically closer in terms of arrangements of dinucleotides. As the proposed method can classify viruses on the basis of their dinucleotide patterns, genetic closeness of a new virus to a class can lead to tracing its possible vector.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
Sims, G.E., Jun, S.R., Wu, G.A., Kim, S.H.: Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions. Proc. Nat. Acad. Sci. U.S.A. 106(40), 17077–17082 (2009)
Sims, G.E., Kim, S.H.: Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). Proc. Nat. Acad. Sci. U.S.A. 108(20), 832–934 (2011)
Gao, L., Qi, J.: Whole genome molecular phylogeny of large dsDNA viruses using composition vector method. BMC Evol. Biol. 7, 41 (2007)
Alsop, E.B., Raymond, J.: Resolving prokaryotic taxonomy without rRNA: Longer oligonucleotide word lengths improve genome and metagenome taxonomic classication. PLoS ONE 8, e67337 (2013)
Hao, B.L., Qi, J., Wang, B.: Prokaryotic phylogeny based on complete genomes without sequence alignment. Mod. Phys. Lett. B 2, 1–4 (2003)
Leimeister, C.A., Boden, M., Horwege, S., Lindner, S.: Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics 30(14), 1991–1999 (2014)
Lobo, F.P., Mota, B.E.F., Pena, S.D.J., Azevedo, V., Macedo, A.M., et al.: Virus-Host coevolution: common patterns of nucleotide motif usage in flaviviridae and their hosts. PLoS ONE 4(7), e6282 (2009). doi:10.1371/journal.pone.0006282
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Mitra, U., Bhattacharyya, B. (2017). Classification of Vector-Borne Virus Through Totally Ordered Set of Dinucleotide Interval Patterns. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_51
Download citation
DOI: https://doi.org/10.1007/978-3-319-69900-4_51
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4
eBook Packages: Computer ScienceComputer Science (R0)