Keywords

1 Introduction

In recent years, word-frequency methods become standard tools for analysis and comparison of DNA sequences. Principal advantage of these methods over alignment, viz. NCBI-BLAST family of software [1], is better scaling of computation times with sequence length. In the studies of clustering of taxonomic units, subtyping and likes, standard word frequency methods [2,3,4] have delivered a reliable means to analyze DNA sequences. But their performance falls through in several cases, viz., determination of genomic distance between two DNA sequences, identification of conserved regions like short DNA motifs [5]. Major cause behind the difference is lying with how the method views the sequence. Word-frequency methods completely neglect positional information and interpret a sequence as a bag of words. Limitation is thus non-capturing of entropy variation over biological changes incurred in sequence. In contrast, alignment-based methods consider exact position and quality of similarity of every part of the sequence that reflect compositional structure with higher precision. However, scalability of alignment-based methods is an open challenge. Hence, initiatives on word-based method with new techniques are in the literature for scalable sequence analysis. Efforts are on to improve precision of outcome of word methods through inclusion of more descriptive information about compositional structure of DNA. Composition vector [6] is a way to enrich feature frequency profiles by subtracting the random background from frequency count of each word using a Markov model of order (k-2). that diminish the influence of random neutral mutations. An alternative to word frequency statistics is accounting probabilistic appearance of common words between two sequences, popularly known as D2 statistics. The statistical dependence between adjacent occurrences of common words among sequences is taken care of using a technique called spaced-word [7]. However, use of larger word size in all these methods generates huge number of possible words that significantly increase space time complexity. Hence, reduction in word size without sacrificing the precision is a need. Instead of continually increasing the length of words, the present work focuses on the patterns of their successive appearance over the sequence. It incorporates intervals and orders of their occurrence into the method. This result in extraction of word entropy to such an extent that optimum level of entropy is achieved with only dinucleotide words for taxonomic classification of Flaviviridae Virus (FV).

2 Method

With an alphabet \(\varSigma \) = {A, C, G, T} and L \(\in \) N, let S is a biological sequence of length L over \(\varSigma ^L\). A subsequence of length 2 over the sequence S is designated as a dinucleotide. With four nucleotide symbols in \(\varSigma \), the set of all possible dinucleotides are {AA,AC,AG,AT,CA,CC,CG,CT,TA,TC,TG,TT}. Let the \(j^{th}\)(1\(\le \)j\(\le \)16)dinucleotide \(w^j\) occurs at locations \(l^j_i\), i=1(1)m, m being the maximum number of occurrence of the dinucleotide over the sequence S. We define the set of intervals between each successive locations of the \(j^{th}\) dinucleotide, \(w^j\), as \(D^j=\{d_1^j, d_2^j,\cdots d_m^j\}\): where

$$\begin{aligned} d_i^j = l_{i+1}^j - l_{i}^j, i \le m \end{aligned}$$
(1)

To characterize the order of occurrence of intervals of the \(j^{th}\) dinucleotide \(w^j\), we impose the concept of totally ordered set \(T^j\) over the set \(D^j\). Mathematically, a total order set is a set with a relation on the set (called the total order) that satisfies the conditions for partial order with an additional condition called comparability condition. A relation \(\le \) is a total order on a set T if the following properties hold, Reflexivity: a le a for all a \(\in \) T, Antisymmetric: a \(\le \) b and b \(\le \) a implies a=b, Transitivity: a \(\le \) b and b \(\le \) c implies a \(\le \) c, Comparability (trichotomy law): For any a, b \(\in \) T either a \(\le \) b or b \(\le \) a hold. To apply the relation \(\le \) we define the elements of the totally order set \(T^j\) over the set \(D^j\) of intervals of the dinucleotide \(w^j\) as

$$\begin{aligned} t_i^j = \sum _{i=1}^{r} d_i^j, 1 \le r \le m \end{aligned}$$
(2)

The totally order set of intervals \(T^j\), of \(j^{th}\) dinucleotide \(w^j\), is independent of other dinucleotides. Given the set \(T^j\) we can obtain information about the occurrence frequency along with distribution pattern and occurrence order of the dinucleotide \(w^j\) over the sequence S. The entropy of a totally order set of intervals can reflect the importance of position (both location and its order of occurrence) of a dinucleotide over the sequence. We construct a discrete probability mass function \(P^j = \{p_1^j, p_2^j, \cdots , p_m^j\}\) where \(p_i^j = t_i^j / \sum _{i=1}^{m} t_i^j\). The Shannon’s entropy \(h^j\) is thus can be defined

$$\begin{aligned} h^j = - \sum _{i=1}^{m} t_i^j * log_2(t_i^j) \end{aligned}$$
(3)

Entropies of all the dinucleotides form the representation vector of the given sequence S as in feature space of dinucleotide distribution as H=(\(h^1,h^2,\cdots , h^r\)), r=16. The representation vector H can be used as the transformed presentation for the given sequence S in feature space on which we can apply a distance measures to compute similarity/dissimilarity with other sequences.

2.1 Distance Measure between sequences in feature space

To measure the distance between two biological sequences in feature space (\(H^i\) and \(H^j\)) we use Euclidean distance between given pair of sequences, as follows

$$\begin{aligned} \sigma _{ij} = \sqrt{\sum _{i=1}^{r} (h^r_i - h^r_j)^2} \end{aligned}$$
(4)

3 Experiment

3.1 Vector Identification

Vector identification is primary issue to control epidemic situation owing to viral infection. For instance, infection of Dengue, a species of Flaviviridae, could lead to a disastrous state, had the mosquito Aedes aegypti not been identified as its carrier. Flaviviridae is a family of fast evolving RNA viruses that can infect mammals, birds and invertebrates (ticks and mosquitoes) organisms. For these viruses, transmission mode (vectors) plays a vital role in their life cycle in order to cause infection to respective hosts and thus identification of respective vectors can lead to framing strategies for controlling the virus.

Biological segregation of Flaviviridae into three genera, viz., Flavivirus, Hepacivirus and Pestivirus has carried out on criteria like host-range and outcome of infection. Genus Flavivirus are transmitted to their respective hosts only by different arthropod vectors like mosquitoes, ticks or insects via blood sucking. The second genus, Hepacivirus, infect only mammals mainly transmitted by blood contact, while the genus Pestivirus infects several non-human mammals via oral-fecal or respiratory routes. Another ecological lineage of viruses, NKV (non-known vector), for which no arthropod is yet known as vector, cause infection similar to Flavivirus. Thus, identifying families of these viruses through study of phylogeny will help in identifying possible vectors for unknown cases.

Molecular phylogeny is a technique to group genetically similar sequences. Life cycle of virus is interesting and a motivation for phylogeny study. Interaction between FVs and their vectors is a continuous process involving both immune system of vectors and escape mechanism of viruses. Viruses mimic vector-specific genomic pattern in course of the interaction, subjected to vector-induced pressures [8]. Thus, common patterns, in virus genomes captured by molecular phylogeny, are indication of vectorspecificity in Flaviviridae.

Flavivirus phylogeny

Dataset: Test datasets are compiled with 34 Flaviviridae virus genomes (Table 1) collected from the freely available repository of National Center of Biotechnology Information (https://www.ncbi.nlm.nih.gov/). These virus genomes vary from 9 to 12 kbp in length. Sequences with label D1-D12 are mosquito-borne, D13-D15 are insectborne, D16-D21 are tick-borne, D22-D28 are NKV, D29-D31 are Hepaciviruses and D32-D34 are Pestiviruses.

Table 1. Flaviviridae virus genomes

We apply the proposed method (DIP) to devise genomic distance among the viruses enlisted in Table 1. It transforms each genome sequence into feature space through entropy extraction of 16 dinucleotide (2bp) patterns. DIP next computes the distance matrix, elements of which are simply pair wise distances between all possible genome pairs in feature space. Phylogeny tree (Figure 1) that reflects the taxonomic classification of the enlisted Flaviviridae viruses is computed using the hierarchical clustering algorithms UPGMA.

Fig. 1.
figure 1

Cladogram tree of FV (Table 1) using 2bp

Fig. 2.
figure 2

Comparing performance and scalability

Clear segregation of FVs into three known genera, Hepacivirus (D29-S31), Pestivirus (D32-S34) and Flavivirus (D1-S28) based on genomic distances as captured by DIP is observed. Further, insect-borne (D13-S15), tick-borne (D16-S21) and mosquitoborne (D1-S12) viruses are alienated into subgroups under Flaviviruses group. Such distinct groups in the phylogeny patterns drawn by DIP indicates that viruses with common vectors are near classified together, reflecting their vector-specificity. DIP thus offers a way for vector identification under health hazard situation. It is interesting to note in Fig. 1 that viruses GBV-B (D29) and GBV-C (D31) are grouped into the same class with Hepacivirus. This is identical with earlier report [9] from polyprotein-based study. These are accepted candidate of Hepacivirus by the International Committee on Taxonomy of Viruses (ICTV). Moreover, NKV viruses, D22, D23 and D27 are paraphyletic with mosquito-borne viruses, whereas others NKV viruses, D24, D25, D26 and D28, are in a separate subgroup that is paraphyletic with Pestiviruses, indicating blood sucking non-mosquito pests as their possible vectors.

3.2 Performance and scalability

The phylogeny tree of Flaviviridae (Table 1) by Clustal Omega (www.ebi.ac.uk/Tools/msa/clustalo) is taken as reference for comparison with two alignment-free methods, composition vector (tlife.fudan.edu.cn/cvtree) and spaced word (http://spaced.gobics.de/). Topological Similarity (TS) computed following the method in (www.mas.ncl.ac.Uk/ntmwn/compare2trees). Results are reported in Table 2.

Table 2. Comparing the proposed method with existing

Figure 2 presents a comparative study on performance and scalability, showing excellence of DIP in addition to biological consistence of knowledge retrieved. Highest TS score obtained by DIP proves that appropriate statistical feature is more powerful than mere length of word in extracting information from natural sequence.

4 Conclusion

We present a concept of inducting interval and order of occurrence of words with frequency to capture minute changes in nucleotide patterns and develop a method for entropy computation. From patterns of dinucleotides only, the method builds taxonomic classification for Flaviviridae virus with higher precision. Advantages of introducing the concept are twofold: accuracy in result and economy in computation as longer words are not required.

Viruses with common vectors are genetically closer in terms of arrangements of dinucleotides. As the proposed method can classify viruses on the basis of their dinucleotide patterns, genetic closeness of a new virus to a class can lead to tracing its possible vector.