Keywords

1 Introduction

Circular RNAs (CircRNAs) are a type of non-coding RNAs with closed loop structures formed by back splicing [1]. Recently, large number of circRNAs are widely found in various livings [2], and they could regulate gene expression at transcriptional or post-transcriptional levels by titrating microRNAs (miRNAs) [3], regulating transcription and splicing [4, 5], even several circRNAs could translate to produce polypeptides [6]. Increasing researches have demonstrated that the mis-regulation of circRNAs may cause abnormal cellular functions and associated with various diseases [7]. Thus, disease-associated circRNAs are becoming a class of promising biomarkers for disease diagnosis and treatment.

However, it is costly and laborious to identify the disease-related circRNAs with biomedical experiments. Recently, several computational approaches have been developed. Lei et al. [8] firstly designed a path weighted approach named PWCDA to predict circRNA-disease associations. Likewise, KATZHCDA [9] is developed based on KATZ model to measure the probability for each pair of circRNA-disease associations, in which the circRNA expression similarity and disease phenotype similarity matrices are used as priori knowledge to establish the circRNA-disease heterogeneous network. DWNN-RLS [10] is designed based on Kronecker regularized least squares to predict the associations between circRNAs and diseases. iCircDA-MF [11] is developed based on non-negative matrix factorization by integrating the circRNA-gene, gene-disease and circRNA-disease relationships. Wang et al. [12] utilized a recommendation algorithm PersonalRank to measure the relevance between circRNAs and diseases based on circRNA expression profiles and functional similarity. Although several methods have developed for the circRNA-disease association prediction, it is still a challenge to obtain sufficiently accurate results.

In this study, we developed a novel framework for forecasting circRNA-disease associations named BWHCDA, which integrated multiple similarity measures and implemented bi-random walk algorithm (Fig. 1). First, circRNA regulatory similarity is effective measured based on circRNAs may play essential roles in regulating miRNA function in disease occurrence and progression. Moreover, combined with Gaussian interaction profiles (GIP) kernel similarity for circRNAs, the integrated circRNA similarity is effectively measured. Similarly, disease similarity is denoted as the average of disease sematic similarity and GIP kernel similarity for diseases. Subsequently, the heterogeneous network is constructed by combing the circRNA network, disease network and circRNA-disease associations. Then, circular bigraph (CBG) patterns are introduced in bi-random walk algorithm to predict the missing associations based on the heterogeneous network. The results show that BWHCDA could be considered as a powerful tool for predicting circRNA-disease associations.

Fig. 1.
figure 1

The flowchart of BWHCDA method.

2 Methods

2.1 Human CircRNA-Disease Associations

The experimentally validated human circRNA-disease associations are extracted from the CircR2Disease database [13]. Then, we choose the associations that circRNAs have been recorded in circBase database [14] and disease name recorded in disease ontology database [15]. Finally, we retained 371 circRNA-disease associations between 325 circRNAs and 53 diseases as the gold standard dataset. The circRNA-disease adjacency matrix A(i,j) is established, if there is an association between circRNA and disease, A(i,j) is set as 1, otherwise 0.

2.2 CircRNA Regulatory Similarity

The miRNA-circRNA interactions are downloaded from the CircBank database [16], and the interactions overlapped with disease-related circRNAs are selected to measure the regulatory similarity of circRNAs. It is measured as follows:

$$ SC\_RG(c_{i} ,c_{j} ) = \frac{{card(M_{i} \cap M_{j} )}}{{\sqrt {card(M_{i} )} \cdot \sqrt {card(M_{j} )} }} $$
(1)

where the set of Mi have relationship with circRNA ci and the set of miRNA Mj have relationship with circRNA cj.

2.3 Disease Semantic Similarity

The disease names are described as hierarchical directed acyclic graph (DAG) based on the Medical Subject Headings (MeSH) descriptions for diseases. And disease semantic similarity is calculated by the DOSE [17] tool with Wang et al. method.

2.4 GIP Kernel Similarity

Based on the assumption that similar circRNAs (diseases) are tend to have similar interaction or non-interaction pattern with diseases (circRNAs) [18], the GIP kernel similarity for circRNAs and diseases are respectively calculated as follows:

$$ \begin{array}{*{20}c} {SC\_cGIP(c(i),c(j)) = exp( - \gamma_{c} \parallel c(i) - c(j)\parallel^{2} )} \\ {\gamma_{c} = \frac{1}{{(\frac{1}{{n_{c} }}\sum\limits_{i = 1}^{{n_{c} }} {\parallel c(i)\parallel^{2} )} }}} \\ \end{array} $$
(2)
$$ \begin{array}{*{20}c} {SD\_dGIP(d(i),d(j)) = exp( - \gamma_{d} \parallel d(i) - d(j)\parallel^{2} )} \\ {\gamma_{d} = \frac{1}{{(\frac{1}{{n_{d} }}\sum\limits_{i = 1}^{{n_{d} }} {\parallel d(i)\parallel^{2} )} }}} \\ \end{array} $$
(3)

where c(i) (or d(i)) denotes the circRNA (disease) interaction profiles, which is the i-th row (column) of the adjacency matrix A. The parameters γc and γd are used to control the kernel bandwidth. nc (or nd) is the number of circRNAs (diseases).

2.5 Integrated Similarity for CircRNAs and Diseases

The new circRNA similarity scores (SC) are calculated with the average scores of the circRNA regulatory similarity and GIP kernel similarity for circRNAs. Similarly, the integrated disease similarity (SD) is denoted as the mean of the disease semantic similarity and GIP kernel similarity for diseases. Then, the integrated circRNA similarity and integrated disease similarity are adjusted with the logistic function [19].

$$ S(x) = \frac{1}{{1 + e^{cx + d} }} $$
(4)

where x is the value of element of matrix SC or SD. Parameters c and d control the adjustment effects, and we set c as −15 and set d as log(9999), respectively.

2.6 The Construction of Heterogeneous Network

According to the circRNA similarity and disease similarity measures, the circRNA network and disease network can be constructed. Next, the weighted heterogeneous circRNA-disease network is constructed based on the circRNA network, disease network via gold standard circRNA-disease associations. The heterogeneous network could be considered as a bipartite graph, the nodes represent circRNAs or diseases, the edges represent three types of interactions of circRNA-circRNA, disease-disease and circRNA-disease.

2.7 BWHCDA Method

Based on the topology and structure characteristics of circRNA network and disease network, the concept of CBG was introduced. A CBG is described as a subgraph of a circRNA path {c1, c2, …, cn} and a disease path {d1, d2, …, dm}, in which the two ends connected by circRNA-disease associations (c1, d1) and (cn, dm). The CBG indicates a vicinity relation between the two association (c1, d1) and (cn, dm), which is generalized by their distance to other associations in the circRNA network and disease network. The length of CBG patterns (l, r) is determined by the longer length of circRNA path and the disease path. In this study, we hypothesize that most potential associations tend to be covered by many shorter CBGs in the unknown circRNA-disease network. If there are more CBG patterns between circRNAs and diseases, the higher possibility of circRNA-disease associations are.

By iteratively adding the circRNA path and disease path, we calculates the CBGs weighted by decay factor α that ranges from 0 to 1. Because of different structures and topologies in the circRNA network and disease network, disparate optimal number of random walk steps are generated. Therefore, parameters l and r are introduced to restrict the number of random steps in circRNA similarity network and disease similarity network, respectively. The iterative process of bi-random walk is described as follows:

$$ {\text{On the circRNA network:}}\,Cc = \alpha \cdot SC_{L} \cdot CD_{t - 1} + (1 - \alpha )A $$
(5)
$$ {\text{On the disease network:}}\,Dd = \alpha \cdot CD_{t - 1} \cdot SD_{L} + (1 - \alpha )A $$
(6)

where α is the decay factor that controls the importance of CBG for different paths, SLL and SDL represent the normalized matrix by using Laplace regularization.

$$ SC_{L} = Dc^{ - 1/2} (S_{c} )Dc^{ - 1/2} $$
(7)
$$ SD_{L} = Dd^{ - 1/2} (S_{d} )Dd^{ - 1/2} $$
(8)

where Dc(i,i) (or Dd(i,i)) is the diagonal matrix of circRNA similarity matrix Sc (Sd).

By combining the propagation scores of matrices Cc and Dd, the relevance scores of unknown circRNA-disease associations could be obtained. The BWHCDA algorithm is outlined as Table 1.

Table 1. The pseudocode of BWHCDA algorithm

3 Results

3.1 Prediction Performance

To assess the performance of BWHCDA method, leave-one-out cross validation (LOOCV) and 5-fold cross validation (10-fold CV) framework are performed on the gold standard datasets. For LOOCV, each known circRNA-disease association is removed in turn as testing sample, and the other associations are regarded as training samples. Then, the unknown circRNA-disease associations are considered as candidate associations, and the prediction performance is assessed by the predicted rank of test sample. In the framework of 10-fold CV, circRNA-disease associations are randomly divided into ten subsets, and each subset is utilized in turn as test set and the remaining as the train set on each time. To decrease the sample division bias, we perform 100 times repetitions of 10-fold CV. The receiver operating characteristic (ROC) curves are plotted to show the prediction performance by calculating the true positive rate (TPR) and false positive rate (FPR). Furthermore, the area under the curves (AUCs) are calculated to evaluate the overall performance.

3.2 Effects of Parameters

There are three parameters in the BWHCDA method, including α, Il, Ir. To test the effects of the three parameters, we set α value as {0.2, 0.4, 0.6, 0.8}, and Il, Ir are set from 1 to 5, respectively. Then, we could calculate AUC values based on LOOCV and the effects of these parameters are shown in Tables 2, 3, 4 and 5. The results indicate that α has little effects on prediction performance. When α = 0.4, Il = 4 and Ir = 5, the AUC value of LOOCV is the highest with step length less than five. When α = 0.4, Il = 3, Ir = 4, AUC value of LOOCV is the highest within step length less than four. The AUC value of LOOCV is the highest within step length than three steps when α = 0.4, Il = 2, Ir = 3. And when α = 0.6, Il = 1, Ir = 2, the AUC value of LOOCV is the highest within two steps. Finally, we set three parameters as α = 0.4, Il = 2, Ir = 3, respectively.

Table 2. When α is set as 0.2, the effect of parameters Il and Ir for LOOCV AUC.
Table 3. When α is set as 0.4, the effect of parameters Il and Ir for LOOCV AUC.
Table 4. When α is set as 0.6, the effect of parameters Il and Ir for LOOCV AUC.
Table 5. When α is set as 0.8, the effect of parameters Il and Ir for LOOCV AUC.

3.3 Comparison with Other Methods

To further evaluate the prediction performance of BWHCDA, we compare it with other five methods including KATZHCDA [9], PageRank [20], NCP [21], BDSILP [22] and HeteSim [23]. Consequently, BWHCDA method achieve the best performance among these six approaches based on AUC values of LOOCV and 10-fold CV with the same datasets (Figs. 2 and 3). Therefore, BWHCDA method is better than other five methods.

Fig. 2.
figure 2

Comparison of BWHCDA and other methods in terms of ROC curves in LOOCV.

Fig. 3.
figure 3

Comparison of BWHCDA and other methods in terms of ROC curves in 10-fold CV.

3.4 Case Studies

To further assess the prediction performance of BWHCDA method, we analyze the predicted hsa_circ_0000519-gastric cancer association. As shown in Fig. 4, hsa_circ_0000519 may interact with miRNAs including hsa-miR-1233, hsa-miR-1258, hsa-miR-1296, hsa-miR-146b-3p, hsa-miR-521 to play their biological roles. The miRNA targets gene of these miRNAs have been validated related with gastric cancer, including hsa-miR-1258 target HPSE, hsa-miR-146b-3p target PER1 and IRAK1, hsa-miR-521 target ERCC8, hsa-miR-1296-5p target ERBB2. In addition, hsa-miR-1296 has been validated associated with gastric cancer. Therefore, hsa_circ_0000519 may be a potential biomarker for gastric diagnosis and prognosis.

Fig. 4.
figure 4

The hsa_circ_0000519-miRNA-mRNA-gastric cancer interaction network.

4 Conclusion

Prioritizing the potential associations between circRNAs and diseases is benefit to the development of the understanding of the disease mechanism, diagnose and treatment for diseases. The reasons that why BWHCDA method has better performance is shown as following aspects. First, bi-random explored the CBG patterns with iteratively implement random walk on the circRNA similarity network and disease similarity network. In addition, BWHCDA is a multi-task learning method that could forecast potential circRNA-disease associations simultaneously rather than mine candidate circRNAs for specific diseases. Therefore, BWHCDA could be an effective method for biomedical research.