Abstract
Classification of functional genomic regions (such as promoters or enhancers) based on sequence data alone is an important problem. Various data mining algorithms can be applied to promoter prediction: decision-tree methods such as Classification And Regression Tree (CART); machine learning algorithms such as Simple Logistic, BayesNet, and Random Forest; and popular deep learning models such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). However, because a large amount of genetic data is unlabeled, these methods cannot solve the problem directly. We therefore present a three-layer dynamic transfer learning language model (TLDTLL) for E. coli promoter classification. TLDTLL is an effective algorithm for inductive transfer learning that pre-trains on large unlabeled genomic corpora. This is particularly advantageous in genomics, where data sets tend to contain significant volumes of unlabeled data. TLDTLL shows improved results over existing methods for classifying E. coli promoters using only sequence data.
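To make the idea of pre-training on unlabeled sequence concrete, the following is a minimal sketch of how raw DNA could be turned into language-model training data. It is not the TLDTLL pipeline: the abstract does not state how sequences are tokenized, so the overlapping k-mer scheme, the function names `kmer_tokenize`, `build_vocab`, and `next_token_pairs`, and the parameters `k` and `stride` are all assumptions made for illustration.

```python
from collections import Counter


def kmer_tokenize(sequence, k=3, stride=1):
    """Split a DNA string into k-mer tokens (assumed scheme, not from the paper)."""
    sequence = sequence.upper()
    # Slide a window of width k across the sequence in steps of `stride`.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(sequences, k=3, stride=1):
    """Map every k-mer seen in the unlabeled corpus to an integer id."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmer_tokenize(seq, k, stride))
    # Sort tokens so that ids are deterministic across runs.
    return {token: idx for idx, token in enumerate(sorted(counts))}


def next_token_pairs(tokens):
    """Pair each token with its successor.

    These pairs are the self-supervised targets a language model can learn
    from, which is why no promoter labels are needed at this stage.
    """
    return list(zip(tokens[:-1], tokens[1:]))


if __name__ == "__main__":
    corpus = ["ACGTAC"]  # hypothetical unlabeled sequence
    tokens = kmer_tokenize(corpus[0], k=3)
    print(tokens)
    print(build_vocab(corpus, k=3))
    print(next_token_pairs(tokens))
```

In a transfer learning setup of the kind the abstract describes, a model trained on such next-token targets would later be fine-tuned on the much smaller set of labeled promoter sequences.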
Acknowledgement
This work was supported by the National Key R&D Program of China (Nos. 2018YFA0902600 and 2018AAA0100100); partly supported by the National Natural Science Foundation of China (Grant nos. 61520106006, 61861146002, 61702371, 61932008, 61732012, 61772370, 61532008, 61672382, 61772357, and 61672203) and the China Postdoctoral Science Foundation (Grant no. 2017M611619); and supported by the “BAGUI Scholar” Program and the Scientific & Technological Base and Talent Special Program, GuiKe AD18126015, of the Guangxi Zhuang Autonomous Region of China, the Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), LCNBI, and ZJLab.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
He, Y. et al. (2020). Three-Layer Dynamic Transfer Learning Language Model for E. Coli Promoter Classification. In: Huang, DS., Jo, KH. (eds) Intelligent Computing Theories and Application. ICIC 2020. Lecture Notes in Computer Science, vol 12464. Springer, Cham. https://doi.org/10.1007/978-3-030-60802-6_7
DOI: https://doi.org/10.1007/978-3-030-60802-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60801-9
Online ISBN: 978-3-030-60802-6
eBook Packages: Computer Science (R0)