{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T07:44:51Z","timestamp":1740123891358,"version":"3.37.3"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"4","license":[{"start":{"date-parts":[[2022,3,18]],"date-time":"2022-03-18T00:00:00Z","timestamp":1647561600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,3,18]],"date-time":"2022-03-18T00:00:00Z","timestamp":1647561600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Hochschule Albstadt-Sigmaringen"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Ann Math Artif Intell"],"published-print":{"date-parts":[[2022,4]]},"abstract":"Abstract<\/jats:title>Supervised learning in neural nets means optimizing synaptic weights W<\/jats:bold> such that outputs y<\/jats:bold>(x<\/jats:bold>;W<\/jats:bold>) for inputs x<\/jats:bold> match as closely as possible the corresponding targets t<\/jats:bold> from the training data set. This optimization means minimizing a loss function ${\\mathscr{L}}(\\mathbf {W})$<\/jats:tex-math>\n L<\/mml:mi>\n (<\/mml:mo>\n W<\/mml:mi>\n )<\/mml:mo>\n <\/mml:math><\/jats:alternatives><\/jats:inline-formula> that usually motivates from maximum-likelihood principles, silently making some prior assumptions on the distribution of output errors y<\/jats:bold> \u2212t<\/jats:bold>. While classical crossentropy loss assumes triangular error distributions, it has recently been shown that generalized power error loss functions can be adapted to more realistic error distributions by fitting the exponent q<\/jats:italic> of a power function used for initializing the backpropagation learning algorithm. This approach can significantly improve performance, but computing the loss function requires the antiderivative of the function f<\/jats:italic>(y<\/jats:italic>) := y<\/jats:italic>q<\/jats:italic>\u2212\u20091<\/jats:sup>\/(1 \u2212 y<\/jats:italic>) that has previously been determined only for natural $q\\in \\mathbb {N}$<\/jats:tex-math>\n q<\/mml:mi>\n \u2208<\/mml:mo>\n \u2115<\/mml:mi>\n <\/mml:math><\/jats:alternatives><\/jats:inline-formula>. In this work I extend this approach for rational q<\/jats:italic> = n<\/jats:italic>\/2m<\/jats:italic><\/jats:sup> where the denominator is a power of 2. I give closed-form expressions for the antiderivative ${\\int \\limits } f(y) dy$<\/jats:tex-math>\n \u222b<\/mml:mo>\n f<\/mml:mi>\n (<\/mml:mo>\n y<\/mml:mi>\n )<\/mml:mo>\n d<\/mml:mi>\n y<\/mml:mi>\n <\/mml:math><\/jats:alternatives><\/jats:inline-formula> and the corresponding loss function. The benefits of such an approach are demonstrated by experiments showing that optimal exponents q<\/jats:italic> are often non-natural, and that error exponents q<\/jats:italic> best fitting output error distributions vary continuously during learning, typically decreasing from large q<\/jats:italic> >\u20091 to small q<\/jats:italic> <\u20091 during convergence of learning. These results suggest new adaptive learning methods where loss functions could be continuously adapted to output error distributions during learning.<\/jats:p>","DOI":"10.1007\/s10472-022-09786-2","type":"journal-article","created":{"date-parts":[[2022,3,18]],"date-time":"2022-03-18T20:02:25Z","timestamp":1647633745000},"page":"425-452","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["On the antiderivatives of xp\/(1 \u2212 x) with an application to optimize loss functions for classification with neural networks"],"prefix":"10.1007","volume":"90","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2534-0250","authenticated-orcid":false,"given":"Andreas","family":"Knoblauch","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2022,3,18]]},"reference":[{"key":"9786_CR1","unstructured":"Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al: Tensorflow: A system for large-scale machine learning. In: 12Th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265\u2013283 (2016)"},{"issue":"2","key":"9786_CR2","first-page":"11","volume":"17","author":"R AlAhmad","year":"2020","unstructured":"AlAhmad, R., Almefleh, H.: Antiderivatives and integrals involving incomplete beta functions with applications. Aust. J. Math. Anal. Appl. 17(2), 11 (2020)","journal-title":"Aust. J. Math. Anal. Appl."},{"issue":"2","key":"9786_CR3","doi-asserted-by":"publisher","first-page":"157","DOI":"10.1109\/72.279181","volume":"5","author":"Y Bengio","year":"1994","unstructured":"Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157\u2013166 (1994)","journal-title":"IEEE Trans. Neural Netw."},{"key":"9786_CR4","volume-title":"Pattern Recognition and Machine Learning","author":"C Bishop","year":"2006","unstructured":"Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)"},{"key":"9786_CR5","volume-title":"Applied Optimal Control: Optimization, Estimation, and Control","author":"AE Bryson","year":"1969","unstructured":"Bryson, A.E., Ho, Y.C.: Applied Optimal Control: Optimization, Estimation, and Control. Blaisdell, New York (1969)"},{"key":"9786_CR6","unstructured":"Chollet, F.: Keras. https:\/\/github.com\/fchollet\/keras (2015)"},{"key":"9786_CR7","doi-asserted-by":"publisher","DOI":"10.1002\/0471200611","volume-title":"Elements of Information Theory","author":"T Cover","year":"1991","unstructured":"Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (1991)"},{"key":"9786_CR8","doi-asserted-by":"publisher","first-page":"450","DOI":"10.1553\/etna_vol48s450","volume":"48","author":"C Ferreira","year":"2018","unstructured":"Ferreira, C., Lopez, J., Perez Sinusia, E.: Uniform representations of the incomplete beta function in terms of elementary functions. Electron. Trans. Numer. Anal. 48, 450\u2013461 (2018)","journal-title":"Electron. Trans. Numer. Anal."},{"issue":"10","key":"9786_CR9","doi-asserted-by":"publisher","first-page":"2451","DOI":"10.1162\/089976600300015015","volume":"12","author":"F Gers","year":"2000","unstructured":"Gers, F., Schmidhuber, J., Cummins, F.: Learning to forget: Continual prediction with LSTM. Neural Comput. 12(10), 2451\u20132471 (2000)","journal-title":"Neural Comput."},{"key":"9786_CR10","unstructured":"Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Teh, Y., Titterington, M. (eds.) Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 9, pp 249\u2013256. JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy (2010)"},{"issue":"3","key":"9786_CR11","first-page":"200","volume":"103","author":"I Good","year":"1956","unstructured":"Good, I.: Some terminology and notation in information theory. Proceedings of the IEE - Part C: Monographs 103(3), 200\u2013204 (1956)","journal-title":"Proceedings of the IEE - Part C: Monographs"},{"key":"9786_CR12","unstructured":"Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press. http:\/\/www.deeplearningbook.org (2016)"},{"key":"9786_CR13","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015)","DOI":"10.1109\/ICCV.2015.123"},{"key":"9786_CR14","doi-asserted-by":"publisher","unstructured":"He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770\u2013778. IEEE Computer Society. https:\/\/doi.org\/10.1109\/CVPR.2016.90 (2016)","DOI":"10.1109\/CVPR.2016.90"},{"issue":"8","key":"9786_CR15","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735\u20131780 (1997)","journal-title":"Neural Comput."},{"key":"9786_CR16","unstructured":"Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 37, pp 448\u2013456. PMLR, Lille, France (2015)"},{"key":"9786_CR17","doi-asserted-by":"crossref","unstructured":"Janocha, K., Czarnecki, W.: On loss functions for deep neural networks in classification. arXiv:1702.05659 (2017)","DOI":"10.4467\/20838476SI.16.004.6185"},{"key":"9786_CR18","unstructured":"Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd Proceedings of the International Conference on Learning Representations (ICLR), arXiv:1412.6980v9 (2015)"},{"issue":"8","key":"9786_CR19","doi-asserted-by":"publisher","first-page":"2193","DOI":"10.1162\/neco_a_01407","volume":"33","author":"A Knoblauch","year":"2021","unstructured":"Knoblauch, A.: Power function error initialization can improve convergence of backpropagation learning in neural networks for classification. Neural Comput. 33(8), 2193\u20132225 (2021)","journal-title":"Neural Comput."},{"key":"9786_CR20","unstructured":"Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., Department of Computer Science University of Toronto (2009)"},{"key":"9786_CR21","unstructured":"Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems. https:\/\/proceedings.neurips.cc\/paper\/2012\/file\/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf, vol. 25. Curran Associates, Inc. (2012)"},{"key":"9786_CR22","unstructured":"Krotov, D., Hopfield, J.: Dense associative memory for pattern recognition. arXiv:1606.01164 (2016)"},{"issue":"2","key":"9786_CR23","doi-asserted-by":"publisher","first-page":"146","DOI":"10.1007\/BF01931367","volume":"16","author":"S Linnainmaa","year":"1976","unstructured":"Linnainmaa, S.: Taylor expansion of the accumulated rounding error. BIT Numer. Math. 16(2), 146\u2013160 (1976)","journal-title":"BIT Numer. Math."},{"key":"9786_CR24","unstructured":"Mathematika: Version 12.3.1. Wolfram Research, Inc., Champaign, IL (2021). https:\/\/www.wolfram.com\/mathematica"},{"key":"9786_CR25","unstructured":"MATLAB: version 9.7.0.1247435 (R2019b). The MathWorks Inc., Natick, Massachusetts (2019)"},{"key":"9786_CR26","unstructured":"Maxima: Maxima, a computer algebra system. version 5.43.2 (2020). http:\/\/maxima.sourceforge.net\/. See also www.integral-calculator.com"},{"key":"9786_CR27","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-29075-6","volume-title":"Novelty, Information and Surprise","author":"G Palm","year":"2012","unstructured":"Palm, G.: Novelty, Information and Surprise. Springer, Berlin (2012)"},{"key":"9786_CR28","unstructured":"Parker, D.: Learning-logic: casting the cortex of the human brain in silicon. Tech. Rep. Tr-47, Center for Computational Research in Economics and Management Science. MIT Cambridge MA (1985)"},{"key":"9786_CR29","unstructured":"Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d\u2019Alch\u00e9Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32. http:\/\/papers.neurips.cc\/paper\/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf, pp 8024\u20138035. Curran Associates, Inc. (2019)"},{"key":"9786_CR30","doi-asserted-by":"crossref","unstructured":"Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, vol. 9351, pp 234\u2013241. Springer, Berlin (2015)","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"9786_CR31","unstructured":"Ruder, S.: An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2016)"},{"issue":"6088","key":"9786_CR32","doi-asserted-by":"publisher","first-page":"533","DOI":"10.1038\/323533a0","volume":"323","author":"D Rumelhart","year":"1986","unstructured":"Rumelhart, D., Hinton, G., Williams, R.: Learning representations by back-propagating errors. Nature 323(6088), 533\u2013536 (1986)","journal-title":"Nature"},{"key":"9786_CR33","doi-asserted-by":"crossref","unstructured":"Rumelhart, D., McClelland, J., Group, P.R. (eds.): Parallel Distributed Processing, Explorations in the Microstructure of Cognition, vol. 1. Foundations. MIT Press, Cambridge (1986)","DOI":"10.7551\/mitpress\/5236.001.0001"},{"key":"9786_CR34","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1016\/j.neunet.2014.09.003","volume":"61","author":"J Schmidhuber","year":"2015","unstructured":"Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85\u2013117 (2015)","journal-title":"Neural Netw."},{"key":"9786_CR35","unstructured":"Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana\/Chicago (1949)"},{"key":"9786_CR36","volume-title":"The Scientist and Engineer\u2019s Guide to Digital Signal Processing","author":"S Smith","year":"1997","unstructured":"Smith, S.: The Scientist and Engineer\u2019s Guide to Digital Signal Processing. California Technical Publishing, San Diego (1997)"},{"issue":"56","key":"9786_CR37","first-page":"1929","volume":"15","author":"N Srivastava","year":"2014","unstructured":"Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929\u20131958 (2014)","journal-title":"J. Mach. Learn. Res."},{"key":"9786_CR38","unstructured":"Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp 6105\u20136114. PMLR (2019)"},{"issue":"132","key":"9786_CR39","doi-asserted-by":"publisher","first-page":"1109","DOI":"10.1090\/S0025-5718-1975-0387674-2","volume":"29","author":"N Tenne","year":"1975","unstructured":"Tenne, N.: Uniform asymptotic expansions of the incomplete gamma functions and the incomplete beta function. Math. Comput. 29(132), 1109\u20131114 (1975)","journal-title":"Math. Comput."},{"key":"9786_CR40","unstructured":"Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University (1974)"}],"container-title":["Annals of Mathematics and Artificial Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10472-022-09786-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10472-022-09786-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10472-022-09786-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,4,4]],"date-time":"2022-04-04T02:08:29Z","timestamp":1649038109000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10472-022-09786-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,18]]},"references-count":40,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,4]]}},"alternative-id":["9786"],"URL":"https:\/\/doi.org\/10.1007\/s10472-022-09786-2","relation":{},"ISSN":["1012-2443","1573-7470"],"issn-type":[{"type":"print","value":"1012-2443"},{"type":"electronic","value":"1573-7470"}],"subject":[],"published":{"date-parts":[[2022,3,18]]},"assertion":[{"value":"10 January 2022","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 March 2022","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The author declares that he has no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of Interests"}}]}}