Abstract
We establish a deep learning theory for distribution regression with deep convolutional neural networks (DCNNs). Deep learning based on structured deep neural networks has proved powerful in practical applications, and generalization analysis for regression with DCNNs has been carried out only very recently. However, for the distribution regression problem, in which the input variables are probability measures, there has been no mathematical model or theoretical analysis for DCNN-based learning. One of the difficulties is that the classical neural network structure requires the input variable to be a Euclidean vector, so when the input samples are probability distributions, the traditional network structure cannot be used directly, and a well-defined DCNN framework for distribution regression is desirable. In this paper, we overcome this difficulty and establish a novel DCNN-based learning theory for a two-stage distribution regression model. First, we realize an approximation theory for functionals defined on the set of Borel probability measures with the proposed DCNN framework. Then, we show that the hypothesis space is well defined by rigorously proving its compactness. Furthermore, in the hypothesis space induced by the general DCNN framework with distribution inputs, by using a two-stage error decomposition technique, we derive a novel DCNN-based two-stage oracle inequality and optimal learning rates (up to a logarithmic factor) for the proposed distribution regression algorithm.
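To fix ideas, the two-stage sampling scheme underlying distribution regression can be summarized as follows; this schematic is added for readers new to the setting, and the notation is generic rather than that of the paper. In the first stage a sample of measure-label pairs is drawn, while in the second stage only finitely many points from each measure are observed:
\[ D=\{(\mu _i,y_i)\}_{i=1}^{m},\qquad \widehat{D}=\Big \{\big (\{x_{i,j}\}_{j=1}^{n},\,y_i\big )\Big \}_{i=1}^{m},\quad x_{i,j}\sim \mu _i \ \text {i.i.d.}, \]
and the goal is to learn a functional \(f:\mathcal {P}(\Omega )\rightarrow \mathbb {R}\) predicting \(y\) from the empirical information about \(\mu \) contained in \(\widehat{D}\).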
References
Barron, A.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993)
Bauer, F., Pereverzev, S., Rosasco, L.: On regularization algorithms in learning theory. J. Complex. 23(1), 52–72 (2007)
Stone, C.J.: Optimal global rates of convergence for nonparametric regression. Ann. Stat. 10(4), 1040–1053 (1982)
Christmann, A., Steinwart, I.: Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli 13(3), 799–819 (2007)
Chui, C.K., Li, X., Mhaskar, H.N.: Neural networks for localized approximation. Math. Comput. 63(208), 607–623 (1994)
Chui, C.K., Lin, S.B., Zhou, D.X.: Deep neural networks for rotation-invariance approximation and learning. Anal. Appl. 17, 737–772 (2019)
Cucker, F., Smale, S.: On the mathematical foundations of learning. Bulletin Amer. Math. Soc. 39, 1–49 (2002)
Cucker, F., Zhou, D.X.: Learning theory: an approximation theory viewpoint. Cambridge University Press, Cambridge (2007)
DiBenedetto, E.: Real analysis. Birkhäuser, Boston (2002)
Dong, S.N., Sun, W.C.: Distributed learning and distribution regression of coefficient regularization. J. Approx. Theory 263, 105523 (2021)
Dong, S.N., Sun, W.C.: Learning rate of distribution regression with dependent samples. J. Complex. 101679 (2022)
Fang, Z.Y., Guo, Z.C., Zhou, D.X.: Optimal learning rates for distribution regression. J. Complex. 56, 101426 (2020)
Feng, Y.L., Huang, X.L., Shi, L., Yang, Y.N., Suykens, J.A.K.: Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 16(30), 993–1034 (2015)
Guo, Z.C., Shi, L.: Optimal rates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal. 47(3), 662–701 (2019)
Guo, Z.C., Lin, S.B., Shi, L.: Distributed learning with multi-penalty regularization. Appl. Comput. Harmon. Anal. 46(3), 478–499 (2019)
Hayakawa, S., Suzuki, T.: On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. Neural Netw. 123, 343–361 (2020)
Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989)
Klusowski, J.M., Barron, A.R.: Approximation by combinations of ReLU and squared ReLU ridge functions with \(\ell ^ 1\) and \(\ell ^ 0\) controls. IEEE Trans. Inf. Theory 64(12), 7649–7656 (2018)
Lin, Y.V., Pinkus, A.: Fundamentality of ridge functions. J. Approx. Theory 75, 295–311 (1993)
Maiorov, V.E.: On best approximation by ridge functions. J. Approx. Theory 99, 68–94 (1999)
Mao, T., Shi, Z.J., Zhou, D.X.: Theory of deep convolutional neural networks III: approximating radial functions. Neural Netw. 144, 778–790 (2021)
Mücke, N.: Stochastic gradient descent meets distribution regression. In: International Conference on Artificial Intelligence and Statistics, pp. 2143–2151. PMLR (2021)
Mhaskar, H.N.: Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1, 61–80 (1993)
Mhaskar, H.N.: Dimension independent bounds for general shallow networks. Neural Netw. 123, 142–152 (2020)
Póczos, B., Singh, A., Rinaldo, A., Wasserman, L.: Distribution-free distribution regression. In: Artificial Intelligence and Statistics, pp. 507–515. PMLR (2013)
Steinwart, I., Christmann, A.: Support vector machines. Springer Science & Business Media (2008)
Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48(4), 1875–1897 (2020)
Smale, S., Zhou, D.X.: Shannon sampling and function reconstruction from point values. Bull. Am. Math. Soc. 41(3), 279–305 (2004)
Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26(2), 153–172 (2007)
Szabó, Z., Gretton, A., Póczos, B., Sriperumbudur, B.: Two-stage sampled learning theory on distributions. In: Artificial Intelligence and Statistics, pp. 948–957. PMLR (2015)
Szabó, Z., Sriperumbudur, B.K., Póczos, B., Gretton, A.: Learning theory for distribution regression. J. Mach. Learn. Res. 17, 5272–5311 (2016)
Villani, C.: Optimal transport: old and new. Grundlehren der mathematischen Wissenschaften, vol. 338. Springer, Berlin (2009)
Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)
Yu, Z., Ho, D.W.C., Shi, Z.J., Zhou, D.X.: Robust kernel-based distribution regression. Inverse Problems 37(10), 105014 (2021)
Zhao, P., Zhou, Z.H.: Label distribution learning by optimal transport. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1 (2018)
Zhou, D.X.: Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 48(2), 787–794 (2020)
Zhou, D.X.: Deep distributed convolutional neural networks: universality. Anal. Appl. 16, 895–919 (2018)
Zhou, D.X.: Theory of deep convolutional neural networks: Downsampling. Neural Netw. 124, 319–327 (2020)
Zweig, A., Bruna, J.: A functional perspective on learning symmetric functions with neural networks. In: International Conference on Machine Learning, pp. 13023–13032. PMLR (2021)
Acknowledgements
The first version of the paper was written when the second author worked at City University of Hong Kong, supported partially by the Laboratory for AI-Powered Financial Technologies, the Research Grants Council of Hong Kong [Projects #C1013-21GF and #11308121], the Germany/Hong Kong Joint Research Scheme [Project No. G-CityU101/20], the CityU Strategic Interdisciplinary Research Grant [Project No. 7020010], the National Science Foundation of China [Project No. 12061160462], and the Hong Kong Institute for Data Science. The first author would like to thank Zhongjie Shi for helpful communications and discussions on related topics. The authors would like to thank the anonymous referee for their careful review, which helped improve the quality of the paper.
Ethics declarations
Conflicts of interest statement
The authors declare no competing interests.
Additional information
Communicated by: Siddhartha Mishra
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 The sequence factorization lemma
Lemma 1
([36]) Let \(s\ge 2\) and \(W=(W_k)_{k=-\infty }^{\infty }\) be a sequence supported in \(\{0,1,\ldots ,\mathcal {M}\}\) with \(\mathcal {M}\ge 0\). Then there exists a finite sequence of filters \(\{\omega ^{(j)}\}_{j=1}^p\), each supported in \(\{0,1,\ldots ,s\}\), with \(p\le \lceil \frac{\mathcal {M}}{s-1}\rceil \), such that the convolutional factorization \(W=\omega ^{(p)}*\omega ^{(p-1)}*\cdots *\omega ^{(1)}\) holds true, where \(*\) denotes the convolution of sequences.
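As a concrete illustration (a worked example added here for the reader; it is not taken from [36]): take \(s=2\) and \(W\) supported in \(\{0,1,\ldots ,4\}\) with symbol \(\widetilde{W}(z)=\sum _{k=0}^{4}W_k z^k=z^4-1\), so that \(\mathcal {M}=4\) and \(\lceil \mathcal {M}/(s-1)\rceil =4\). Since
\[ z^4-1=(z^2+1)(z+1)(z-1), \]
the three filters with symbols \(z^2+1\), \(z+1\) and \(z-1\), namely \(\omega ^{(3)}=(1,0,1)\), \(\omega ^{(2)}=(1,1)\) and \(\omega ^{(1)}=(-1,1)\) (coefficients listed from index \(0\)), are each supported in \(\{0,1,2\}\) and satisfy \(W=\omega ^{(3)}*\omega ^{(2)}*\omega ^{(1)}\) with \(p=3\le 4\).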
1.2 The Cauchy bound of polynomial roots and Vieta’s formula for polynomial coefficients
Lemma 2
If \(w=\left\{ w_{j}\right\} _{j \in \mathbb {Z}}\) is a real sequence supported in \(\{0, \ldots , K\}\) with \(w_{K}=1\), then all the complex roots of its symbol \(\widetilde{w}(z)=\sum _{j=0}^{K} w_{j} z^{j}\) are located in the disk of radius \(1+\max _{j=0, \ldots , K-1}\left| w_{j}\right| \), the Cauchy bound of \(\widetilde{w}\). If we factorize \(\widetilde{w}\) into monic polynomials of degree at most s, then all the coefficients of these factor polynomials are bounded by \(s^{\frac{s}{2}}\left( 1+\max _{j=0, \ldots , K-1}\left| w_{j}\right| \right) ^{s} \le \) \(s^{\frac{s}{2}}\left( 1+\Vert w\Vert _{\infty }\right) ^{s}\).
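For a quick numerical check of the Cauchy bound (an illustration we add, not part of the cited argument): let \(w\) be supported in \(\{0,1,2\}\) with \(\widetilde{w}(z)=z^{2}-3z+2=(z-1)(z-2)\). The roots \(1\) and \(2\) indeed lie in the disk of radius \(1+\max \{|-3|,|2|\}=4\), and for \(s=2\) the monic factors \(z-1\) and \(z-2\) have coefficients bounded in absolute value by \(2\), well below the bound \(s^{\frac{s}{2}}\left( 1+\Vert w\Vert _{\infty }\right) ^{s}=2\cdot 16=32\).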
1.3 The covering number bound of the function set \(\mathcal {H}_{J+1}^l\)
Lemma 3
For any \(l\in \{1,2,\ldots ,(2N+3)\widetilde{d}\}\) with \(\widetilde{d}=\lfloor (d+J s) / d\rfloor \), the covering number of \(\mathcal {H}_{J+1}^l\) satisfies
where \(\mathcal {A}_1\), \(\mathcal {A}_2\) and \(\widehat{R}\) are defined as in Lemma 6.
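Here and below, \(\mathcal {N}(\epsilon ,\mathcal {H})\) denotes the uniform-norm covering number; we recall the standard definition for the reader's convenience (the precise metric used in Lemma 6 of the paper may be stated slightly differently):
\[ \mathcal {N}(\epsilon ,\mathcal {H})=\min \Big \{N\in \mathbb {N}:\ \exists \,f_1,\ldots ,f_N \text { such that } \sup _{H\in \mathcal {H}}\min _{1\le k\le N}\Vert H-f_k\Vert _{\infty }\le \epsilon \Big \}. \]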
Proof
Since the function set \(\mathcal {H}_{J+1}^l\) inherits the substructure of the hypothesis space \(\mathcal {H}_{R, N}^{\mathcal {P}(\Omega )}\) up to the \((J+1)\)-th layer, the proof follows by procedures similar to those used in the proofs of Lemma 4 and Lemma 6.
For any \(l\in \{1,2,\ldots ,(2N+3)\widetilde{d}\}\) and any \(H\in \mathcal {H}_{J+1}^l\), choose another function \(\widetilde{H}\in \mathcal {H}_{J+1}^l\) induced by \(\widetilde{\omega }^{(j)}\) for \(j=1,2,\ldots ,J\), \(\widetilde{b}^{(j)}\) for \(j=1,2,\ldots ,J+1\), \(\widetilde{F}^{(J+1)}\), and \(\widetilde{c}\), subject to the restrictions \(\left\| \omega ^{(j)}-\widetilde{\omega }^{(j)}\right\| _{\infty }\le \epsilon \) for \(j=1,2,\ldots ,J\), \(\left\| b^{(j)}-\widetilde{b}^{(j)}\right\| _{\infty }\le \epsilon \) for \(j=1,2,\ldots ,J+1\), \(\left\| F^{(J+1)}-\widetilde{F}^{(J+1)}\right\| _{\infty }\le \epsilon \), and \(\left\| c-\widetilde{c}\right\| _{\infty }\le \epsilon \). Then, arguing as in the derivation of (3.18) in Lemma 6, we have
We observe that
which is defined in (3.19), and that the number of free parameters up to the \((J+1)\)-th layer of \(\mathcal {H}_{J+1}^{l}\) is not larger than the number of free parameters up to the \((J+2)\)-th layer of \(\mathcal {H}_{R,N}^{\mathcal {P}(\Omega )}\). Hence, by taking an \(\epsilon \)-net of the set of free parameters in \(\omega ^{(j)}\) for \(j=1,2,\ldots ,J\), \(b^{(j)}\) for \(j=1,2,\ldots ,J+1\), and \(F^{(J+1)}\), we derive that
which completes the proof after taking the logarithm.\(\square \)
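The final counting step is the standard parameter-space argument; schematically (our own summary with generic constants, not the paper's exact statement): if a function class \(\mathcal {H}\) is parametrized by \(n\) free parameters, each bounded in absolute value by \(B\), and an \(\epsilon \)-perturbation of the parameter vector changes the functions by at most \(\Lambda \epsilon \) in the uniform norm, then covering each parameter interval \([-B,B]\) by at most \(B/\epsilon +1\) points of an \(\epsilon \)-grid yields
\[ \log \mathcal {N}(\Lambda \epsilon ,\mathcal {H})\le n\log \Big (\frac{B}{\epsilon }+1\Big ),\qquad 0<\epsilon \le B, \]
which is the form in which the logarithm enters the bound above.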
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, Z., Zhou, DX. Deep learning theory of distribution regression with CNNs. Adv Comput Math 49, 51 (2023). https://doi.org/10.1007/s10444-023-10054-y