Abstract
Approximation theory has produced many results for deep learning. However, these results can rarely be applied directly to practical problems or used to guide the training and optimization of deep learning models in practice. To address this issue, we construct a CNN structure with the universal approximation property, called UniverApproCNN, whose approximation error is bounded by an explicit upper bound that depends on the hyperparameters of the model. Moreover, we generalize the universality result for CNNs to the case of multidimensional input. A practical problem from the field of inertial guidance serves as the experimental background, so that the theory yields an explicit training strategy and breaks the barrier between theory and application. Using the curve similarity index defined by the Fréchet distance, we show that the experimental results are highly consistent with the functional relationship given by the theory. On this basis, we define the ‘approximation coefficient’ of UniverApproCNN, which determines the stopping time of model training and the related training strategy. Finally, taking normalization into consideration as a widely used technique, we show that this operation does not affect the approximation performance of the CNN or of UniverApproCNN.
Supported by the special fund for scientific and technological innovation strategy of Guangdong Province.
Y. Yang, Y. Wang and S. Yang—These authors contributed equally.
A Appendix
A.1 Proof of Theorem 3
Proof
Because of the density of \(H^{r}(\varOmega )\) in \(C(\varOmega )\) when \(r \ge \frac{nd}{2}+2\), we may assume without loss of generality that \(f \in H^{r}(\varOmega )\). When the activation function \(\sigma (u)=\max \{u,0\}\), \(u \in \mathbb {R}\), is not considered, the input-related coefficients of the network are all contained in the convolution matrix
given the following notation:
where \(W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i}\) and \(T^{W^{i}}\) is the Toeplitz-type matrix induced by \(W^{i}\).
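To make the notation concrete, here is a minimal Python sketch of one common convention for the Toeplitz-type matrix induced by a finitely supported filter \(W\), namely \((T^{W})_{i,k}=W_{i-k}\) with zero entries outside the support, so that \(T^{W}x\) coincides with the convolution \(W*x\); the exact matrix sizes used in the paper may differ, so this convention is an assumption made for illustration only.

import numpy as np

# Illustrative convention (an assumption): (T^W)_{i,k} = W_{i-k}, zero outside the support.
def toeplitz_from_filter(W, d):
    """Return the (len(W)+d-1) x d Toeplitz-type matrix induced by the filter W."""
    L = len(W)
    T = np.zeros((L + d - 1, d))
    for i in range(L + d - 1):
        for k in range(d):
            if 0 <= i - k < L:
                T[i, k] = W[i - k]
    return T

W = np.array([1.0, -2.0, 0.5])   # a finitely supported sequence W_0, W_1, W_2
x = np.arange(4.0)               # an input of length d = 4
print(np.allclose(toeplitz_from_filter(W, 4) @ x, np.convolve(W, x)))   # True: T^W x = W * x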
Note that in the ramp ridge function, the coefficients associated with the first sub-block \(x(1:n)\) of length \(n\) of the input \(x\) are
while the coefficients related to \(x(1:n)\) in the network are contained in \(T^{W^{1}}\).
Let \(\big (W^{1}_{(m+1)n-1},\cdots ,W^{1}_{1},W^{1}_{0}\big )\)
We have
Similarly, let \(\big (W^{i}_{(m+1)n-1},\cdots ,W^{i}_{1},W^{i}_{0}\big )\)
we can get
and
Since \(W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i}\), using the generating function of a sequence and the fundamental theorem of algebra, we can decompose \(W^{i}\) into finitely supported sequences \(\{w^{(j)}\}^{J}_{j=1}\). So far, we have decomposed the coefficients \(\{\alpha _{0},\alpha _{1},\cdots ,\alpha _{m}\}\) of the ramp ridge function \(F_{m}(x)\) into the weight parameters of the layers of this particular network.
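In more detail, this step rests on the fact that convolution of finitely supported sequences corresponds to multiplication of their generating functions; in the notation above, a sketch of the identity reads
\[
\widetilde{W^{i}}(z)\;=\;\sum_{k\ge 0}W^{i}_{k}\,z^{k}
\;=\;\widetilde{w^{(J)}}(z)\cdots\widetilde{w^{(2)}}(z)\,\widetilde{w^{(1),i}}(z),
\]
and conversely, by the fundamental theorem of algebra, the polynomial \(\widetilde{W^{i}}(z)\) can be factored into real polynomials of low degree, whose coefficient sequences yield finitely supported filters \(\{w^{(j)}\}^{J}_{j=1}\) with \(W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i}\) (the precise degree bounds are those chosen in the paper and are not reproduced here).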
Note that when we decompose
a common factor of \(\widetilde{w^{(J)}}(z)\cdots \widetilde{w^{(2)}}(z)\) is required. When a nontrivial common factor exists, we can decompose it into sequences \(\{w^{(j)}\}^{J}_{j=2}\) with finite support.
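This decomposition can also be checked numerically. The following Python sketch (an illustration under our own choices of filter length and factor grouping, not the paper's construction) factors the generating polynomial of a random real filter into linear and quadratic real factors and verifies that convolving the resulting sub-filters recovers the original sequence.

import numpy as np

# Sketch: decompose a real filter W into short real sub-filters whose convolution recovers W.
rng = np.random.default_rng(0)
W = rng.standard_normal(7)                  # W_0, ..., W_6

roots = np.roots(W[::-1])                   # roots of W~(z); np.roots expects descending powers
factors, used = [], np.zeros(len(roots), dtype=bool)
for i, r in enumerate(roots):
    if used[i]:
        continue
    used[i] = True
    if abs(r.imag) < 1e-12:                 # real root -> linear factor, ascending coefficients (-r, 1)
        factors.append(np.array([-r.real, 1.0]))
    else:                                   # pair r with its conjugate -> real quadratic factor
        j = int(np.argmin(np.abs(roots - r.conjugate()) + used * 1e9))
        used[j] = True
        factors.append(np.array([abs(r) ** 2, -2.0 * r.real, 1.0]))

recon = np.array([W[-1]])                   # leading coefficient of W~(z)
for w_j in factors:                         # convolution of sequences = multiplication of polynomials
    recon = np.convolve(recon, w_j)
print(np.allclose(recon, W))                # True up to floating-point error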
If nonlinear activation is considered only in the J-th layer, let
what differs from the case of one-dimensional input is
We have
let \(w^{(J+1)}=\{w^{(J+1)}_{0}\}=\{1\}\) and
then \(T^{w^{(J+1)}}=I_{d_{J}}\) and
For the set of \(\mathbf {w}\), \(\mathbf {b}\) constructed above, there exists \(f^{\mathbf {w,b}}_{J+1} \in \mathcal {H}^{\mathbf {w,b}}_{J+1}\) such that
If there is no nontrivial common factor, we consider a single convolutional layer. In this case, \(w^{(1),i}=W^{i}\) is a sequence of length \((m+1)d\) and \(s=(m+1)n-1\). In other words, this is the special case \(J=1\); that is, there exists \(f^{\mathbf {w,b}}_{2} \in \mathcal {H}^{\mathbf {w,b}}_{2}\) such that
A.2 Model Training Results in the Trajectory Prediction Experiments
A.3 Back Propagation Process of UniverApproCNN
Take the single-output CNN with two convolutional layers and a fully connected layer considered above as an example. We compute the derivative of the loss function \(L=(Y-y)^{2}\) with respect to \(w^{(1)}_{k}\), where \(y=\sum _{l=1}^{d_{2}}c_{l}h^{(2)}_{l}\). The derivative of \(L\) with respect to \(w^{(1)}_{k}\) is obtained by the chain rule.
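A sketch of this chain-rule expansion, under the assumption that \(h^{(2)}_{l}=\sigma \big (a^{(2)}_{l}-B^{(2)}_{l}\big )\) with \(a^{(2)}\) and \(B^{(2)}\) as defined below (the paper's exact form of the bias term may differ):
\[
\frac{\partial L}{\partial w^{(1)}_{k}}
=-2\,(Y-y)\,\frac{\partial y}{\partial w^{(1)}_{k}}
=-2\,(Y-y)\sum_{l=1}^{d_{2}}c_{l}\,
\sigma'\big(a^{(2)}_{l}-B^{(2)}_{l}\big)
\left(\frac{\partial a^{(2)}_{l}}{\partial w^{(1)}_{k}}
-\frac{\partial B^{(2)}_{l}}{\partial w^{(1)}_{k}}\right).
\]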
For fixed \(0 \le k\le s\),
Define \(a^{(j)}=T^{w^{(j)}}\cdots T^{w^{(1)}}x\); then essentially we need to compute \(\frac{\partial a^{(2)}_{l}}{\partial w^{(1)}_{k}}\) and \(\frac{\partial B^{(2)}}{\partial w^{(1)}_{k}}\). For fixed \(0 \le k\le s\), observe that when \(k+1 \le i \le k+d\), \(a^{(1)}_{i}\) contains terms involving \(w^{(1)}_{k}\), and \(\frac{\partial a^{(1)}_{i}}{\partial w^{(1)}_{k}}=x_{i-k}\).
Further, when \(k+1 \le l \le k+d+s\), \(a^{(2)}_{l}\) contains terms involving \(w^{(1)}_{k}\), and we only need to consider the part \(\sum _{i=k+1}^{k+d} w^{(2)}_{l-i}a^{(1)}_{i}\) of \(a^{(2)}_{l}\). Since \(w^{(2)}\) is finitely supported, we only consider \(\sum _{i=\max \{k+1,l-s\}}^{\min \{k+d,l\}} w^{(2)}_{l-i}a^{(1)}_{i}\); then \(\frac{\partial a^{(2)}_{l}}{\partial a^{(1)}_{i}}=w^{(2)}_{l-i}\) and
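Combining the two partial derivatives above (this combined form follows directly from the chain rule and is written out here for completeness),
\[
\frac{\partial a^{(2)}_{l}}{\partial w^{(1)}_{k}}
=\sum_{i=\max \{k+1,\,l-s\}}^{\min \{k+d,\,l\}}
\frac{\partial a^{(2)}_{l}}{\partial a^{(1)}_{i}}\,
\frac{\partial a^{(1)}_{i}}{\partial w^{(1)}_{k}}
=\sum_{i=\max \{k+1,\,l-s\}}^{\min \{k+d,\,l\}} w^{(2)}_{l-i}\,x_{i-k}.
\]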
It is easy to find that
Similar conclusions can be obtained for a CNN of depth \(J+1\) with a single output.
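The computation above can be verified numerically. The following Python sketch checks the chain-rule gradient against central finite differences for a small two-layer convolutional chain; it makes simplifying assumptions that are not the paper's exact setting (no activation between the two convolutional layers, a ReLU only after the second layer, and a bias vector treated as independent of \(w^{(1)}\)), and it uses 0-based indexing while the text above is 1-based.

import numpy as np

# Simplified two-layer 1-D convolutional chain (a sketch, not the paper's exact network).
rng = np.random.default_rng(1)
d, s = 6, 2                                    # input length and filter support size
x = rng.standard_normal(d)
w1 = rng.standard_normal(s + 1)                # w^(1)_0, ..., w^(1)_s
w2 = rng.standard_normal(s + 1)                # w^(2)_0, ..., w^(2)_s
b2 = rng.standard_normal(d + 2 * s)            # biases, treated as fixed (an assumption)
c = rng.standard_normal(d + 2 * s)             # fully connected coefficients c_l
Y = 0.7                                        # target value

def forward(w1):
    a1 = np.convolve(w1, x)                    # a^(1) = T^{w^(1)} x
    a2 = np.convolve(w2, a1)                   # a^(2) = T^{w^(2)} T^{w^(1)} x
    h2 = np.maximum(a2 - b2, 0.0)              # activation only after the last convolutional layer
    return a2, c @ h2

a2, y = forward(w1)

# Analytic gradient of L = (Y - y)^2 with respect to w^(1)_k via the chain rule above.
grad = np.zeros_like(w1)
for k in range(s + 1):
    da1 = np.zeros(d + s)
    da1[k:k + d] = x                           # d a^(1)_i / d w^(1)_k = x_{i-k}
    da2 = np.convolve(w2, da1)                 # d a^(2)_l / d w^(1)_k = sum_i w^(2)_{l-i} x_{i-k}
    grad[k] = -2.0 * (Y - y) * (c @ ((a2 > b2) * da2))

# Central finite-difference check.
eps = 1e-6
fd = np.zeros_like(w1)
for k in range(s + 1):
    wp, wm = w1.copy(), w1.copy()
    wp[k] += eps
    wm[k] -= eps
    fd[k] = ((Y - forward(wp)[1]) ** 2 - (Y - forward(wm)[1]) ** 2) / (2 * eps)
print(np.allclose(grad, fd, atol=1e-5))        # expected: True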