
A UniverApproCNN with Universal Approximation and Explicit Training Strategy

  • Conference paper
  • First Online:
Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2021)

Abstract

Approximation theory has achieved many results in the field of deep learning. However, these results can rarely be applied directly to practical problems or used to guide the training and optimization of deep learning models in real settings. To address this issue, we construct a CNN structure with universal approximation, called UniverApproCNN, whose approximation error is bounded by an explicit upper bound that depends on the hyperparameters of the model. Moreover, we consider the general multidimensional case by generalizing the universality result for CNNs. A practical problem in the field of inertial guidance is used as the experimental background, so that the theory yields an explicit training strategy and bridges the gap between theory and application. Using a curve similarity index defined via the Fréchet distance, we show that the experimental results are highly consistent with the functional relationship given by the theory. On this basis, we define the 'approximation coefficient' of UniverApproCNN, which determines the stopping time of model training and related training strategies. Finally, taking normalization, a widely used technique, into consideration, we show that this operation does not affect the approximation performance of the CNN and of UniverApproCNN.
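The curve-similarity comparison above is built on the Fréchet distance. As a point of reference only (the paper defines its own index; the function below, its name, and the sample inputs are illustrative rather than the authors' implementation), here is a minimal sketch of the standard discrete Fréchet distance between two sampled trajectories:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two sampled curves (Eiter-Mannila DP).

    P, Q: arrays of shape (m, k) and (n, k) whose rows are points on each curve.
    Shown only to illustrate the curve-similarity index mentioned in the abstract.
    """
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    m, n = len(P), len(Q)
    dist = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    ca = np.full((m, n), np.inf)
    ca[0, 0] = dist[0, 0]
    for i in range(1, m):
        ca[i, 0] = max(ca[i - 1, 0], dist[i, 0])
    for j in range(1, n):
        ca[0, j] = max(ca[0, j - 1], dist[0, j])
    for i in range(1, m):
        for j in range(1, n):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), dist[i, j])
    return ca[-1, -1]

# e.g. comparing a predicted trajectory against the curve given by the theory:
# similarity = discrete_frechet(predicted_curve, theoretical_curve)
```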

Supported by the Special Fund for Scientific and Technological Innovation Strategy of Guangdong Province.

Y. Yang, Y. Wang, and S. Yang contributed equally to this work.




A Appendix


A.1 Proof of Theorem 3

Proof

Because of the density of \(H^{r}(\varOmega )\) in \(C(\varOmega )\) when \(r \ge \frac{nd}{2}+2\), without loss of generality we assume \(f \in H^{r}(\varOmega )\). When the activation function \(\sigma (u)=\max \{u,0\}\), \(u \in \mathbb {R}\), is not considered, the input-related coefficients of the network are all contained in the convolution matrix

$$\begin{aligned} \big ( T^{w^{(J)}}\cdots T^{w^{(2)}}T^{w^{(1),1}},\cdots \cdots , T^{w^{(J)}}\cdots T^{w^{(2)}}T^{w^{(1),d}}\big ) {\mathop {=}\limits ^{\bigtriangleup }}\big ( T^{W^{1}},\cdots \cdots , T^{W^{d}}\big ){\mathop {=}\limits ^{\bigtriangleup }}T, \end{aligned}$$

where \(W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i}\) and \(T^{W^{i}}\) is the Toeplitz type matrix induced by \(W^{i}\).
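As a side illustration of this notation (not part of the proof), the sketch below builds the Toeplitz-type matrix induced by a filter, assuming the common convention \((T^{W})_{i,j}=W_{i-j}\) for an input of length d and a filter supported on \(\{0,\cdots ,S\}\); the function name and sizes are ours:

```python
import numpy as np

def toeplitz_conv_matrix(W, d):
    """Toeplitz-type matrix induced by a filter W = (W_0, ..., W_S).

    Assumes the convention (T^W)_{i, j} = W_{i - j} (zero outside the support),
    so T^W has shape (d + S, d) and T^W @ x equals the full convolution W * x.
    """
    W = np.asarray(W, dtype=float)
    S = len(W) - 1
    T = np.zeros((d + S, d))
    for i in range(d + S):
        for j in range(d):
            if 0 <= i - j <= S:
                T[i, j] = W[i - j]
    return T

# quick check against numpy's full convolution
x = np.array([1.0, 2.0, 3.0, 4.0])   # input of length d = 4
W = np.array([1.0, -2.0, 0.5])       # filter with support S = 2
assert np.allclose(toeplitz_conv_matrix(W, len(x)) @ x, np.convolve(W, x))
```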

Note that in the ramp ridge function, the coefficients associated with the first sub-block x(1 : n) of length n of input x are

$$\begin{aligned} \{ \alpha ^{1}_{m},\cdots ,\alpha ^{n}_{m},\cdots \cdots ,\alpha ^{1}_{1},\cdots ,\alpha ^{n}_{1},\alpha ^{1}_{0},\cdots ,\alpha ^{n}_{0}\}, \end{aligned}$$

while in the network the coefficients related to \(x(1:n)\) are contained in \(T^{W^{1}}\).

Let \(\big (W^{1}_{(m+1)n-1},\cdots ,W^{1}_{1},W^{1}_{0}\big )\)

$$\begin{aligned} =(\alpha ^{1}_{m},\cdots ,\alpha ^{n}_{m},\cdots \cdots ,\alpha ^{1}_{1},\cdots ,\alpha ^{n}_{1},\alpha ^{1}_{0},\cdots ,\alpha ^{n}_{0}), \end{aligned}$$

we have

$$\begin{aligned} T^{W^{1}}_{(k+1)n;:}=\big (\alpha _{k}(1:n)\big )^{T}. \end{aligned}$$

Similarly, let \(\big (W^{i}_{(m+1)n-1},\cdots ,W^{i}_{1},W^{i}_{0}\big )\)

$$\begin{aligned} =\big (\alpha ^{(i-1)n+1}_{m},\cdots ,\alpha ^{in}_{m},\cdots \cdots ,\alpha ^{(i-1)n+1}_{0},\cdots ,\alpha ^{in}_{0}\big ), \end{aligned}$$

we can get

$$\begin{aligned} T^{W^{i}}_{(k+1)n;:}=\big (\alpha _{k}\big ((i-1)n+1:in\big )\big )^{T} \end{aligned}$$

and

$$\begin{aligned} T_{(k+1)n;:}=\big (T^{W^{1}}_{(k+1)n;:},\cdots ,T^{W^{d}}_{(k+1)n;:}\big )=(\alpha _{k})^{T}, \end{aligned}$$
$$\begin{aligned} k=0,1,2,\cdots ,m. \end{aligned}$$

Since \(W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i}\), using the generating function of a sequence and the fundamental theorem of algebra, we can decompose \(W^{i}\) into finitely supported sequences \(\{w^{(j)}\}^{J}_{j=1}\). So far, we have decomposed the coefficients \(\{\alpha _{0},\alpha _{1},\cdots ,\alpha _{m}\}\) of the ramp ridge function \(F_{m}(x)\) into the weight parameters of each layer of this particular network.

Note that when we decompose

$$\begin{aligned} W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i},\quad i=1,2,\cdots ,d, \end{aligned}$$

the generating functions \(\widetilde{W^{i}}(z)\) must share the common factor \(\widetilde{w^{(J)}}(z)\cdots \widetilde{w^{(2)}}(z)\). When a nontrivial common factor exists, we can decompose it into a finitely supported sequence \(\{w^{(j)}\}^{J}_{j=2}\).
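To make the factorization step concrete, here is a small numerical sketch (our illustration, not the authors' construction): the generating polynomial of a filter \(W\) is factored through its roots, the roots are grouped into blocks of size at most \(s\), and convolving the resulting sub-filters recovers \(W\). Grouping conjugate root pairs, which would be needed to keep each factor real, is ignored here for brevity:

```python
import numpy as np
from functools import reduce

def factor_filter(W, s):
    """Split a filter W into sub-filters of support <= s whose convolution is W.

    Illustrative only: uses the roots of the generating polynomial
    W~(z) = sum_k W_k z^k (fundamental theorem of algebra). Factors may be
    complex unless conjugate roots are grouped into the same block.
    """
    p = np.asarray(W, dtype=complex)[::-1]     # highest-degree coefficient first
    roots = np.roots(p)
    chunks = [roots[i:i + s] for i in range(0, len(roots), s)]
    factors = [np.poly(c) for c in chunks]     # monic factors, highest degree first
    factors[0] = factors[0] * p[0]             # restore the leading coefficient
    return [f[::-1] for f in factors]          # back to filter ordering (W_0 first)

W = np.array([2.0, -3.0, 1.0, 4.0, -1.0])      # a filter with support size 4
parts = factor_filter(W, s=2)
recon = reduce(np.convolve, parts)
assert np.allclose(recon, W)                   # the sub-filters convolve back to W
```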

If nonlinear activation is considered only in the J-th layer, let

$$\begin{aligned} b^{(J)}_{l}=\left\{ \begin{array}{ll} -B^{(J)}, &{} l=n\\ t_{k}, &{} l=(k+1)n, 1\le k \le m\\ B^{(J)}, &{} \text {otherwise} \end{array} \right. , \end{aligned}$$

what differs from the one-dimensional input case is

$$\begin{aligned} B^{(0)}=\max _{x\in \varOmega }\max _{k=1,2,\cdots ,dn}|x_{k}|,\quad \varOmega \subseteq [-1,1]^{d \times n}, \end{aligned}$$
$$\begin{aligned} B^{(1)}=\big ( \Vert w^{(1),1}\Vert _{1}+\cdots +\Vert w^{(1),d}\Vert _{1}\big )B^{(0)}. \end{aligned}$$

We have

$$\begin{aligned} h^{(J)}_{l}= & {} \sigma (T\cdot x-b^{(J)})_{l}\\= & {} \left\{ \begin{array}{ll} \alpha _{0}x+B^{(J)}, &{} l=n\\ (\alpha _{k}x-t_{k})_{+}, &{} l=(k+1)n, 1\le k \le m\\ 0, &{} \text {otherwise} \end{array} \right. , \end{aligned}$$

let \(w^{(J+1)}=\{w^{(J+1)}_{0}\}=\{1\}\) and

$$\begin{aligned} b^{(J+1)}_{l}=\left\{ \begin{array}{ll} -B^{(J)}, &{} l=1\\ 0, &{} \text {otherwise} \end{array} \right. , \end{aligned}$$

then \(T^{w^{(J+1)}}=I_{d_{J}}\) and

$$\begin{aligned} h^{(J+1)}_{l}= & {} \sigma (I\cdot h^{(J)}-b^{(J+1)})_{l}\\= & {} \left\{ \begin{array}{ll} B^{(1)}, &{} l=1\\ \alpha _{0}x+B^{(J)}, &{} l=n\\ (\alpha _{k}x-t_{k})_{+}, &{} l=(k+1)n, 1\le k \le m\\ 0, &{} \text {otherwise} \end{array} \right. . \end{aligned}$$

For the set of \(\mathbf {w}\), \(\mathbf {b}\) constructed above, there exists \(f^{\mathbf {w,b}}_{J+1} \in \mathcal {H}^{\mathbf {w,b}}_{J+1}\) such that

$$\begin{aligned} \parallel f-f^{\mathbf {w,b}}_{J+1} \parallel _{C(\varOmega )} \le c_{0}v_{f,2}\max \{\sqrt{\log m},\sqrt{nd}\}\,m^{-\frac{1}{2}-\frac{1}{nd}}. \end{aligned}$$

If there is no nontrivial common factor, we consider a single convolutional layer; in this case \(w^{(1),i}=W^{i}\) is a sequence of length \((m+1)n\), so \(s=(m+1)n-1\). In other words, this is the special case \(J=1\): there exists \(f^{\mathbf {w,b}}_{2} \in \mathcal {H}^{\mathbf {w,b}}_{2}\) such that

$$\begin{aligned} \parallel f-f^{\mathbf {w,b}}_{2} \parallel _{C(\varOmega )} \le c_{0}v_{f,2}\max \{\sqrt{\log m},\sqrt{nd}\}\,m^{-\frac{1}{2}-\frac{1}{nd}}. \end{aligned}$$
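Since the right-hand side of this bound is explicit in the hyper-parameters, it can be evaluated directly; this is what is later turned into a training strategy via the approximation coefficient. Below is a minimal sketch of evaluating the bound as a function of \(m\); \(c_{0}\) and \(v_{f,2}\) are problem-dependent constants, and the values used here are placeholders:

```python
import numpy as np

def approx_upper_bound(m, n, d, c0=1.0, v_f2=1.0):
    """Evaluate c0 * v_{f,2} * max(sqrt(log m), sqrt(n*d)) * m**(-1/2 - 1/(n*d)).

    c0 and v_f2 are problem-dependent constants; the defaults are placeholders.
    """
    nd = n * d
    return c0 * v_f2 * max(np.sqrt(np.log(m)), np.sqrt(nd)) * m ** (-0.5 - 1.0 / nd)

# the bound decays as the number of ramp ridge units m grows
for m in (10, 100, 1000, 10000):
    print(m, approx_upper_bound(m, n=8, d=3))
```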

A.2 Model Training Results in the Trajectory Prediction Experiments

Fig. 4. The experiment on the y-axis.

Fig. 5. The experiment on the z-axis.

A.3 Back Propagation Process of UniverApproCNN

Take the single-output CNN considered above, with two convolutional layers and a fully connected layer, as an example. The derivative of the loss function \(L=(Y-y)^{2}\), where \(y=\sum _{l=1}^{d_{2}}c_{l}h^{(2)}_{l}\), with respect to \(w^{(1)}_{k}\) is

$$\begin{aligned} \frac{\partial L}{\partial w^{(1)}_{k}}=\frac{\partial L}{\partial y} \cdot \sum _{l=1}^{d_{2}} \frac{\partial y}{\partial h^{(2)}_{l}}\cdot \frac{\partial h^{(2)}_{l}}{\partial w^{(1)}_{k}}, \end{aligned}$$

For fixed \(0 \le k\le s\), define \(a^{(j)}=T^{w^{(j)}}\cdots T^{w^{(1)}}x\); then essentially we need \(\frac{\partial a^{(2)}_{l}}{\partial w^{(1)}_{k}}\) and \(\frac{\partial B^{(2)}}{\partial w^{(1)}_{k}}\). Observe that \(a^{(1)}_{i}\) contains terms involving \(w^{(1)}_{k}\) only when \(k+1 \le i \le k+d\), and \(\frac{\partial a^{(1)}_{i} }{\partial w^{(1)}_{k}}=x_{i-k}\).

Further, when \(k+1 \le l \le k+d+s\), \(a^{(2)}_{l}\) contains terms involving \(w^{(1)}_{k}\), and we only need to consider \(\sum _{i=k+1}^{k+d} w^{(2)}_{l-i}a^{(1)}_{i}\) in \(a^{(2)}_{l}\). Since \(w^{(2)}\) is finitely supported, we only consider \(\sum _{i=\max \{k+1,l-s\}}^{\min \{k+d,l\}} w^{(2)}_{l-i}a^{(1)}_{i}\); then \(\frac{\partial a^{(2)}_{l} }{\partial a^{(1)}_{i}}=w^{(2)}_{l-i}\) and

$$\begin{aligned} \frac{\partial a^{(2)}_{l} }{\partial w^{(1)}_{k}}= & {} \sum _{i=\max \{k+1,l-s\}}^{\min \{k+d,l\}} \frac{\partial a^{(2)}_{l} }{\partial a^{(1)}_{i}} \cdot \frac{\partial a^{(1)}_{i} }{\partial w^{(1)}_{k}}\\= & {} \sum _{i=\max \{k+1,l-s\}}^{\min \{k+d,l\}} w^{(2)}_{l-i} \cdot x_{i-k}. \end{aligned}$$

It is easy to find that

$$\begin{aligned} \frac{\partial B^{(2)}}{\partial w^{(1)}_{k}}=\sum _{l=1}^{d_{2}}c_{l}\cdot \Vert w^{(2)}\Vert _{1}\cdot (| w^{(1)}_{k}|)^{'} \cdot B^{(0)} \end{aligned}$$

Similar conclusions can be obtained for the CNN with depth \(J+1\) and single output.
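To double-check the index bookkeeping in the closed-form gradient above, the following sketch (our illustration; 0-based indexing, whereas the appendix is 1-based) evaluates \(\sum _{i}w^{(2)}_{l-i}x_{i-k}\) over the stated range and compares it with a finite-difference estimate of \(\partial a^{(2)}_{l}/\partial w^{(1)}_{k}\); all sizes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 6, 3                          # input length and filter support (illustrative)
x = rng.normal(size=d)
w1 = rng.normal(size=s + 1)          # first-layer filter w^{(1)}
w2 = rng.normal(size=s + 1)          # second-layer filter w^{(2)}

def a2(w1):
    # a^{(2)} = T^{w^{(2)}} T^{w^{(1)}} x, i.e. the full convolution w2 * w1 * x
    return np.convolve(w2, np.convolve(w1, x))

k, l = 1, 4                          # fixed indices (0-based here)

# closed form from the appendix: sum_i w2[l - i] * x[i - k] over the overlap range
lo, hi = max(k, l - s), min(k + d - 1, l)
closed = sum(w2[l - i] * x[i - k] for i in range(lo, hi + 1))

# finite-difference check (a2 is linear in w1, so this is exact up to rounding)
eps = 1e-6
w1p = w1.copy()
w1p[k] += eps
numeric = (a2(w1p)[l] - a2(w1)[l]) / eps

assert np.isclose(closed, numeric, atol=1e-6)
```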


Copyright information

© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper


Cite this paper

Yang, Y., Wang, Y., Yang, S. (2021). A UniverApproCNN with Universal Approximation and Explicit Training Strategy. In: Gao, H., Wang, X. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2021. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 407. Springer, Cham. https://doi.org/10.1007/978-3-030-92638-0_18


  • DOI: https://doi.org/10.1007/978-3-030-92638-0_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92637-3

  • Online ISBN: 978-3-030-92638-0

  • eBook Packages: Computer Science (R0)
