
A UniverApproCNN with Universal Approximation and Explicit Training Strategy

  • Conference paper
  • First Online:
Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2021)

Abstract

Approximation theory has achieved many results in the field of deep learning. However, these results can rarely be applied directly to practical problems or used to guide the training and optimization of deep learning models in real settings. To address this issue, we construct a CNN structure with universal approximation, called UniverApproCNN, whose approximation error is bounded by an explicit upper bound that depends on the hyperparameters of the model. Moreover, we consider the general multidimensional case by generalizing the universality result for CNNs. A practical problem in the field of inertial guidance is used as the experimental background, so that the theory yields an explicit training strategy and bridges the gap between theory and application. Using a curve similarity index defined via the Fréchet distance, we show that the experimental results are highly consistent with the functional relationship given by the theory. On this basis, we define the 'approximation coefficient' of UniverApproCNN, which determines the stopping time of model training and related training strategies. Finally, taking normalization, a widely used technique, into consideration, we show that this operation does not affect the approximation performance of the CNN and of UniverApproCNN.
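The curve-similarity comparison above is built on the Fréchet distance. As a point of reference only (the paper defines its own index; the function below, its name, and the sample inputs are illustrative rather than the authors' implementation), here is a minimal sketch of the standard discrete Fréchet distance between two sampled trajectories:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two sampled curves (Eiter-Mannila DP).

    P, Q: arrays of shape (m, k) and (n, k) whose rows are points on each curve.
    Shown only to illustrate the curve-similarity index mentioned in the abstract.
    """
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    m, n = len(P), len(Q)
    dist = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
    ca = np.full((m, n), np.inf)
    ca[0, 0] = dist[0, 0]
    for i in range(1, m):
        ca[i, 0] = max(ca[i - 1, 0], dist[i, 0])
    for j in range(1, n):
        ca[0, j] = max(ca[0, j - 1], dist[0, j])
    for i in range(1, m):
        for j in range(1, n):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), dist[i, j])
    return ca[-1, -1]

# e.g. comparing a predicted trajectory against the curve given by the theory:
# similarity = discrete_frechet(predicted_curve, theoretical_curve)
```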

Supported by the Special Fund for Scientific and Technological Innovation Strategy of Guangdong Province.

Y. Yang, Y. Wang, and S. Yang contributed equally to this work.




A Appendix


A.1 Proof of Theorem 3

Proof

Because of the density of \(H^{r}(\varOmega )\) in \(C(\varOmega )\) when \(r \ge \frac{nd}{2}+2\), without loss of generality we assume \(f \in H^{r}(\varOmega )\). When the activation function \(\sigma (u)=\max \{u,0\}\), \(u \in \mathbb {R}\), is not considered, the input-related coefficients of the network are all contained in the convolution matrix

$$\begin{aligned} \big ( T^{w^{(J)}}\cdots T^{w^{(2)}}T^{w^{(1),1}},\cdots \cdots , T^{w^{(J)}}\cdots T^{w^{(2)}}T^{w^{(1),d}}\big ) {\mathop {=}\limits ^{\bigtriangleup }}\big ( T^{W^{1}},\cdots \cdots , T^{W^{d}}\big ){\mathop {=}\limits ^{\bigtriangleup }}T, \end{aligned}$$

where \(W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i}\) and \(T^{W^{i}}\) is the Toeplitz type matrix induced by \(W^{i}\).
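As a side illustration of this notation (not part of the proof), the sketch below builds the Toeplitz-type matrix induced by a filter, assuming the common convention \((T^{W})_{i,j}=W_{i-j}\) for an input of length d and a filter supported on \(\{0,\cdots ,S\}\); the function name and sizes are ours:

```python
import numpy as np

def toeplitz_conv_matrix(W, d):
    """Toeplitz-type matrix induced by a filter W = (W_0, ..., W_S).

    Assumes the convention (T^W)_{i, j} = W_{i - j} (zero outside the support),
    so T^W has shape (d + S, d) and T^W @ x equals the full convolution W * x.
    """
    W = np.asarray(W, dtype=float)
    S = len(W) - 1
    T = np.zeros((d + S, d))
    for i in range(d + S):
        for j in range(d):
            if 0 <= i - j <= S:
                T[i, j] = W[i - j]
    return T

# quick check against numpy's full convolution
x = np.array([1.0, 2.0, 3.0, 4.0])   # input of length d = 4
W = np.array([1.0, -2.0, 0.5])       # filter with support S = 2
assert np.allclose(toeplitz_conv_matrix(W, len(x)) @ x, np.convolve(W, x))
```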

Note that in the ramp ridge function, the coefficients associated with the first sub-block x(1 : n) of length n of input x are

$$\begin{aligned} \{ \alpha ^{1}_{m},\cdots ,\alpha ^{n}_{m},\cdots \cdots ,\alpha ^{1}_{1},\cdots ,\alpha ^{n}_{1},\alpha ^{1}_{0},\cdots ,\alpha ^{n}_{0}\}, \end{aligned}$$

while in the network the coefficients related to \(x(1:n)\) are contained in \(T^{W^{1}}\).

Let \(\big (W^{1}_{(m+1)n-1},\cdots ,W^{1}_{1},W^{1}_{0}\big )\)

$$\begin{aligned} =(\alpha ^{1}_{m},\cdots ,\alpha ^{n}_{m},\cdots \cdots ,\alpha ^{1}_{1},\cdots ,\alpha ^{n}_{1},\alpha ^{1}_{0},\cdots ,\alpha ^{n}_{0}), \end{aligned}$$

we have

$$\begin{aligned} T^{W^{1}}_{(k+1)n;:}=\big (\alpha _{k}(1:n)\big )^{T}. \end{aligned}$$

Similarly, let \(\big (W^{i}_{(m+1)n-1},\cdots ,W^{i}_{1},W^{i}_{0}\big )\)

$$\begin{aligned} =\big (\alpha ^{(i-1)n+1}_{m},\cdots ,\alpha ^{in}_{m},\cdots \cdots ,\alpha ^{(i-1)n+1}_{0},\cdots ,\alpha ^{in}_{0}\big ), \end{aligned}$$

we can get

$$\begin{aligned} T^{W^{i}}_{(k+1)n;:}=\big (\alpha _{k}\big ((i-1)n+1:in\big )\big )^{T} \end{aligned}$$

and

$$\begin{aligned} T_{(k+1)n;:}=\big (T^{W^{1}}_{(k+1)n;:},\cdots ,T^{W^{d}}_{(k+1)n;:}\big )=(\alpha _{k})^{T}, \end{aligned}$$
$$\begin{aligned} k=0,1,2,\cdots ,m. \end{aligned}$$

Since \(W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i}\), using the generating function of a sequence and the fundamental theorem of algebra, we can decompose \(W^{i}\) into finitely supported sequences \(\{w^{(j)}\}^{J}_{j=1}\). So far, we have decomposed the coefficients \(\{\alpha _{0},\alpha _{1},\cdots ,\alpha _{m}\}\) of the ramp ridge function \(F_{m}(x)\) into the weight parameters of each layer of this particular network.

Note that when we decompose

$$\begin{aligned} W^{i}=w^{(J)}*\cdots *w^{(2)}*w^{(1),i},\quad i=1,2,\cdots ,d, \end{aligned}$$

the generating functions \(\widetilde{W^{i}}(z)\) must share the common factor \(\widetilde{w^{(J)}}(z)\cdots \widetilde{w^{(2)}}(z)\). When a nontrivial common factor exists, we can decompose it into a finitely supported sequence \(\{w^{(j)}\}^{J}_{j=2}\).
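To make the factorization step concrete, here is a small numerical sketch (our illustration, not the authors' construction): the generating polynomial of a filter \(W\) is factored through its roots, the roots are grouped into blocks of size at most \(s\), and convolving the resulting sub-filters recovers \(W\). Grouping conjugate root pairs, which would be needed to keep each factor real, is ignored here for brevity:

```python
import numpy as np
from functools import reduce

def factor_filter(W, s):
    """Split a filter W into sub-filters of support <= s whose convolution is W.

    Illustrative only: uses the roots of the generating polynomial
    W~(z) = sum_k W_k z^k (fundamental theorem of algebra). Factors may be
    complex unless conjugate roots are grouped into the same block.
    """
    p = np.asarray(W, dtype=complex)[::-1]     # highest-degree coefficient first
    roots = np.roots(p)
    chunks = [roots[i:i + s] for i in range(0, len(roots), s)]
    factors = [np.poly(c) for c in chunks]     # monic factors, highest degree first
    factors[0] = factors[0] * p[0]             # restore the leading coefficient
    return [f[::-1] for f in factors]          # back to filter ordering (W_0 first)

W = np.array([2.0, -3.0, 1.0, 4.0, -1.0])      # a filter with support size 4
parts = factor_filter(W, s=2)
recon = reduce(np.convolve, parts)
assert np.allclose(recon, W)                   # the sub-filters convolve back to W
```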

If nonlinear activation is considered only in the J-th layer, let

$$\begin{aligned} b^{(J)}_{l}=\left\{ \begin{array}{ll} -B^{(J)}, &{} l=n\\ t_{k}, &{} l=(k+1)n, 1\le k \le m\\ B^{(J)}, &{} \text {otherwise} \end{array} \right. , \end{aligned}$$

what differs from the one-dimensional input case is

$$\begin{aligned} B^{(0)}=\max _{x\in \varOmega }\max _{k=1,2,\cdots ,dn}|x_{k}|,\quad \varOmega \subseteq [-1,1]^{d \times n}, \end{aligned}$$
$$\begin{aligned} B^{(1)}=\big ( \Vert w^{(1),1}\Vert _{1}+\cdots +\Vert w^{(1),d}\Vert _{1}\big )B^{(0)}. \end{aligned}$$

We have

$$\begin{aligned} h^{(J)}_{l}= & {} \sigma (T\cdot x-b^{(J)})_{l}\\= & {} \left\{ \begin{array}{ll} \alpha _{0}x+B^{(J)}, &{} l=n\\ (\alpha _{k}x-t_{k})_{+}, &{} l=(k+1)n, 1\le k \le m\\ 0, &{} \text {otherwise} \end{array} \right. , \end{aligned}$$

let \(w^{(J+1)}=\{w^{(J+1)}_{0}\}=\{1\}\) and

$$\begin{aligned} b^{(J+1)}_{l}=\left\{ \begin{array}{ll} -B^{(J)}, &{} l=1\\ 0, &{} \text {otherwise} \end{array} \right. , \end{aligned}$$

then \(T^{w^{(J+1)}}=I_{d_{J}}\) and

$$\begin{aligned} h^{(J+1)}_{l}= & {} \sigma (I\cdot h^{(J)}-b^{(J+1)})_{l}\\= & {} \left\{ \begin{array}{ll} B^{(1)}, &{} l=1\\ \alpha _{0}x+B^{(J)}, &{} l=n\\ (\alpha _{k}x-t_{k})_{+}, &{} l=(k+1)n, 1\le k \le m\\ 0, &{} \text {otherwise} \end{array} \right. . \end{aligned}$$

For the set of \(\mathbf {w}\), \(\mathbf {b}\) constructed above, there exists \(f^{\mathbf {w,b}}_{J+1} \in \mathcal {H}^{\mathbf {w,b}}_{J+1}\) such that

$$\begin{aligned} \parallel f-f^{\mathbf {w,b}}_{J+1} \parallel _{C(\varOmega )} \le c_{0}v_{f,2}\max \{\sqrt{\log m},\sqrt{nd}\}\,m^{-\frac{1}{2}-\frac{1}{nd}}. \end{aligned}$$

If there is no nontrivial common factor, we consider a single convolutional layer; in this case \(w^{(1),i}=W^{i}\) is a sequence of length \((m+1)n\), so \(s=(m+1)n-1\). In other words, this is the special case \(J=1\): there exists \(f^{\mathbf {w,b}}_{2} \in \mathcal {H}^{\mathbf {w,b}}_{2}\) such that

$$\begin{aligned} \parallel f-f^{\mathbf {w,b}}_{2} \parallel _{C(\varOmega )} \le c_{0}v_{f,2}\max \{\sqrt{\log m},\sqrt{nd}\}\,m^{-\frac{1}{2}-\frac{1}{nd}}. \end{aligned}$$
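Since the right-hand side of this bound is explicit in the hyper-parameters, it can be evaluated directly; this is what is later turned into a training strategy via the approximation coefficient. Below is a minimal sketch of evaluating the bound as a function of \(m\); \(c_{0}\) and \(v_{f,2}\) are problem-dependent constants, and the values used here are placeholders:

```python
import numpy as np

def approx_upper_bound(m, n, d, c0=1.0, v_f2=1.0):
    """Evaluate c0 * v_{f,2} * max(sqrt(log m), sqrt(n*d)) * m**(-1/2 - 1/(n*d)).

    c0 and v_f2 are problem-dependent constants; the defaults are placeholders.
    """
    nd = n * d
    return c0 * v_f2 * max(np.sqrt(np.log(m)), np.sqrt(nd)) * m ** (-0.5 - 1.0 / nd)

# the bound decays as the number of ramp ridge units m grows
for m in (10, 100, 1000, 10000):
    print(m, approx_upper_bound(m, n=8, d=3))
```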

A.2 Model Training Results in the Trajectory Prediction Experiments

Fig. 4. The experiment on the y-axis.

Fig. 5. The experiment on the z-axis.

A.3 Back Propagation Process of UniverApproCNN

Take the single-output CNN considered above, with two convolutional layers and a fully connected layer, as an example. The derivative of the loss function \(L=(Y-y)^{2}\), where \(y=\sum _{l=1}^{d_{2}}c_{l}h^{(2)}_{l}\), with respect to \(w^{(1)}_{k}\) is

$$\begin{aligned} \frac{\partial L}{\partial w^{(1)}_{k}}=\frac{\partial L}{\partial y} \cdot \sum _{l=1}^{d_{2}} \frac{\partial y}{\partial h^{(2)}_{l}}\cdot \frac{\partial h^{(2)}_{l}}{\partial w^{(1)}_{k}}, \end{aligned}$$

For fixed \(0 \le k\le s\), define \(a^{(j)}=T^{w^{(j)}}\cdots T^{w^{(1)}}x\); then essentially we need \(\frac{\partial a^{(2)}_{l}}{\partial w^{(1)}_{k}}\) and \(\frac{\partial B^{(2)}}{\partial w^{(1)}_{k}}\). Observe that \(a^{(1)}_{i}\) contains terms involving \(w^{(1)}_{k}\) only when \(k+1 \le i \le k+d\), and \(\frac{\partial a^{(1)}_{i} }{\partial w^{(1)}_{k}}=x_{i-k}\).

Further, when \(k+1 \le l \le k+d+s\), \(a^{(2)}_{l}\) contains terms involving \(w^{(1)}_{k}\), and we only need to consider \(\sum _{i=k+1}^{k+d} w^{(2)}_{l-i}a^{(1)}_{i}\) in \(a^{(2)}_{l}\). Since \(w^{(2)}\) is finitely supported, we only consider \(\sum _{i=\max \{k+1,l-s\}}^{\min \{k+d,l\}} w^{(2)}_{l-i}a^{(1)}_{i}\); then \(\frac{\partial a^{(2)}_{l} }{\partial a^{(1)}_{i}}=w^{(2)}_{l-i}\) and

$$\begin{aligned} \frac{\partial a^{(2)}_{l} }{\partial w^{(1)}_{k}}= & {} \sum _{i=\max \{k+1,l-s\}}^{\min \{k+d,l\}} \frac{\partial a^{(2)}_{l} }{\partial a^{(1)}_{i}} \cdot \frac{\partial a^{(1)}_{i} }{\partial w^{(1)}_{k}}\\= & {} \sum _{i=\max \{k+1,l-s\}}^{\min \{k+d,l\}} w^{(2)}_{l-i} \cdot x_{i-k}. \end{aligned}$$

It is easy to find that

$$\begin{aligned} \frac{\partial B^{(2)}}{\partial w^{(1)}_{k}}=\sum _{l=1}^{d_{2}}c_{l}\cdot \Vert w^{(2)}\Vert _{1}\cdot (| w^{(1)}_{k}|)^{'} \cdot B^{(0)} \end{aligned}$$

Similar conclusions can be obtained for the CNN with depth \(J+1\) and single output.
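To double-check the index bookkeeping in the closed-form gradient above, the following sketch (our illustration; 0-based indexing, whereas the appendix is 1-based) evaluates \(\sum _{i}w^{(2)}_{l-i}x_{i-k}\) over the stated range and compares it with a finite-difference estimate of \(\partial a^{(2)}_{l}/\partial w^{(1)}_{k}\); all sizes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 6, 3                          # input length and filter support (illustrative)
x = rng.normal(size=d)
w1 = rng.normal(size=s + 1)          # first-layer filter w^{(1)}
w2 = rng.normal(size=s + 1)          # second-layer filter w^{(2)}

def a2(w1):
    # a^{(2)} = T^{w^{(2)}} T^{w^{(1)}} x, i.e. the full convolution w2 * w1 * x
    return np.convolve(w2, np.convolve(w1, x))

k, l = 1, 4                          # fixed indices (0-based here)

# closed form from the appendix: sum_i w2[l - i] * x[i - k] over the overlap range
lo, hi = max(k, l - s), min(k + d - 1, l)
closed = sum(w2[l - i] * x[i - k] for i in range(lo, hi + 1))

# finite-difference check (a2 is linear in w1, so this is exact up to rounding)
eps = 1e-6
w1p = w1.copy()
w1p[k] += eps
numeric = (a2(w1p)[l] - a2(w1)[l]) / eps

assert np.isclose(closed, numeric, atol=1e-6)
```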


Copyright information

© 2021 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper


Cite this paper

Yang, Y., Wang, Y., Yang, S. (2021). A UniverApproCNN with Universal Approximation and Explicit Training Strategy. In: Gao, H., Wang, X. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2021. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 407. Springer, Cham. https://doi.org/10.1007/978-3-030-92638-0_18


  • DOI: https://doi.org/10.1007/978-3-030-92638-0_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92637-3

  • Online ISBN: 978-3-030-92638-0

  • eBook Packages: Computer Science (R0)
