Kernel density estimation by stagewise algorithm with a simple dictionary

  • Original paper
  • Published in: Computational Statistics

Abstract

This study proposes multivariate kernel density estimation by a stagewise minimization algorithm based on U-divergence and a simple dictionary. The dictionary consists of an appropriate scalar bandwidth matrix and a part of the original data. The resulting estimator has data-adaptive weighting parameters and bandwidth matrices, and provides a sparse representation of the kernel density estimate. We derive a non-asymptotic error bound for the estimator obtained via the proposed stagewise minimization algorithm. Simulation studies confirm that the proposed estimator performs as well as, or sometimes better than, other well-known density estimators.
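To make the construction concrete, the following is a minimal one-dimensional sketch of a stagewise update of this kind, assuming the KL case of U-divergence (\(U(t)=e^{t}\), \(\xi =\log\), \(u=\exp\)), a dictionary formed by a random subsample of the data paired with a small grid of scalar bandwidths, Riemann-sum integration on a fixed grid, and a crude grid search over the mixing weight. The function name stagewise_kde_kl and all tuning constants are illustrative assumptions, not the authors' implementation.

import numpy as np

def stagewise_kde_kl(x, n_words=20, bandwidths=(0.2, 0.5, 1.0), stages=30,
                     pi_grid=np.linspace(0.05, 0.95, 19),
                     grid=np.linspace(-6.0, 6.0, 2001), seed=0):
    # Greedy stagewise estimate under the empirical KL-type U-loss
    #   L(g) = \int g(t) dt - (1/n) sum_i log g(X_i)   (a sketch, not the paper's exact update).
    rng = np.random.default_rng(seed)
    words = rng.choice(x, size=min(n_words, len(x)), replace=False)   # simple dictionary: a subsample
    def kernel(c, h):                                                 # Gaussian word centered at c with bandwidth h
        return lambda t: np.exp(-0.5 * ((t - c) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    def log_phi(p, t):
        return np.log(np.maximum(p(t), 1e-300))                       # floor avoids log(0) far from the word
    dictionary = [kernel(c, h) for c in words for h in bandwidths]
    dx = grid[1] - grid[0]
    def loss(log_g_grid, log_g_data):                                 # empirical U-loss for U = exp
        return np.sum(np.exp(log_g_grid)) * dx - np.mean(log_g_data)
    phi0 = min(dictionary, key=lambda p: loss(log_phi(p, grid), log_phi(p, x)))
    log_f_grid, log_f_data = log_phi(phi0, grid), log_phi(phi0, x)
    for _ in range(stages):
        # xi(f_new) = (1 - pi) xi(f) + pi xi(phi): greedy search over pi and phi
        pi, phi = min(((w, p) for w in pi_grid for p in dictionary),
                      key=lambda c: loss((1 - c[0]) * log_f_grid + c[0] * log_phi(c[1], grid),
                                         (1 - c[0]) * log_f_data + c[0] * log_phi(c[1], x)))
        log_f_grid = (1 - pi) * log_f_grid + pi * log_phi(phi, grid)
        log_f_data = (1 - pi) * log_f_data + pi * log_phi(phi, x)
    f = np.exp(log_f_grid)
    return grid, f / (np.sum(f) * dx)                                 # normalize the final estimate

Usage, for instance: xs, fhat = stagewise_kde_kl(np.random.default_rng(1).normal(size=300)). Each stage moves \(\xi ({\widehat{f}})\) toward a single dictionary element, which is the source of the sparse, data-adaptive weighting described above.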


References

  • Basu A, Harris IR, Hjort NL, Jones MC (1998) Robust and efficient estimation by minimising a density power divergence. Biometrika 85:549–559

  • Bregman LM (1967) The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput Math Math Phys 7(3):200–217

  • Dukkipati A (2010) On Kolmogorov–Nagumo averages and nonextensive entropy. In: 2010 international symposium on information theory and its applications (ISITA)

  • Duong T, Hazelton ML (2003) Plug-in bandwidth matrices for bivariate kernel density estimation. J Nonparametric Stat 15(1):17–30

  • Friedman JH, Stuetzle W, Schroeder A (1984) Projection pursuit density estimation. J Am Stat Assoc 79(387):599–608

  • Girolami M, He C (2003) Probability density estimation from optimally condensed data samples. IEEE Trans Pattern Anal Mach Intell 25:1253–1264

  • Klemelä J (2007) Density estimation with stagewise optimization of the empirical risk. Mach Learn 67:169–195

  • Murata N, Takenouchi T, Kanamori T, Eguchi S (2004) Information geometry of U-boost and Bregman divergence. Neural Comput 16:1437–1481

  • Minami M, Eguchi S (2002) Robust blind source separation by beta-divergence. Neural Comput 14:1859–1886

  • Naito K, Eguchi S (2013) Density estimation with minimization of \(U\)-divergence. Mach Learn 90(1):29–57

  • Nash JW, Sellers LT, Talbot RS, Cawthorn JA, Ford BW (1994) The population biology of abalone (Haliotis species) in Tasmania. I. Blacklip abalone (H. rubra) from the north coast and islands of Bass Strait. Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)

  • Ridgeway G (2002) Looking for lumps: boosting and bagging for density estimation. Comput Stat Data Anal 38(4):379–392

  • Scott D (2015) Multivariate density estimation: theory, practice, and visualization, 2nd edn. Wiley

  • Schott JR (2017) Matrix analysis for statistics, 3rd edn. Wiley

  • Wand MP, Jones MC (1993) Comparison of smoothing parametrizations in bivariate density estimation. J Am Stat Assoc 88:520–528

  • Zhang C, Jiang Y, Chai Y (2010) Penalized Bregman divergence for large-dimensional regression and classification. Biometrika 97:551–566

  • Zhang C, Jiang Y, Shang Z (2009) New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Can J Stat 37:119–139


Acknowledgements

Kiheiji NISHIDA is grateful for the financial support from Hyogo Medical University Grant for Research Promotion, 2022. Kanta NAITO gratefully acknowledges the financial support from KAKENHI 19K11851. We sincerely thank the editor and two anonymous reviewers for taking the time to review our manuscript and providing many constructive comments.

Author information

Corresponding author

Correspondence to Kiheiji Nishida.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

To prove Theorem 1, we need the following Lemmas 1–5. Let \({\tilde{f}}({\textbf{x}})\) be any density estimator and let

$$\begin{aligned} f^{*}({\textbf{x}}|{\textbf{X}}^{*}) = u\left( \sum _{j=1}^{N}p_{j}\xi (\phi _{j}({\textbf{x}}|{\textbf{X}}^{*})) \right) , \end{aligned}$$
(19)

where \(p_{j} \ge 0\), \(\sum _{j=1}^{N} p_{j} = 1\), and \(\phi _{j}({\textbf{x}}|{\textbf{X}}^{*}) \in D\). We consider two functions of the variable \(\pi \in [0, 1]\) defined as follows:

$$\begin{aligned} \theta (\pi | {\tilde{f}}({\textbf{x}}), f^{*}({\textbf{x}}|{\textbf{X}}^{*})) &= \sum _{j=1}^{N} p_{j} \int _{{\mathbb {R}}^{d}}\bigl [ U\bigl ((1-\pi )\xi ({\tilde{f}}({\textbf{x}})) + \pi \xi (\phi _{j}({\textbf{x}}|{\textbf{X}}^{*}))\bigr ) - (1-\pi )U(\xi ({\tilde{f}}({\textbf{x}}))) - \pi U(\xi (f^{*}({\textbf{x}}|{\textbf{X}}^{*}))) \bigr ]\,d{\textbf{x}}, \\ \eta (\pi | {\tilde{f}}({\textbf{x}}), f^{*}({\textbf{x}}|{\textbf{X}}^{*})) &= \sum _{j=1}^{N} p_{j} \int _{{\mathbb {R}}^{d}}\bigl [ U\bigl ((1-\pi )\xi ({\tilde{f}}({\textbf{x}})) + \pi \xi (\phi _{j}({\textbf{x}}|{\textbf{X}}^{*}))\bigr ) - U\bigl ((1-\pi ) \xi ({\tilde{f}}({\textbf{x}})) + \pi \xi (f^{*}({\textbf{x}}|{\textbf{X}}^{*}))\bigr ) \bigr ]\,d{\textbf{x}}. \end{aligned}$$

Then, Lemmas 1–3 are obtained as follows:

Lemma 1

Let \({\tilde{f}}({\textbf{x}})\) be a density estimator of \(f({\textbf{x}})\) and let \(f^{*}({\textbf{x}}|{\textbf{X}}^{*})\) be the estimator given in (19). For \(0 \le \pi \le 1\), it then follows that

$$\begin{aligned} \sum _{j=1}^{N}p_{j}\, {\widehat{L}}_{U}\bigl (u((1-\pi )\xi ({\tilde{f}}(\cdot ))+\pi \xi (\phi _{j}(\cdot |{\textbf{X}}^{*})))\bigr ) - {\widehat{L}}_{U}(f^{*}(\cdot |{\textbf{X}}^{*})) = (1-\pi )\bigl [{\widehat{L}}_{U}({\tilde{f}}(\cdot )) - {\widehat{L}}_{U}(f^{*}(\cdot |{\textbf{X}}^{*}))\bigr ] + \theta (\pi |{\tilde{f}}({\textbf{x}}), f^{*}({\textbf{x}}|{\textbf{X}}^{*})). \end{aligned}$$

Lemma 2

Let \({\tilde{f}}({\textbf{x}})\) be a density estimator of \(f({\textbf{x}})\) and let \(f^{*}({\textbf{x}}|{\textbf{X}}^{*})\) be the estimator given in (19). Then for any \(\pi \in [0, 1]\),

$$\begin{aligned} \theta (\pi |{\tilde{f}}({\textbf{x}}), f^{*}({\textbf{x}}|{\textbf{X}}^{*})) \le \eta (\pi |{\tilde{f}}({\textbf{x}}), f^{*}({\textbf{x}}|{\textbf{X}}^{*})). \end{aligned}$$

Lemma 3

Let \(f^{*}({\textbf{x}}|{\textbf{X}}^{*})\) be as given in (19) and let

$$\begin{aligned} {\tilde{f}}({\textbf{x}}|{\textbf{X}}^{*}) = u \left( \sum _{l=1}^{k} q_{l}\, \xi ({\tilde{\phi }}_{l}({\textbf{x}}|{\textbf{X}}^{*})) \right) \end{aligned}$$

for some \(\sum _{l=1}^{k} q_{l}\, \xi ({\tilde{\phi }}_{l}(\cdot |{\textbf{X}}^{*})) \in co(\xi (D))\). Under Assumption 1, it follows that

$$\begin{aligned} \eta (\pi ) = \eta (\pi |{\tilde{f}}({\textbf{x}}|{\textbf{X}}^{*}), f^{*}({\textbf{x}}|{\textbf{X}}^{*})) \le \pi ^{2} B_{U}({\textbf{X}}^{*})^{2} \end{aligned}$$

for any \(\pi \in [0,1]\).

In Lemmas 4–5, let \(W = \{ (\lambda _{\phi })_{\phi \in D} \,|\, \lambda _{\phi } \ge 0, \sum _{\phi \in D}\lambda _{\phi }=1, \#\{ \lambda _{\phi } > 0 \} < \infty \}\) and define the following convex combination:

$$\begin{aligned} f({\textbf{x}}, \Lambda ) = f({\textbf{x}}, \Lambda , D) = u \Bigl ( \sum _{\phi \in D} \lambda _{\phi }\xi (\phi ({\textbf{x}}|{\textbf{X}}^{*})) \Bigr ), {\textbf{x}} \in {\mathbb {R}}^{d}, \end{aligned}$$

where \(\Lambda = (\lambda _{\phi })_{\phi \in D} \in W\). Then, we obtain Lemmas 4–5.

Lemma 4

For the stagewise minimization density estimator \({\widehat{f}}({\textbf{x}}|{\textbf{X}}^{*})\), it holds under Assumption 1 that

$$\begin{aligned} {\widehat{L}}_{U}({\widehat{f}}(\cdot |{\textbf{X}}^{*})) \le \inf _{\Lambda \in W} {\widehat{L}}_{U}(f(\cdot , \Lambda )) + \frac{\theta ^2 B_{U}({\textbf{X}}^{*})^{2}}{M + (\theta -1)} + \delta . \end{aligned}$$

Lemma 5

For \(\tau > 0\), let \({\widehat{f}}(\cdot |{\textbf{X}}^{*}) \in \{ f(\cdot , \Lambda ) \,|\, \Lambda \in W\}\) be such that

$$\begin{aligned} {\widehat{L}}_{U}({\widehat{f}}(\cdot |{\textbf{X}}^{*})) \le \inf _{\Lambda \in W} {\widehat{L}}_{U}(f(\cdot , \Lambda )) + \tau . \end{aligned}$$

Then,

$$\begin{aligned} D_{U}(f, {\widehat{f}}(\cdot |{\textbf{X}}^{*})) \le \inf _{\Lambda \in W} {\widehat{L}}_{U}(f(\cdot , \Lambda )) + \tau + 2 \sup _{\phi \in D} |\nu _{n}(\xi (\phi (\cdot |{\textbf{X}}^{*})))|, \end{aligned}$$

where

$$\begin{aligned} \nu _{n}(\xi (\phi (\cdot |{\textbf{X}}^{*}))) = \frac{1}{n} \sum _{i=1}^{n} \xi (\phi ({\textbf{X}}_{i}|{\textbf{X}}^{*})) - \int _{{\mathbb {R}}^{d}} \xi (\phi ({\textbf{x}}|{\textbf{X}}^{*}))f({\textbf{x}})\,d{\textbf{x}}. \end{aligned}$$

If we replace the dictionary in the proofs of Klemelä (2007) and Naito and Eguchi (2013) with the one in (9), we obtain Lemmas 1–5. Using Lemmas 4–5 in conjunction with Lemmas 1–3, we obtain the non-asymptotic error bound given by Theorem 1.

Appendix B

We obtain Theorem 2 using Lemma 6.

Lemma 6

For any \(h({\textbf{x}}|{\textbf{X}}^{*}) \in co(\xi (D))\), let \(g({\textbf{x}}|{\textbf{X}}^{*}) = u(h({\textbf{x}}|{\textbf{X}}^{*}))\) and let \(g_{c}({\textbf{x}}|{\textbf{X}}^{*}) = v_{g}^{-1}g({\textbf{x}}|{\textbf{X}}^{*})\), where

$$\begin{aligned} v_{g} = v_{g}({\textbf{X}}^{*}) = \int _{{\mathbb {R}}^{d}}g({\textbf{x}}|{\textbf{X}}^{*})d{\textbf{x}}. \end{aligned}$$

Under Assumption 2, we have

$$\begin{aligned} D_{U}(f(\cdot ), g_{c}(\cdot |{\textbf{X}}^{*})) \le D_{U}(f(\cdot ), g(\cdot |{\textbf{X}}^{*})) + C_{U}^{-1} \bigl |1-v_{g}({\textbf{X}}^{*})^{-1} \bigr | \int _{{\mathbb {R}}^{d}} \bigl |g_{c}({\textbf{x}}|{\textbf{X}}^{*}) -f({\textbf{x}})\bigr |\, g({\textbf{x}}|{\textbf{X}}^{*})^{1-\alpha }\, d{\textbf{x}}. \end{aligned}$$

If we replace the dictionary in the proofs of Klemelä (2007) and Naito and Eguchi (2013) with the one in (9), we obtain Lemma 6.

Appendix C

Using the convexity of \(U(t) = \exp (t)\), we can verify for the KL divergence that

$$\begin{aligned} \Psi (\delta , \Phi |{\textbf{X}}^{*}) &= \int _{{\mathbb {R}}^{d}} \exp \left( (1-\delta ) \sum _{m=1}^{T}q_{m} \log {\tilde{\phi }}_{m}({\textbf{x}}|{\textbf{X}}^{*}) + \delta \log \phi ({\textbf{x}}|{\textbf{X}}^{*}) \right) \{ \log \phi ({\textbf{x}}|{\textbf{X}}^{*}) - \log {\bar{\phi }}({\textbf{x}}|{\textbf{X}}^{*}) \}^{2}\, d{\textbf{x}} \\ &\le \int _{{\mathbb {R}}^{d}} \left( (1-\delta ) \sum _{m=1}^{T}q_{m} {\tilde{\phi }}_{m}({\textbf{x}}|{\textbf{X}}^{*}) + \delta \phi ({\textbf{x}}|{\textbf{X}}^{*}) \right) \{ \log \phi ({\textbf{x}}|{\textbf{X}}^{*}) - \log {\bar{\phi }}({\textbf{x}}|{\textbf{X}}^{*}) \}^{2}\, d{\textbf{x}} \\ &\le (1-\delta ) \sum _{m=1}^{T}q_{m} \int _{{\mathbb {R}}^{d}} {\tilde{\phi }}_{m}({\textbf{x}}|{\textbf{X}}^{*}) \{ \log \phi ({\textbf{x}}|{\textbf{X}}^{*}) - \log {\bar{\phi }}({\textbf{x}}|{\textbf{X}}^{*}) \}^{2}\, d{\textbf{x}} + \delta \int _{{\mathbb {R}}^{d}} \phi ({\textbf{x}}|{\textbf{X}}^{*}) \{ \log \phi ({\textbf{x}}|{\textbf{X}}^{*}) - \log {\bar{\phi }}({\textbf{x}}|{\textbf{X}}^{*}) \}^{2}\, d{\textbf{x}} \\ &\le (1-\delta )\sum _{m=1}^{T}q_{m}B_{KL}({\textbf{X}}^{*})^{2} + \delta B_{KL}({\textbf{X}}^{*})^{2} \\ &= B_{KL}({\textbf{X}}^{*})^{2}. \end{aligned}$$

We evaluate the constant \(B_{KL}({\textbf{X}}^{*})^{2}\). Let \(h_{a}\), \(h_{b}\), and \(h_{c}\) be three different scalar bandwidths, which by assumption are not random variables, and let \({\textbf{X}}_{i}^{*}\), \({\textbf{X}}_{j}^{*}\), and \({\textbf{X}}_{k}^{*}\) be the means of the corresponding words. In what follows, we denote by \(\phi _{a}({\textbf{x}}|{\textbf{X}}_{i}^{*})\) the density of the d-dimensional multivariate normal distribution \(N_{d}({\textbf{X}}_{i}^{*}, h_{a}^{2}{\textbf{I}})\), and by \(\phi (\cdot )\) the density of the d-dimensional standard normal distribution. For notational convenience, we define \(h_{ab} \equiv h_{a}/h_{b}\), \(h_{ac} \equiv h_{a}/h_{c}\), and \(h_{cb} \equiv h_{c}/h_{b}\). Then, we obtain

$$\begin{aligned} J &= \int _{{\mathbb {R}}^{d}} \phi _{a}({\textbf{x}}|{\textbf{X}}_{i}^{*}) \{ \log {\phi _{b}({\textbf{x}}|{\textbf{X}}_{j}^{*})} - \log {\phi _{c}({\textbf{x}}|{\textbf{X}}_{k}^{*})} \}^{2}\, d{\textbf{x}} \\ &= \int _{{\mathbb {R}}^{d}} \phi _{a}({\textbf{x}}|{\textbf{X}}_{i}^{*}) \Bigl \{ \frac{1}{2h_{c}^{2}} \Vert {\textbf{x}} - {\textbf{X}}_{k}^{*} \Vert ^{2} - \frac{1}{2h_{b}^{2}}\Vert {\textbf{x}} - {\textbf{X}}_{j}^{*} \Vert ^{2} + d\log {h_{cb}} \Bigr \}^{2}\, d{\textbf{x}} \\ &= \int _{{\mathbb {R}}^{d}} \phi _{a}({\textbf{x}}|{\textbf{X}}_{i}^{*}) \Biggl [ \frac{h_{ac}^{2} - h_{ab}^{2}}{2} \cdot \frac{\Vert {\textbf{x}} - {\textbf{X}}_{i}^{*} \Vert ^{2}}{h_{a}^{2}} + \frac{({\textbf{x}}-{\textbf{X}}_{i}^{*})^{T}}{h_{a}} \Biggl \{ h_{ac}^{2} \frac{({\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*})}{h_{a}} - h_{ab}^{2} \frac{({\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*})}{h_{a}} \Biggr \} + \frac{\Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*} \Vert ^{2}}{2h_{c}^{2}} - \frac{\Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*} \Vert ^{2}}{2h_{b}^{2}} + d \log h_{cb} \Biggr ]^{2}\, d{\textbf{x}} \\ &= \int _{{\mathbb {R}}^{d}} \phi ({\textbf{t}}) \Bigl \{ C_{1}(a, b, c) \Vert {\textbf{t}} \Vert ^{2} + {\textbf{t}}^{T} {\textbf{C}}_{2}(a, b, c; i,j,k) + C_{3}(b, c; i,j,k) \Bigr \}^{2}\, d{\textbf{t}}, \end{aligned}$$

where the last equality follows from the change of variable \({\textbf{t}} = ({\textbf{x}} - {\textbf{X}}_{i}^{*})/h_{a}\), and we define

$$\begin{aligned} C_{1} &\equiv C_{1}(a, b, c) = \frac{h_{ac}^{2} - h_{ab}^{2}}{2} \in {\mathbb {R}}, \\ {\textbf{C}}_{2} &\equiv {\textbf{C}}_{2}(a, b, c; i,j,k) = h_{ac}^{2} \frac{({\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*})}{h_{a}} - h_{ab}^{2} \frac{({\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*})}{h_{a}} \in {\mathbb {R}}^{d}, \\ C_{3} &\equiv C_{3}(b, c; i,j,k) = \frac{\Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*} \Vert ^{2}}{2h_{c}^{2}} - \frac{\Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*} \Vert ^{2}}{2h_{b}^{2}} + d \log h_{cb} \in {\mathbb {R}}. \end{aligned}$$

The symbol \(\Vert {\textbf{x}} \Vert\) denotes \(({\textbf{x}}^{T}{\textbf{x}})^{1/2}\). Using the fact that the odd-order moments of the standard normal distribution are zero, together with Theorem 11.22 in Schott (2017, p. 480), we obtain

$$\begin{aligned} J &= \int _{{\mathbb {R}}^{d}} \phi ({\textbf{t}}) \Bigl \{ C_{1}^2 \Vert {\textbf{t}} \Vert ^{4} + ({\textbf{C}}_{2}^{T}{\textbf{t}})({\textbf{t}}^{T} {\textbf{C}}_{2}) + C_{3}^{2} + 2 C_{1} \Vert {\textbf{t}} \Vert ^{2}({\textbf{t}}^{T} {\textbf{C}}_{2}) + 2C_{3}{\textbf{t}}^{T}{\textbf{C}}_{2} + 2C_{1}C_{3} \Vert {\textbf{t}}\Vert ^{2} \Bigr \}\, d{\textbf{t}} \\ &= d(d+2) C_{1}^{2} + \Vert {\textbf{C}}_{2} \Vert ^{2} + 2d C_{1}C_{3} + C_{3}^{2} \\ &= 2d C_{1}^{2} + \Vert {\textbf{C}}_{2} \Vert ^{2} + (d C_{1} + C_{3})^{2}. \end{aligned}$$
(20)
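Equation (20) uses \(E\Vert {\textbf{t}}\Vert ^{2}=d\), \(E\Vert {\textbf{t}}\Vert ^{4}=d(d+2)\), \(E[({\textbf{t}}^{T}{\textbf{C}}_{2})^{2}]=\Vert {\textbf{C}}_{2}\Vert ^{2}\), and the vanishing of the odd-order terms. A quick Monte Carlo check of the identity, with arbitrary test constants (an illustrative sketch, not part of the proof):

import numpy as np

rng = np.random.default_rng(0)
d = 3
C1, C2, C3 = 0.7, rng.normal(size=d), -1.2         # arbitrary test constants
t = rng.normal(size=(2_000_000, d))                # t ~ N_d(0, I)
inner = C1 * np.sum(t**2, axis=1) + t @ C2 + C3
mc = np.mean(inner**2)                             # Monte Carlo estimate of J
closed = 2*d*C1**2 + C2 @ C2 + (d*C1 + C3)**2      # right-hand side of (20)
print(mc, closed)                                  # the two values agree up to Monte Carlo error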

We define

$$\begin{aligned} h_{R} = \frac{h_{max}}{h_{min}}, \end{aligned}$$

where \(h_{min}\) and \(h_{max}\) are the minimum and the maximum bandwidths in the dictionary, respectively. We also define

$$\begin{aligned} R^{2} = \max _{i \ne j} \Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*} \Vert ^{2}. \end{aligned}$$

Then, we obtain

$$\begin{aligned} |C_{1}| = \Biggl | \frac{h_{ac}^{2}-h_{ab}^{2}}{2} \Biggr | \le \frac{h_{ac}^{2}+h_{ab}^{2}}{2} \le \frac{h_{R}^{2}+h_{R}^{2}}{2} = h_{R}^{2}, \end{aligned}$$
(21)
$$\begin{aligned} \Vert {\textbf{C}}_{2}\Vert ^{2} &= \Biggl \Vert h_{ac}^{2} \frac{({\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*})}{h_{a}} - h_{ab}^{2} \frac{({\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*})}{h_{a}} \Biggr \Vert ^{2} \\ &\le \frac{h_{ac}^{4}}{h_{a}^{2}} \Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*}\Vert ^{2} + \frac{h_{ab}^{4}}{h_{a}^{2}} \Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*}\Vert ^{2} + 2 \frac{h_{ac}^{2}h_{ab}^{2}}{h_{a}^{2}} \Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*}\Vert \cdot \Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*} \Vert \\ &\le \frac{h_{ac}^{4}}{h_{a}^{2}}R^{2} + \frac{h_{ab}^{4}}{h_{a}^{2}}R^{2} + 2 \frac{h_{ac}^{2}h_{ab}^{2}}{h_{a}^{2}} R^{2} \\ &= \Biggl ( \frac{h_{ac}^{2} + h_{ab}^{2}}{h_{a}} \Biggr )^{2} R^{2} \\ &\le 4 \Biggl ( \frac{h_{R}^{2}}{h_{min}} \Biggr )^{2} R^{2}, \end{aligned}$$
(22)

and

$$\begin{aligned} |C_{3}| &= \Biggl | \frac{\Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*}\Vert ^{2}}{2h_{c}^{2}} - \frac{\Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*}\Vert ^{2}}{2h_{b}^{2}} + d \log h_{cb} \Biggr | \\ &\le \frac{1}{2h_{min}^{2}} \Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{k}^{*}\Vert ^{2} + \frac{1}{2h_{min}^{2}} \Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*}\Vert ^{2} + d \log h_{R} \\ &\le \frac{1}{2h_{min}^{2}} R^{2} + \frac{1}{2h_{min}^{2}} R^{2} + d \log h_{R} \\ &= \frac{R^{2}}{h_{min}^{2}} + d \log h_{R}. \end{aligned}$$
(23)

Therefore, using (21), (22), and (23) in (20), we obtain the upper bound

$$\begin{aligned} J \le B_{KL}({\textbf{X}}^{*})^{2} \le 2d h_{R}^{4} + 4 \Biggl ( \frac{h_{R}^{2}}{h_{min}} \Biggr )^{2} R^{2} + \Biggl \{ dh_{R}^{2} + \frac{R^{2}}{h_{min}^{2}} + d \log h_{R} \Biggr \}^2. \end{aligned}$$
(24)

\(\square\)
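As an illustrative numerical check of (24), one can compare the worst-case value of \(J\) over a small dictionary with the right-hand side of the bound; the word centers and bandwidth grid below are arbitrary assumptions, not data from the paper.

import numpy as np

rng = np.random.default_rng(1)
d, m = 2, 10
words = rng.normal(size=(m, d))                                   # X_1*, ..., X_m*
bands = np.array([0.3, 0.6, 1.0])                                 # candidate scalar bandwidths
h_min, h_R = bands.min(), bands.max() / bands.min()
R2 = max(np.sum((a - b)**2) for a in words for b in words)        # R^2 = max ||X_i* - X_j*||^2
bound = 2*d*h_R**4 + 4*(h_R**2 / h_min)**2 * R2 \
        + (d*h_R**2 + R2 / h_min**2 + d*np.log(h_R))**2           # right-hand side of (24)

def J(ha, hb, hc, xi, xj, xk):                                    # J from (20) via C1, C2, C3
    hab, hac = ha / hb, ha / hc
    C1 = (hac**2 - hab**2) / 2
    C2 = hac**2 * (xi - xk) / ha - hab**2 * (xi - xj) / ha
    C3 = np.sum((xi - xk)**2) / (2*hc**2) - np.sum((xi - xj)**2) / (2*hb**2) + d*np.log(hc / hb)
    return 2*d*C1**2 + C2 @ C2 + (d*C1 + C3)**2

worst = max(J(ha, hb, hc, xi, xj, xk)
            for ha in bands for hb in bands for hc in bands
            for xi in words for xj in words for xk in words)
print(worst, bound)                                               # the worst-case J never exceeds the bound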

Appendix D

We prove that the finiteness of \(E_{{\textbf{X}}^{*}}[B_{U}({\textbf{X}}^{*})^{2}]\) in (15) is ensured in the case of the KL divergence if the fourth moment of \({\textbf{X}}_{i}^{*}\) is finite, as described in Remark 3. Taking the expectation of the right-hand side of (24), we need to evaluate

$$\begin{aligned} E_{{\textbf{X}}^{*}}\Bigl [ (R^{2})^2 \Bigr ] &= E_{{\textbf{X}}^{*}} \Bigl [ \max _{i \ne j} (\Vert {\textbf{X}}_{i}^{*} - {\textbf{X}}_{j}^{*}\Vert ^{2})^2 \Bigr ] \\ &\le E_{{\textbf{X}}^{*}} \Bigl [ \max _{i \ne j} (\Vert {\textbf{X}}_{i}^{*} \Vert ^{2} + \Vert {\textbf{X}}_{j}^{*} \Vert ^{2} + 2 |{\textbf{X}}_{i}^{*T} {\textbf{X}}_{j}^{*}| )^2 \Bigr ] \\ &\le E_{{\textbf{X}}^{*}} \Bigl [ (\max _{i} \Vert {\textbf{X}}_{i}^{*} \Vert ^{2} + \max _{j} \Vert {\textbf{X}}_{j}^{*} \Vert ^{2} + 2 \max _{i} \Vert {\textbf{X}}_{i}^{*}\Vert \max _{j} \Vert {\textbf{X}}_{j}^{*}\Vert )^2 \Bigr ] \\ &= 16 E_{{\textbf{X}}^{*}} \Bigl [\max _{i} \Vert {\textbf{X}}_{i}^{*} \Vert ^{4} \Bigr ]. \end{aligned}$$

The last equality holds because \(\max _{i}\Vert {\textbf{X}}_{i}^{*}\Vert ^{2} = \max _{j}\Vert {\textbf{X}}_{j}^{*}\Vert ^{2} = (\max _{i}\Vert {\textbf{X}}_{i}^{*}\Vert )^{2}\), so the quantity in parentheses equals \(4\max _{i}\Vert {\textbf{X}}_{i}^{*}\Vert ^{2}\). Furthermore, we obtain

$$\begin{aligned} E_{{\textbf{X}}^{*}} \Bigl [ \max _{i} \Vert {\textbf{X}}_{i}^{*} \Vert ^{4} \Bigr ] \le \sum _{i=1}^{m} E_{{\textbf{X}}^{*}} \Bigl [ \Vert {\textbf{X}}_{i}^{*} \Vert ^{4} \Bigr ] \le m \cdot \max _{i} E_{{\textbf{X}}^{*}} \Bigl [ \Vert {\textbf{X}}_{i}^{*} \Vert ^{4} \Bigr ]. \end{aligned}$$

Hence, it suffices to assume \(E_{{\textbf{X}}^{*}} \Bigl [ \Vert {\textbf{X}}_{i}^{*} \Vert ^{4} \Bigr ] < \infty\). \(\square\)
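A small simulation illustrating the two displays above; the heavy-tailed \(t_{5}\) sample is an arbitrary choice with a finite fourth moment, and the sizes are illustrative only.

import numpy as np

rng = np.random.default_rng(2)
d, m, reps = 3, 8, 20000
X = rng.standard_t(df=5, size=(reps, m, d))                         # m words per replication, finite 4th moment
R4 = np.max(np.sum((X[:, :, None, :] - X[:, None, :, :])**2, axis=-1), axis=(1, 2))**2
lhs = R4.mean()                                                     # E[(R^2)^2]
rhs = 16 * m * np.max(np.mean(np.sum(X**2, axis=-1)**2, axis=0))    # 16 m max_i E||X_i*||^4
print(lhs, rhs)                                                     # lhs <= rhs holds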

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nishida, K., Naito, K. Kernel density estimation by stagewise algorithm with a simple dictionary. Comput Stat 39, 523–560 (2024). https://doi.org/10.1007/s00180-022-01303-7


  • DOI: https://doi.org/10.1007/s00180-022-01303-7
