Abstract
This article introduces the R package hermiter, which facilitates the estimation of univariate and bivariate probability density functions and cumulative distribution functions, along with full quantile functions (univariate) and nonparametric correlation coefficients (bivariate), using Hermite series based estimators. The algorithms implemented in the hermiter package are particularly useful in the sequential setting (both stationary and non-stationary) and in the one-pass batch estimation setting for large data sets. In addition, the Hermite series based estimators are approximately mergeable, allowing parallel and distributed estimation.
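To make the sequential and mergeable properties concrete, the following is a minimal sketch of a univariate Hermite series density estimator written from scratch in Python. It is not the hermiter implementation or its API: the class name `HermiteDensitySketch`, its methods, and the default truncation order are hypothetical and chosen for illustration only. The sketch assumes the standard construction in which the density estimate is a truncated sum of orthonormal Hermite functions, with each coefficient estimated by the sample mean of the corresponding function. It omits the standardization of observations, so merging is exact here, whereas the abstract describes the package's estimators as approximately mergeable.

```python
import math
import random


def hermite_functions(x, order):
    """Return [h_0(x), ..., h_order(x)], the orthonormal Hermite functions,
    computed with the usual three-term recurrence."""
    h = [math.pi ** -0.25 * math.exp(-0.5 * x * x)]
    if order >= 1:
        h.append(math.sqrt(2.0) * x * h[0])
    for k in range(1, order):
        h.append(math.sqrt(2.0 / (k + 1)) * x * h[k]
                 - math.sqrt(k / (k + 1)) * h[k - 1])
    return h


class HermiteDensitySketch:
    """Hypothetical illustration class, not part of any library."""

    def __init__(self, order=10):
        self.order = order
        self.n = 0                          # observations seen so far
        self.coeffs = [0.0] * (order + 1)   # running means of h_k(x_i)

    def update(self, x):
        """Sequential update: O(order) time, O(order) memory, one pass."""
        self.n += 1
        h = hermite_functions(x, self.order)
        self.coeffs = [a + (hk - a) / self.n
                       for a, hk in zip(self.coeffs, h)]

    def merge(self, other):
        """Combine two sketches by a count-weighted average of coefficients."""
        merged = HermiteDensitySketch(self.order)
        merged.n = self.n + other.n
        merged.coeffs = [(self.n * a + other.n * b) / merged.n
                         for a, b in zip(self.coeffs, other.coeffs)]
        return merged

    def density(self, x):
        """Evaluate the truncated Hermite series density estimate at x."""
        h = hermite_functions(x, self.order)
        return sum(a * hk for a, hk in zip(self.coeffs, h))


# Two workers each summarize half of a standard normal sample.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10000)]
left, right, single = (HermiteDensitySketch() for _ in range(3))
for x in data[:5000]:
    left.update(x)
for x in data[5000:]:
    right.update(x)
for x in data:
    single.update(x)

merged = left.merge(right)
print(merged.density(0.0), single.density(0.0))
```

Because each coefficient is a running mean, the merged sketch agrees with the single-pass sketch up to floating-point error, which is what allows the work to be split across parallel or distributed workers in this simplified setting.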
Notes
We would like to thank Ted Dunning for useful discussions in this regard.
This platform is an ECN, i.e. an Electronic Communication Network.
Acknowledgements
The views expressed in this article are those of the authors and do not necessarily reflect the views of Rand Merchant Bank. Rand Merchant Bank does not make any representations or give any warranties as to the correctness, accuracy or completeness of the information presented; nor does Rand Merchant Bank assume liability for any losses arising from errors or omissions in the information in this article. We would like to thank Ted Dunning for useful and interesting discussions. We would also like to sincerely thank the editor, associate editor and particularly the reviewers for thorough and deeply insightful feedback that helped us greatly improve this article.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Stephanou, M., Varughese, M. hermiter: R package for sequential nonparametric estimation. Comput Stat 39, 1127–1163 (2024). https://doi.org/10.1007/s00180-023-01382-0
DOI: https://doi.org/10.1007/s00180-023-01382-0