MEMe: An Accurate Maximum Entropy Method for Efficient Approximations in Large-Scale Machine Learning
Abstract
1. Introduction
- We propose a maximum entropy algorithm that is stable and consistent for hundreds of moments, surpassing other off-the-shelf algorithms, which are limited to a small number of moments. Based on this robust algorithm, we develop a new Maximum Entropy Method (MEMe) that improves the scalability of existing machine learning algorithms by efficiently approximating computational bottlenecks using maximum entropy and fast moment estimation techniques;
- We establish the link between maximum entropy methods and variational inference under moment constraints, thereby connecting the former to well-known Bayesian approximation techniques (the underlying moment-constrained problem is stated after this list);
- We apply MEMe to the problem of estimating the log determinant, crucial to inference in determinantal point processes [11], and to that of estimating the entropy of a Gaussian mixture, important to state-of-the-art information-theoretic Bayesian optimisation algorithms.
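For reference, the optimisation problem underlying each of these contributions is the classical MaxEnt-of-moments formulation [2,3], stated below in our own notation for a density p with power moments \mu_j:

```latex
\max_{p}\; -\int p(x)\,\log p(x)\,\mathrm{d}x
\quad\text{subject to}\quad
\int x^{j}\,p(x)\,\mathrm{d}x = \mu_{j}, \qquad j = 0,\dots,m,
```

whose solution takes the exponential-polynomial form

```latex
p(x) = \exp\Big(-\sum_{j=0}^{m}\alpha_{j}\,x^{j}\Big),
```

with the Lagrange multipliers \alpha_j fixed by the m + 1 moment constraints.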
2. Theoretical Framework
2.1. Maximum Entropy as Constrained Bayesian Variational Inference
2.1.1. Variational Inference
2.1.2. MaxEnt Is Equivalent to Constrained Variational Inference
2.2. Fast Moment Estimation Techniques
2.2.1. Moments of a Gaussian Mixture
2.2.2. Moments of the Eigenvalue Spectrum of a Matrix
Algorithm 1 Stochastic Trace Estimation [20]
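The contents of the algorithm box above did not survive extraction. As a stand-in, here is a minimal sketch of Hutchinson-style stochastic trace estimation [20] for the raw spectral moments E[λ^k] = tr(A^k)/n of an n × n matrix A accessed only through matrix-vector products; the function name and parameters are our own:

```python
import numpy as np

def spectral_moments(matvec, n, n_moments, n_probes=30, seed=0):
    """Estimate E[lambda^k] = tr(A^k) / n for k = 0, ..., n_moments - 1,
    where A is accessed only through matvec(v) = A @ v."""
    rng = np.random.default_rng(seed)
    moments = np.zeros(n_moments)
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        v = z.copy()
        for k in range(n_moments):
            moments[k] += (z @ v) / n        # unbiased estimate of tr(A^k)/n
            v = matvec(v)                    # advance z -> A^{k+1} z
    return moments / n_probes
```

Each probe costs n_moments matrix-vector products, so the total cost is O(n_probes · n_moments) products, with no matrix factorisation required.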
2.3. The Proposed MaxEnt Algorithm
Algorithm 2 The Proposed MaxEnt Algorithm
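The box itself is again missing; the sketch below recovers the MaxEnt density by minimising the convex dual of the moment-constrained problem. For brevity it works with raw power moments on [0, 1], whereas the paper's Algorithm 2 uses orthogonal (Chebyshev-type) polynomial moments for stability at high order; all names are our own:

```python
import numpy as np
from scipy.optimize import minimize

def fit_maxent(mu, n_quad=400):
    """Fit p(x) = exp(-sum_j alpha_j x^j) on [0, 1] to the constraints
    int x^j p(x) dx = mu[j] by minimising the convex dual objective."""
    mu = np.asarray(mu, dtype=float)
    m = len(mu)
    x, w = np.polynomial.legendre.leggauss(n_quad)  # nodes/weights on [-1, 1]
    x, w = 0.5 * (x + 1.0), 0.5 * w                 # map to [0, 1]
    powers = x[None, :] ** np.arange(m)[:, None]    # powers[j, i] = x_i ** j

    def dual(alpha):
        p = np.exp(-alpha @ powers)                 # candidate MaxEnt density
        return w @ p + alpha @ mu                   # convex in alpha

    def grad(alpha):
        p = np.exp(-alpha @ powers)
        return mu - powers @ (w * p)                # mu_j - E_p[x^j]

    res = minimize(dual, np.zeros(m), jac=grad, method='BFGS')
    return res.x                                    # Lagrange multipliers alpha
```

At the optimum the gradient vanishes, i.e., the moments of the fitted density match mu exactly, which is the defining property of the MaxEnt solution.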
3. Applications
3.1. Log Determinant Estimation
3.1.1. Log Determinant Estimation as a Spectral Estimation Problem Using MaxEnt
Algorithm 3 MEMe for Log Determinant Estimation
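Combining the two sketches above gives a hedged sketch of the pipeline this box describes: normalise A by a Gershgorin upper bound on its largest eigenvalue [27] so that the spectrum lies in (0, 1], recover the spectral density by MaxEnt, and integrate log λ against it. This reuses spectral_moments and fit_maxent from the earlier sketches:

```python
import numpy as np

def meme_logdet(A, n_moments=10, n_probes=30, n_quad=400):
    """Sketch: log det A for a symmetric PD matrix A."""
    n = A.shape[0]
    lam_bound = np.max(np.sum(np.abs(A), axis=1))  # Gershgorin bound on lambda_max
    B = A / lam_bound                              # eigenvalues of B lie in (0, 1]
    mu = spectral_moments(lambda v: B @ v, n, n_moments, n_probes)
    alpha = fit_maxent(mu, n_quad)                 # MaxEnt spectral density of B
    x, w = np.polynomial.legendre.leggauss(n_quad)
    x, w = 0.5 * (x + 1.0), 0.5 * w
    p = np.exp(-alpha @ (x[None, :] ** np.arange(n_moments)[:, None]))
    # log det A = n log(lam_bound) + n E_p[log lambda'] over B's spectrum
    return n * np.log(lam_bound) + n * (w @ (np.log(x) * p))
```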
3.1.2. Experiments
Log Determinant Estimation for Synthetic Kernel Matrix
Log Determinant Estimation on Real Data
3.2. Bayesian Optimisation
3.2.1. MaxEnt for Information-Theoretic Bayesian Optimisation
Algorithm 4 MEMe for FITBO
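FITBO [9] scores a candidate x by an information gain that decomposes into the entropy of a Gaussian mixture predictive distribution minus the average entropy of its Gaussian components. The sketch below shows that outer computation, with the mixture-entropy term left as a callable so that the MaxEnt approximation of Algorithm 5 can be plugged in; the interface is our own:

```python
import numpy as np

def fitbo_acquisition(means, variances, gm_entropy):
    """alpha(x) = H[(1/M) sum_j N(m_j, v_j)] - (1/M) sum_j H[N(m_j, v_j)].
    means, variances: per-sample predictive moments at the candidate x.
    gm_entropy: approximator for the Gaussian mixture entropy (e.g., MEMe)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    component_entropies = 0.5 * np.log(2.0 * np.pi * np.e * variances)
    return gm_entropy(means, variances) - np.mean(component_entropies)
```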
Algorithm 5 MEMe for Approximating Gaussian Mixture Entropy
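A convenient identity drives this box: for an exponential-polynomial density p(x) = exp(-Σ_j α_j x^j), the differential entropy is simply H = -∫ p log p = Σ_j α_j μ_j. The sketch below exploits this, computing the 1-D mixture's power moments analytically via the Gaussian moment recursion m_k = μ m_{k-1} + (k-1) σ² m_{k-2} and reusing fit_maxent from the Algorithm 2 sketch; the 6-standard-deviation effective support and all names are our own assumptions:

```python
import numpy as np

def gm_entropy_meme(weights, means, variances, n_moments=10):
    """Sketch: entropy of a 1-D Gaussian mixture via MaxEnt moment matching."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    sd = np.sqrt(variances)
    lo, hi = np.min(means - 6.0 * sd), np.max(means + 6.0 * sd)
    m = (means - lo) / (hi - lo)                  # component means on [0, 1]
    s2 = variances / (hi - lo) ** 2               # component variances on [0, 1]
    mom = np.zeros((n_moments, len(weights)))     # mom[k, i] = E[x^k], component i
    mom[0] = 1.0
    if n_moments > 1:
        mom[1] = m
    for k in range(2, n_moments):
        mom[k] = m * mom[k - 1] + (k - 1) * s2 * mom[k - 2]
    mu = mom @ weights                            # mixture power moments
    alpha = fit_maxent(mu)                        # MaxEnt fit to those moments
    return alpha @ mu + np.log(hi - lo)           # undo the affine rescaling
```

The final log(hi - lo) term uses the identity H[aX + b] = H[X] + log a for differential entropy.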
3.2.2. Experiments
Entropy of the Gaussian Mixture in Bayesian Optimisation
Information-Theoretic Bayesian Optimisation
4. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
MEMe | Maximum entropy method
MaxEnt | Maximum entropy
PD | Positive definite
CG | Conjugate gradient
OMxnt | Old MaxEnt algorithm proposed by Bandyopadhyay et al. [2]
BO | Bayesian optimisation
GP | Gaussian process
FITBO | Fast information-theoretic Bayesian optimisation
GM | Gaussian mixture
VUB | Variational upper bound
Huber | Method proposed by Huber et al. [43] for estimating the Gaussian mixture entropy
MC | Monte Carlo sampling
MM | Moment matching
Quad | Numerical quadrature
IR | Immediate regret
Appendix A. Polynomial Approximations to the Log Determinant
Appendix A.1. Taylor Approximations Are Unsound
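In brief, the appendix's argument (reconstructed here in our own notation) is that the Taylor approach expands each eigenvalue's logarithm around 1:

```latex
\log\det A \;=\; \operatorname{tr}\log A
\;=\; -\sum_{i=1}^{n}\sum_{k=1}^{\infty}\frac{(1-\lambda_i)^{k}}{k},
```

a series whose radius of convergence is 1. For ill-conditioned matrices with eigenvalues λ_i near 0, the truncation error of any fixed-order expansion grows without bound, so truncated Taylor approximations cannot be sound uniformly over PD matrices.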
Appendix B. Bayesian Optimisation
Algorithm A1 Bayesian Optimisation
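The box content is missing; below is a minimal runnable sketch of the generic loop Algorithm A1 instantiates, with a GP surrogate and a lower-confidence-bound acquisition as an illustrative stand-in (FITBO's information-theoretic acquisition would replace it; all defaults are our own):

```python
import numpy as np

def rbf(X1, X2, lengthscale=0.2):
    """Squared-exponential kernel between two 1-D point sets."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def bayesian_optimisation(f, lo, hi, n_init=3, n_iter=20, beta=2.0, seed=0):
    """Minimal 1-D BO loop: fit a GP to the data so far, minimise a
    lower confidence bound over a grid, evaluate f there, repeat."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, n_init)
    y = np.array([f(x) for x in X])
    grid = np.linspace(lo, hi, 500)
    for _ in range(n_iter):
        K_inv = np.linalg.inv(rbf(X, X) + 1e-6 * np.eye(len(X)))
        Ks = rbf(grid, X)
        mean = Ks @ K_inv @ y                       # GP posterior mean
        var = 1.0 - np.einsum('ij,jk,ik->i', Ks, K_inv, Ks)  # posterior variance
        lcb = mean - beta * np.sqrt(np.maximum(var, 0.0))
        x_next = grid[np.argmin(lcb)]               # most promising point
        X = np.append(X, x_next)
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

# Example: bayesian_optimisation(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```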
Appendix B.1. Are Moments Sufficient to Fully Describe the Problem?
Appendix B.2. Experimental Results on Approximating the Gaussian Mixture Entropy
[Numeric entries in the two tables below did not survive extraction; the row and column structure is preserved.]

Methods | M = 200 | M = 400
---|---|---
MEMe-10 | |
MEMe-30 | |
MEMeL-10 | |
MEMeL-30 | |
Variational Upper Bound | |
Huber-2 | |
MC-100 | |
Moment Matching | |

Methods | M = 200 | M = 400
---|---|---
MEMe-10 | |
MEMe-30 | |
MEMeL-10 | |
MEMeL-30 | |
Variational Upper Bound | |
Huber-2 | |
MC-100 | |
Moment Matching | |
References
1. Granziol, D.; Roberts, S.J. Entropic determinants of massive matrices. In Proceedings of the 2017 IEEE International Conference on Big Data, Boston, MA, USA, 11–14 December 2017; pp. 88–93.
2. Bandyopadhyay, K.; Bhattacharya, A.K.; Biswas, P.; Drabold, D. Maximum entropy and the problem of moments: A stable algorithm. Phys. Rev. E 2005, 71, 057701.
3. Mead, L.R.; Papanicolaou, N. Maximum entropy in the problem of moments. J. Math. Phys. 1984, 25, 2404–2417.
4. Han, I.; Malioutov, D.; Shin, J. Large-scale log-determinant computation through stochastic Chebyshev expansions. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 908–917.
5. Dong, K.; Eriksson, D.; Nickisch, H.; Bindel, D.; Wilson, A.G. Scalable Log Determinants for Gaussian Process Kernel Learning. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6330–6340.
6. Zhang, Y.; Leithead, W.E. Approximate implementation of the logarithm of the matrix determinant in Gaussian process regression. J. Stat. Comput. Simul. 2007, 77, 329–348.
7. Hernández-Lobato, J.M.; Hoffman, M.W.; Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Proceedings of the 27th Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 918–926.
8. Wang, Z.; Jegelka, S. Max-value Entropy Search for Efficient Bayesian Optimization. arXiv 2017, arXiv:1703.01968.
9. Ru, B.; McLeod, M.; Granziol, D.; Osborne, M.A. Fast Information-theoretic Bayesian Optimisation. In Proceedings of the 2018 International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4381–4389.
10. Fox, C.W.; Roberts, S.J. A tutorial on variational Bayesian inference. Artif. Intell. Rev. 2012, 38, 85–95.
11. Kulesza, A.; Taskar, B. Determinantal Point Processes for Machine Learning. Found. Trends Mach. Learn. 2012, 5, 123–286.
12. Pressé, S.; Ghosh, K.; Lee, J.; Dill, K.A. Principles of Maximum Entropy and Maximum Caliber in Statistical Physics. Rev. Mod. Phys. 2013, 85, 1115–1141.
13. Jaynes, E.T. Information Theory and Statistical Mechanics. Phys. Rev. 1957, 106, 620–630.
14. Giffin, A.; Cafaro, C.; Ali, S.A. Application of the Maximum Relative Entropy method to the Physics of Ferromagnetic Materials. Phys. A Stat. Mech. Appl. 2016, 455, 11–26.
15. Neri, C.; Schneider, L. Maximum Entropy Distributions inferred from Option Portfolios on an Asset. Financ. Stoch. 2012, 16, 293–318.
16. Bretthorst, G.L. The maximum entropy method of moments and Bayesian probability theory. AIP Conf. Proc. 2013, 1553, 3–15.
17. Beal, M.J. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, University of London, London, UK, 2003.
18. Caticha, A. Entropic Inference and the Foundations of Physics (Monograph Commissioned by the 11th Brazilian Meeting on Bayesian Statistics, EBEB 2012); 2012. Available online: https://www.albany.edu/physics/ACaticha-EIFP-book.pdf (accessed on 31 May 2019).
19. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2009.
20. Hutchinson, M.F. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Commun. Stat. Simul. Comput. 1990, 19, 433–450.
21. Skilling, J. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods; Springer: Berlin, Germany, 1989; pp. 455–466.
22. Friedman, J.; Hastie, T.; Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9, 432–441.
23. Rasmussen, C.E.; Williams, C.K. Gaussian Processes for Machine Learning; The MIT Press: Cambridge, MA, USA, 2006; pp. 715–719.
24. MacKay, D.J. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
25. Van Aelst, S.; Rousseeuw, P. Minimum volume ellipsoid. Wiley Interdiscip. Rev. Comput. Stat. 2009, 1, 71–82.
26. Wainwright, M.J.; Jordan, M.I. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Trans. Signal Process. 2006, 54, 2099–2109.
27. Gershgorin, S.A. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR Serija Matematika 1931, 6, 749–754.
28. Ubaru, S.; Chen, J.; Saad, Y. Fast Estimation of tr(f(A)) via Stochastic Lanczos Quadrature. SIAM J. Matrix Anal. Appl. 2016, 38, 1075–1099.
29. Granziol, D.; Roberts, S. An Information and Field Theoretic approach to the Grand Canonical Ensemble. arXiv 2017, arXiv:1703.10099.
30. Fitzsimons, J.; Granziol, D.; Cutajar, K.; Osborne, M.; Filippone, M.; Roberts, S. Entropic Trace Estimates for Log Determinants. arXiv 2017, arXiv:1704.07223.
31. Davis, T.A.; Hu, Y. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 2011, 38, 1.
32. Hennig, P.; Osborne, M.A.; Girolami, M. Probabilistic numerics and uncertainty in computations. Proc. R. Soc. A 2015, 471, 20150142.
33. Bergstra, J.S.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; pp. 2546–2554.
34. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 2951–2959.
35. Thornton, C.; Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 847–855.
36. Hoffman, M.; Shahriari, B.; de Freitas, N. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–24 April 2014; pp. 365–374.
37. Lizotte, D.J.; Wang, T.; Bowling, M.H.; Schuurmans, D. Automatic Gait Optimization with Gaussian Process Regression. IJCAI 2007, 7, 944–949.
38. Martinez-Cantin, R.; de Freitas, N.; Doucet, A.; Castellanos, J.A. Active policy learning for robot planning and exploration under uncertainty. Robot. Sci. Syst. 2007, 3, 321–328.
39. Azimi, J.; Jalali, A.; Fern, X. Hybrid batch Bayesian optimization. arXiv 2012, arXiv:1202.5597.
40. Brochu, E.; Cora, V.M.; de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv 2010, arXiv:1012.2599.
41. Hennig, P.; Schuler, C.J. Entropy search for information-efficient global optimization. J. Mach. Learn. Res. 2012, 13, 1809–1837.
42. Hershey, J.R.; Olsen, P.A. Approximating the Kullback–Leibler divergence between Gaussian mixture models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, USA, 15–20 April 2007.
43. Huber, M.F.; Bailey, T.; Durrant-Whyte, H.; Hanebeck, U.D. On entropy approximation for Gaussian mixture random vectors. In Proceedings of the IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, Seoul, South Korea, 20–22 August 2008; pp. 181–188.
44. Molga, M.; Smutnicki, C. Test Functions for Optimization Needs. Available online: http://www.zsd.ict.pwr.wroc.pl/files/docs/functions.pdf (accessed on 30 May 2005).
45. Dixon, L.C.W. The global optimization problem: An introduction. Toward Glob. Optim. 1978, 2, 1–15.
46. Billingsley, P. Probability and Measure; Wiley: Hoboken, NJ, USA, 2012.
l | MEMe | Chebyshev | Lanczos
---|---|---|---
0.05 | 0.0014 | 0.0037 | 0.0024
0.15 | 0.0522 | 0.0104 | 0.0024
0.25 | 0.0387 | 0.0795 | 0.0072
0.35 | 0.0263 | 0.2302 | 0.0196
0.45 | 0.0284 | 0.3439 | 0.0502
0.55 | | 0.4089 | 0.0646
0.65 | | 0.5049 | 0.0838
0.75 | 0.0086 | 0.5049 | 0.1050
0.85 | 0.0177 | 0.5358 | 0.1199
[Two tables: per-method results for MEMe-10, MEMe-30, MEMeL-10, MEMeL-30, VUB, Huber-2, MC-100, and MM; column headers and numeric entries did not survive extraction.]