Abstract
Stochastic gradient descent (SGD) is a popular optimization method widely used in machine learning, but the variance of its gradient estimates leads to slow convergence. To accelerate convergence, many variance reduction methods have been proposed. However, most of these methods require extra memory or full-gradient computation, which makes them inefficient or even inapplicable to real-world applications with large-scale datasets. To address this issue, we propose a new flexible variance reduction method for SGD, named FVR-SGD, which reduces memory overhead and speeds up convergence by using flexible subsets without extra computation. We analyze the convergence properties of our method in detail and show that a linear convergence rate is guaranteed when flexible variance reduction is used. Efficient variants for distributed environments and deep neural networks are also proposed in this paper. Several numerical experiments are conducted on a variety of real-world large-scale datasets. The experimental results demonstrate that FVR-SGD outperforms currently popular algorithms. Specifically, the proposed method reduces the training time by up to 40% when solving the optimization problems of logistic regression, SVM, and neural networks.
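To make the idea concrete before the formal development, the following Python sketch shows an SVRG-style update loop in which the anchor gradient is estimated on a flexible subset of the data instead of requiring a full gradient pass. It is only an illustrative sketch under that assumption; the function and parameter names (`grad_fn`, `subset_size`, `inner_steps`) are placeholders and not the exact FVR-SGD pseudocode given later in the paper.

```python
import numpy as np

def flexible_vr_sgd(grad_fn, w0, n_samples, n_epochs=10, inner_steps=None,
                    subset_size=None, lr=0.01, seed=0):
    """Illustrative SVRG-style loop with a subset-based anchor gradient.

    grad_fn(w, i) must return the gradient of the loss on sample i at w.
    """
    rng = np.random.default_rng(seed)
    subset_size = subset_size or max(1, n_samples // 10)  # flexible subset size
    inner_steps = inner_steps or n_samples
    w = np.array(w0, dtype=float)

    for _ in range(n_epochs):
        # Snapshot the current iterate and estimate the anchor gradient on a
        # randomly drawn subset, avoiding a full pass over the dataset.
        w_snap = w.copy()
        subset = rng.choice(n_samples, size=subset_size, replace=False)
        g_snap = np.mean([grad_fn(w_snap, i) for i in subset], axis=0)

        for _ in range(inner_steps):
            i = rng.integers(n_samples)
            # Variance-reduced stochastic gradient.
            v = grad_fn(w, i) - grad_fn(w_snap, i) + g_snap
            w -= lr * v
    return w
```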




Acknowledgements
This work is supported by the National Key Research and Development Program of China (2016YFB1000100).
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Accelerating SGD using Flexible Variance Reduction on Large-Scale Datasets.”
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A preliminary version of this work was accepted by the 25th International Conference on Neural Information Processing (ICONIP 2018).
Appendix
Theorem 1
Suppose that \(w_*\) is the optimum and \(\nabla F(w_*)=0\), let \(\phi^l_m\) denote the accumulated \(w\) in the \(m\)-th epoch, and let \(\tilde{G}_m = \frac{1}{2n}\sum_{l=1}^{2n} F(\phi_m^l)\). Define
there exist \(\beta > 0\) and \(0<\delta <1\) such that:
which means our method converges linearly.
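Since the display equations of the theorem were lost in extraction, they are not reproduced here. As context only, a linear (geometric) convergence guarantee of this kind generically has the shape shown below, with \(\tilde{G}_m\) as defined above; this is the standard form of such statements, not a reconstruction of the missing inequality.

\[
{\mathbb {E}}\bigl[\tilde{G}_{m+1} - F(w_*)\bigr] \;\le\; \delta\, {\mathbb {E}}\bigl[\tilde{G}_m - F(w_*)\bigr], \qquad 0< \delta <1,
\]

so that the expected suboptimality gap contracts by a constant factor per epoch.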
Proof
Given any i, suppose
we know that \(g_i(w)\) is convex and \(\nabla g_i(w_*)=0\), so \(g_i(w_*)=\min\limits_w g_i(w)\). Based on the continuity of \(g_i(w)\), we can obtain
That is,
Taking the expectation with respect to \(i\) and using the fact that \({\mathbb {E}} [\nabla f_i(w_*)]=0\), we obtain
Inequality (18) can be obtained from the continuity (10) and convexity (11) of F.
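Because the numbered displays are not reproduced above, the following is a hedged reconstruction of the standard SVRG-style chain that this step usually follows, assuming each \(f_i\) is \(L\)-smooth (presumably the "continuity" condition (10)) and that \(g_i(w) = f_i(w) - f_i(w_*) - \nabla f_i(w_*)^{\top}(w - w_*)\); inequality (17) is most likely of this form, though the paper's exact constants may differ.

\[
0 = g_i(w_*) \le g_i\!\Bigl(w - \tfrac{1}{L}\nabla g_i(w)\Bigr) \le g_i(w) - \tfrac{1}{2L}\,\|\nabla g_i(w)\|^2,
\]
which, since \(\nabla g_i(w) = \nabla f_i(w) - \nabla f_i(w_*)\), gives
\[
\|\nabla f_i(w) - \nabla f_i(w_*)\|^2 \le 2L\bigl[f_i(w) - f_i(w_*) - \nabla f_i(w_*)^{\top}(w - w_*)\bigr],
\]
and taking the expectation over \(i\) with \({\mathbb {E}}[\nabla f_i(w_*)] = \nabla F(w_*) = 0\) yields
\[
{\mathbb {E}}_i\,\|\nabla f_i(w) - \nabla f_i(w_*)\|^2 \le 2L\bigl[F(w) - F(w_*)\bigr].
\]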
Now we can take the expectation of \(\parallel v^k_{m+1} \parallel^2\) with respect to \(i_k\) and obtain
The first and second inequalities use \(\parallel a + b \parallel ^2 \le 2\parallel a \parallel ^2 + 2\parallel b \parallel ^2\). The third inequality uses inequality (17) and \(\parallel \sum _{i=1}^n a_i \parallel ^2 \le n\sum _{i=1}^n \parallel a_i \parallel ^2\). The fourth inequality uses inequality (18) and \(F(\tilde{w}_m)=F(\frac{1}{2n} \sum _{l=1}^{2n}\phi _m^l) \le \frac{1}{2n} \sum _{l=1}^{2n} F(\phi _m^l)\). Note that \({\mathbb {E}}\, v^k_{m+1} = \nabla F(w^k_{m+1}) - \nabla F(\tilde{w}_m) + \tilde{g}_m\). Conditioned on \(w_{m+1}^{k-1}\), we can obtain the bound:
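The displayed chain itself is not reproduced, but the first application of \(\parallel a+b\parallel^2 \le 2\parallel a\parallel^2 + 2\parallel b\parallel^2\) typically splits the variance-reduced gradient as sketched below. This assumes \(v^k_{m+1} = \nabla f_{i_k}(w^k_{m+1}) - \nabla f_{i_k}(\tilde{w}_m) + \tilde{g}_m\), which is consistent with the expectation noted above but is stated here as an assumption rather than as the paper's exact chain.

\[
\|v^k_{m+1}\|^2 = \bigl\|\,[\nabla f_{i_k}(w^k_{m+1}) - \nabla f_{i_k}(w_*)] + [\nabla f_{i_k}(w_*) - \nabla f_{i_k}(\tilde{w}_m) + \tilde{g}_m]\,\bigr\|^2
\le 2\,\|\nabla f_{i_k}(w^k_{m+1}) - \nabla f_{i_k}(w_*)\|^2 + 2\,\|\nabla f_{i_k}(\tilde{w}_m) - \nabla f_{i_k}(w_*) - \tilde{g}_m\|^2,
\]

after which the first term can be handled by the bound of the previous step, while the second term involves the subset estimate \(\tilde{g}_m\) and is presumably where the flexible-subset analysis departs from standard SVRG.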
The first inequality uses inequality (11). The second inequality uses inequality (19). The inner product term can be bounded as follows:
The third inequality uses inequality (10), the fourth inequality uses the Cauchy–Schwarz inequality, and the fifth inequality uses inequality (11). We set \(F(\tilde{w}_m)-F(w_*)=\alpha (F(w_{m+1}^{k-1})-F(w_*))\), where \(\alpha > 0\). Then we can rewrite (20) as:
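For completeness, the Cauchy–Schwarz step mentioned here is typically combined with Young's inequality so that the inner product can be traded for squared norms: for any vectors \(a, b\) and any constant \(c > 0\),

\[
\langle a, b\rangle \;\le\; \|a\|\,\|b\| \;\le\; \frac{c}{2}\,\|a\|^2 + \frac{1}{2c}\,\|b\|^2 ,
\]

where the free parameter \(c\) controls the trade-off between the two squared-norm terms; this is a generic tool rather than the paper's specific choice of constants.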
By summing the previous inequalities over \(k=1,\ldots ,2K\) and taking the expectation with respect to all the history, we obtain
The first inequality follows from summing the previous inequality over the \(2K\) iterations of the \((m+1)\)-th epoch. Let \(\tilde{G}^{-j}_{m+1}\) denote \(\tilde{G}_{m+1}\) with the subset \(j\) selected in the \((m+1)\)-th epoch removed, that is, \(\tilde{G}^{-j}_{m+1} = \frac{1}{2n-2K} \sum _{l=1}^{2n-2K} F(\phi _m^l)\). The second inequality uses
The third inequality uses
Rewriting (23), we get:
Thus we obtain the bound:
When \(1-\mu \eta +\eta < 1\), \(\beta > 0\), \(\gamma > 0\), and \(\frac{\gamma }{\beta } < 1\), \(\delta\) can be restricted to \((0,1)\). Therefore, Theorem 1 requires the following conditions to be satisfied:
that is
Then the result immediately follows. \(\square\)
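For comparison only, and as a fact from the SVRG literature rather than from this paper: the original SVRG analysis of Johnson and Zhang (2013) obtains a geometric rate of the same flavor, with contraction factor

\[
\alpha_{\mathrm{SVRG}} \;=\; \frac{1}{\mu \eta (1-2L\eta )\,m} \;+\; \frac{2L\eta }{1-2L\eta } \;<\; 1
\]

for a suitable step size \(\eta\) and a sufficiently large number of inner iterations \(m\). The factor \(\delta\) of Theorem 1 plays the analogous role here, with the additional constants \(\beta\) and \(\gamma\) presumably accounting for the flexible, subset-based anchor gradient.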
About this article
Cite this article
Tang, M., Qiao, L., Huang, Z. et al. Accelerating SGD using flexible variance reduction on large-scale datasets. Neural Comput & Applic 32, 8089–8100 (2020). https://doi.org/10.1007/s00521-019-04315-5