Abstract
Stochastic gradient descent (SGD) is a popular optimization method widely used in machine learning, but the variance of its gradient estimates leads to slow convergence. To accelerate convergence, many variance reduction methods have been proposed. However, most of these methods require extra memory or full-gradient computation, which makes them inefficient or even inapplicable to real-world applications with large-scale datasets. To address this issue, we propose a new flexible variance reduction method for SGD, named FVR-SGD, which reduces memory overhead and speeds up convergence by using flexible subsets without extra computation. We analyze the convergence properties of our method in detail and show that a linear convergence rate is guaranteed when flexible variance reduction is used. Efficient variants for distributed environments and deep neural networks are also proposed in this paper. Several numerical experiments are conducted on a variety of real-world large-scale datasets. The experimental results demonstrate that FVR-SGD outperforms currently popular algorithms. Specifically, the proposed method reduces the training time by up to 40% when solving the optimization problems of logistic regression, SVM, and neural networks.
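To make the idea concrete before the formal development, the following Python sketch shows an SVRG-style update loop in which the anchor gradient is estimated on a flexible subset of the data instead of requiring a full gradient pass. It is only an illustrative sketch under that assumption; the function and parameter names (`grad_fn`, `subset_size`, `inner_steps`) are placeholders and not the exact FVR-SGD pseudocode given later in the paper.

```python
import numpy as np

def flexible_vr_sgd(grad_fn, w0, n_samples, n_epochs=10, inner_steps=None,
                    subset_size=None, lr=0.01, seed=0):
    """Illustrative SVRG-style loop with a subset-based anchor gradient.

    grad_fn(w, i) must return the gradient of the loss on sample i at w.
    """
    rng = np.random.default_rng(seed)
    subset_size = subset_size or max(1, n_samples // 10)  # flexible subset size
    inner_steps = inner_steps or n_samples
    w = np.array(w0, dtype=float)

    for _ in range(n_epochs):
        # Snapshot the current iterate and estimate the anchor gradient on a
        # randomly drawn subset, avoiding a full pass over the dataset.
        w_snap = w.copy()
        subset = rng.choice(n_samples, size=subset_size, replace=False)
        g_snap = np.mean([grad_fn(w_snap, i) for i in subset], axis=0)

        for _ in range(inner_steps):
            i = rng.integers(n_samples)
            # Variance-reduced stochastic gradient.
            v = grad_fn(w, i) - grad_fn(w_snap, i) + g_snap
            w -= lr * v
    return w
```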




Acknowledgements
This work is supported by the National Key Research and Development Program of China (2016YFB1000100).
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Accelerating SGD using Flexible Variance Reduction on Large-Scale Datasets.”
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A preliminary version of this work was accepted by the 25th International Conference on Neural Information Processing (ICONIP 2018).
Appendix
Theorem 1
Suppose that \(w_*\) is the optimum and \(\nabla F(w_*)=0\), let \(\phi^l_m\) denote the accumulated \(w\) in the \(m\)-th epoch, and let \(\tilde{G}_m = \frac{1}{2n}\sum_{l=1}^{2n} F(\phi_m^l)\). Define
there exist \(\beta > 0\) and \(0<\delta <1\) such that:
which means our method converges linearly.
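Since the display equations of the theorem were lost in extraction, they are not reproduced here. As context only, a linear (geometric) convergence guarantee of this kind generically has the shape shown below, with \(\tilde{G}_m\) as defined above; this is the standard form of such statements, not a reconstruction of the missing inequality.

\[
{\mathbb {E}}\bigl[\tilde{G}_{m+1} - F(w_*)\bigr] \;\le\; \delta\, {\mathbb {E}}\bigl[\tilde{G}_m - F(w_*)\bigr], \qquad 0< \delta <1,
\]

so that the expected suboptimality gap contracts by a constant factor per epoch.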
Proof
Given any i, suppose
we know that \(g_i(w)\) is convex and \(\nabla g_i(w_*)=0\), so \(g_i(w_*)=\min\limits_w g_i(w)\). Based on the continuity of \(g_i(w)\), we can obtain
That is,
Taking the expectation with respect to \(i\) and using the fact that \({\mathbb {E}} [\nabla f_i(w_*)]=0\), we obtain
Inequality (18) can be obtained from the continuity (10) and convexity (11) of F.
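Because the numbered displays are not reproduced above, the following is a hedged reconstruction of the standard SVRG-style chain that this step usually follows, assuming each \(f_i\) is \(L\)-smooth (presumably the "continuity" condition (10)) and that \(g_i(w) = f_i(w) - f_i(w_*) - \nabla f_i(w_*)^{\top}(w - w_*)\); inequality (17) is most likely of this form, though the paper's exact constants may differ.

\[
0 = g_i(w_*) \le g_i\!\Bigl(w - \tfrac{1}{L}\nabla g_i(w)\Bigr) \le g_i(w) - \tfrac{1}{2L}\,\|\nabla g_i(w)\|^2,
\]
which, since \(\nabla g_i(w) = \nabla f_i(w) - \nabla f_i(w_*)\), gives
\[
\|\nabla f_i(w) - \nabla f_i(w_*)\|^2 \le 2L\bigl[f_i(w) - f_i(w_*) - \nabla f_i(w_*)^{\top}(w - w_*)\bigr],
\]
and taking the expectation over \(i\) with \({\mathbb {E}}[\nabla f_i(w_*)] = \nabla F(w_*) = 0\) yields
\[
{\mathbb {E}}_i\,\|\nabla f_i(w) - \nabla f_i(w_*)\|^2 \le 2L\bigl[F(w) - F(w_*)\bigr].
\]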
Now we can take the expectation of \(\parallel v^k_{m+1} \parallel^2\) with respect to \(i_k\) and obtain
The first and second inequalities use \(\parallel a + b \parallel ^2 \le 2\parallel a \parallel ^2 + 2\parallel b \parallel ^2\). The third inequality uses inequality (17) and \(\parallel \sum _{i=1}^n a_i \parallel ^2 \le n\sum _{i=1}^n \parallel a_i \parallel ^2\). The fourth inequality uses inequality (18) and \(F(\tilde{w}_m)=F(\frac{1}{2n} \sum _{l=1}^{2n}\phi _m^l) \le \frac{1}{2n} \sum _{l=1}^{2n} F(\phi _m^l)\). Note that \({\mathbb {E}}\, v^k_{m+1} = \nabla F(w^k_{m+1}) - \nabla F(\tilde{w}_m) + \tilde{g}_m\). Conditioned on \(w_{m+1}^{k-1}\), we can obtain the bound:
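The displayed chain itself is not reproduced, but the first application of \(\parallel a+b\parallel^2 \le 2\parallel a\parallel^2 + 2\parallel b\parallel^2\) typically splits the variance-reduced gradient as sketched below. This assumes \(v^k_{m+1} = \nabla f_{i_k}(w^k_{m+1}) - \nabla f_{i_k}(\tilde{w}_m) + \tilde{g}_m\), which is consistent with the expectation noted above but is stated here as an assumption rather than as the paper's exact chain.

\[
\|v^k_{m+1}\|^2 = \bigl\|\,[\nabla f_{i_k}(w^k_{m+1}) - \nabla f_{i_k}(w_*)] + [\nabla f_{i_k}(w_*) - \nabla f_{i_k}(\tilde{w}_m) + \tilde{g}_m]\,\bigr\|^2
\le 2\,\|\nabla f_{i_k}(w^k_{m+1}) - \nabla f_{i_k}(w_*)\|^2 + 2\,\|\nabla f_{i_k}(\tilde{w}_m) - \nabla f_{i_k}(w_*) - \tilde{g}_m\|^2,
\]

after which the first term can be handled by the bound of the previous step, while the second term involves the subset estimate \(\tilde{g}_m\) and is presumably where the flexible-subset analysis departs from standard SVRG.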
The first inequality uses inequality (11). The second inequality uses inequality (19). The inner product term can be bounded as follows:
The third inequality uses inequality (10), the fourth inequality uses the Cauchy–Schwarz inequality, and the fifth inequality uses inequality (11). We set \(F(\tilde{w}_m)-F(w_*)=\alpha (F(w_{m+1}^{k-1})-F(w_*))\), where \(\alpha > 0\). Then we can rewrite (20) as:
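For completeness, the Cauchy–Schwarz step mentioned here is typically combined with Young's inequality so that the inner product can be traded for squared norms: for any vectors \(a, b\) and any constant \(c > 0\),

\[
\langle a, b\rangle \;\le\; \|a\|\,\|b\| \;\le\; \frac{c}{2}\,\|a\|^2 + \frac{1}{2c}\,\|b\|^2 ,
\]

where the free parameter \(c\) controls the trade-off between the two squared-norm terms; this is a generic tool rather than the paper's specific choice of constants.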
By summing the previous inequalities over \(k=1,\ldots ,2K\) and taking the expectation with respect to all the history, we obtain
The first inequality follows from summing the previous inequality over the \(2K\) iterations of the \((m+1)\)-th epoch. Let \(\tilde{G}^{-j}_{m+1}\) denote \(\tilde{G}_{m+1}\) with the subset \(j\) selected in the \((m+1)\)-th epoch removed, that is, \(\tilde{G}^{-j}_{m+1} = \frac{1}{2n-2K} \sum _{l=1}^{2n-2K} F(\phi _m^l)\). The second inequality uses
The third inequality uses
Rewriting (23), we get:
Thus we obtain the bound:
When \(1-\mu \eta +\eta < 1\), \(\beta > 0\), \(\gamma > 0\), and \(\frac{\gamma }{\beta } < 1\), \(\delta\) can be restricted to \((0,1)\). Therefore, Theorem 1 requires the following conditions to be satisfied:
that is
Then the result immediately follows. \(\square\)
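For comparison only, and as a fact from the SVRG literature rather than from this paper: the original SVRG analysis of Johnson and Zhang (2013) obtains a geometric rate of the same flavor, with contraction factor

\[
\alpha_{\mathrm{SVRG}} \;=\; \frac{1}{\mu \eta (1-2L\eta )\,m} \;+\; \frac{2L\eta }{1-2L\eta } \;<\; 1
\]

for a suitable step size \(\eta\) and a sufficiently large number of inner iterations \(m\). The factor \(\delta\) of Theorem 1 plays the analogous role here, with the additional constants \(\beta\) and \(\gamma\) presumably accounting for the flexible, subset-based anchor gradient.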
About this article
Cite this article
Tang, M., Qiao, L., Huang, Z. et al. Accelerating SGD using flexible variance reduction on large-scale datasets. Neural Comput & Applic 32, 8089–8100 (2020). https://doi.org/10.1007/s00521-019-04315-5