A fast unified algorithm for solving group-lasso penalize learning problems | Statistics and Computing Skip to main content

Advertisement

Log in

A fast unified algorithm for solving group-lasso penalize learning problems

  • Published:
Statistics and Computing Aims and scope Submit manuscript

Abstract

This paper concerns a class of group-lasso learning problems where the objective function is the sum of an empirical loss and the group-lasso penalty. For a class of loss function satisfying a quadratic majorization condition, we derive a unified algorithm called groupwise-majorization-descent (GMD) for efficiently computing the solution paths of the corresponding group-lasso penalized learning problem. GMD allows for general design matrices, without requiring the predictors to be group-wise orthonormal. As illustration examples, we develop concrete algorithms for solving the group-lasso penalized least squares and several group-lasso penalized large margin classifiers. These group-lasso models have been implemented in an R package gglasso publicly available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/packages/gglasso. On simulated and real data, gglasso consistently outperforms the existing software for computing the group-lasso that implements either the classical groupwise descent algorithm or Nesterov’s method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

References

  • Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Nat. Acad. Sci. 96(12), 6745 (1999)

    Article  Google Scholar 

  • Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57, 1413–1457 (2004)

    Article  MATH  Google Scholar 

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–451 (2004)

    Article  MATH  MathSciNet  Google Scholar 

  • Friedman, J., Hastie, T., Tibshirani, R.: Regularized paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)

    Article  Google Scholar 

  • Fu, W.: Penalized regressions: the bridge versus the lasso. J. Comput. Gr. Stat. 7(3), 397–416 (1998)

    Google Scholar 

  • Genkin, A., Lewis, D., Madigan, D.: Large-scale Bayesian logistic regression for text categorization. Technometrics 49(3), 291–304 (2007)

    Article  MathSciNet  Google Scholar 

  • Gorman, R., Sejnowski, T.: Analysis of hidden units in a layered network trained to classify sonar targets. Neural Netw. 1(1), 75–89 (1988)

    Article  Google Scholar 

  • Graham, K., de Las Morenas, A., Tripathi, A., King, C., Kavanah, M., Mendez, J., Stone, M., Slama, J., Miller, M., Antoine, G., et al.: Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile. Br. J. Cancer 102(8), 1284–1293 (2010)

    Article  Google Scholar 

  • Hunter, D., Lange, K.: A tutorial on MM algorithms. Am. Stat. 58(1), 30–37 (2004)

    Article  MathSciNet  Google Scholar 

  • Lange, K., Hunter, D., Yang, I.: Optimization transfer using sur- rogate objective functions (with discussion). J. Comput. Gr. Stat. 9, 1–20 (2000)

  • Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections, Arizona State University. URL: http://www.public.asu.edu/jye02/Software/SLEP (2009)

  • Meier, L., van de Geer, S., Bühlmann, P.: The group lasso for logistic regression. J. R. Stat. Soc. Ser. B 70, 53–71 (2008)

    Article  MATH  Google Scholar 

  • Nesterov, Y.: ‘Introductory lectures on convex optimization: A basic course’, Operations Research. (2004)

  • Nesterov, Y.: Gradient methods for minimizing composite objective function, Technical report, Technical Report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL) (2007)

  • Osborne, M., Presnell, B., Turlach, B.: A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20(3), 389–403 (2000)

    Article  MATH  MathSciNet  Google Scholar 

  • Quinlan, J.: Combining instance-based and model-based learning. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 236–243 (1993)

  • Sæbø, S., Almøy, T., Aarøe, J., Aastveit, A.: St-pls: a multi-directional nearest shrunken centroid type classifier via pls. J. Chemom. 22(1), 54–62 (2008)

    Article  Google Scholar 

  • Scheetz, T., Kim, K., Swiderski, R., Philp, A., Braun, T., Knudtson, K., Dorrance, A., DiBona, G., Huang, J., Casavant, T., et al.: Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Nat. Acad. Sci. 103(39), 14429–14434 (2006)

    Article  Google Scholar 

  • Segal, M., Dahlquist, K., Conklin, B.: Regression approaches for microarray data analysis. J. Comput. Biol. 10(6), 961–980 (2003)

    Article  Google Scholar 

  • Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D’Amico, A., Richie, J., et al.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2), 203–209 (2002)

    Article  Google Scholar 

  • Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996)

    MATH  MathSciNet  Google Scholar 

  • Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R.: Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B 74, 245–266 (2012)

    Article  MathSciNet  Google Scholar 

  • Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)

    Article  MATH  MathSciNet  Google Scholar 

  • Vogt, J., Roth, V.: A complete analysis of the \(l_{1, p}\) group-lasso. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML 2012, Omnipress, pp 185–192 (2012)

  • Wang, H., Leng, C.: A note on adaptive group lasso. Comput. Stat. Data Anal. 52, 5277–5286 (2008)

    Article  MATH  MathSciNet  Google Scholar 

  • Wang, L., Zhu, J., Zou, H.: Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24, 412–419 (2008)

    Article  Google Scholar 

  • Wu, T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2, 224–244 (2008)

    Article  MATH  MathSciNet  Google Scholar 

  • Wu, T., Lange, K.: The MM alternative to EM. Stat. Sci. 4, 492–505 (2010)

    Article  MathSciNet  Google Scholar 

  • Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68, 49–67 (2006)

    Article  MATH  MathSciNet  Google Scholar 

  • Zhang, T.: Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 32, 56–85 (2004)

    Article  MATH  Google Scholar 

  • Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)

    Article  MATH  Google Scholar 

Download references

Acknowledgments

The authors thank the editor, an associate editor and two referees for their helpful comments and suggestions. This work is supported in part by NSF Grant DMS-08-46068.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hui Zou.

Appendix: Proofs

Appendix: Proofs

Proof of Lemma 1

Part (1). For any \(\varvec{\beta }\) and \(\varvec{\beta }^*\), write \(\varvec{\beta }-\varvec{\beta }^*=V\) and define \(g(t)=L(\varvec{\beta }^*+tV \mid \mathbf{D})\) so that

$$\begin{aligned} g(0)=L(\varvec{\beta }^* \mid \mathbf{D}), \quad g(1)=L(\varvec{\beta }\mid \mathbf{D}). \end{aligned}$$

By the mean value theorem, \(\exists \ a \in (0,1)\) such that

$$\begin{aligned} g(1)=g(0)+g^{\prime }(a)=g(0)+g^{\prime }(0)+[g^{\prime }(a)-g^{\prime }(0)]. \end{aligned}$$
(25)

Write \(\Phi _f^{\prime }=\frac{\partial {\Phi (y,f)}}{\partial {f}}\). Note that

$$\begin{aligned} g^{\prime }(t)=\frac{1}{n}\sum ^n_{i=1}\tau _i\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+tV))(\mathbf{x}^{{ \mathsf T}}_i V). \end{aligned}$$
(26)

Thus \(g^{\prime }(0)=(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}}\nabla L(\varvec{\beta }^* | \mathbf{D}).\) Moreover, from (26) we have

$$\begin{aligned} |g^{\prime }(a)-g^{\prime }(0)|&= \vert \frac{1}{n}\sum ^n_{i=1}\tau _i[\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+aV))\nonumber \\&-\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i \varvec{\beta }^*)](\mathbf{x}^{{ \mathsf T}}_i V) \vert \nonumber \\&\le \frac{1}{n}\sum ^n_{i=1} \tau _i|\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+aV))\nonumber \\&-\Phi _f^{\prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i \varvec{\beta }^*) | |\mathbf{x}^{{ \mathsf T}}_i V| \nonumber \\&\le \frac{1}{n}\sum ^n_{i=1} C\tau _i|\mathbf{x}^{{ \mathsf T}}_i aV | |\mathbf{x}^{{ \mathsf T}}_i V| \end{aligned}$$
(27)
$$\begin{aligned}&\le \frac{1}{n}\sum ^n_{i=1} C \tau _i \Vert \mathbf{x}^{{ \mathsf T}}_i V \Vert ^2_2 \nonumber \\&= \frac{C}{n}V^{{ \mathsf T}} [ \mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}] V, \end{aligned}$$
(28)

where in (27) we have used the inequality \(| \Phi ^{\prime }(y,f_1)-\Phi ^{\prime }(y,f_2)| \le C |f_1-f_2|\). Plugging (27) into (25) we have

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\varvec{\beta }^* \mid \mathbf{D})+(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}}\nabla L(\varvec{\beta }^* | \mathbf{D})\\&+\frac{1}{2}(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}} \mathbf{H}( \varvec{\beta }- \varvec{\beta }^*), \end{aligned}$$

with \( \mathbf{H}=\frac{2C}{n}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}. \)

Part (2). Write \(\Phi _f^{\prime \prime }=\frac{\partial {\Phi ^2(y,f)}}{\partial {f^2}}\). By Taylor’s expansion, \(\exists \ b \in (0,1)\) such that

$$\begin{aligned} g(1)=g(0)+g^{\prime }(0)+g^{\prime \prime }(b). \end{aligned}$$
(29)

Note that

$$\begin{aligned} g^{\prime \prime }(b)&= \frac{1}{n}\sum ^n_{i=1}\tau _i \Phi _f^{\prime \prime }(y_i,\mathbf{x}^{{ \mathsf T}}_i (\varvec{\beta }^*+bV))(\mathbf{x}^{{ \mathsf T}}_i V)^2\nonumber \\&\le \frac{1}{n}\sum ^n_{i=1}C_2\tau _i (\mathbf{x}^{{ \mathsf T}}_i V)^2, \end{aligned}$$
(30)

where we have used the inequality \(\Phi _f^{\prime \prime } \le C_2\). Plugging (30) into (29) we have

$$\begin{aligned} L(\varvec{\beta }\mid \mathbf{D})&\le L(\varvec{\beta }^* \mid \mathbf{D})+(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}}\nabla L(\varvec{\beta }^* | \mathbf{D})\\&+\frac{1}{2}(\varvec{\beta }-\varvec{\beta }^*)^{{ \mathsf T}} \mathbf{H}( \varvec{\beta }- \varvec{\beta }^*), \end{aligned}$$

with \( \mathbf{H}=\frac{C_2}{n}\mathbf{X}^{{ \mathsf T}} \varvec{\Gamma }\mathbf{X}. \) \(\square \)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yang, Y., Zou, H. A fast unified algorithm for solving group-lasso penalize learning problems. Stat Comput 25, 1129–1141 (2015). https://doi.org/10.1007/s11222-014-9498-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11222-014-9498-5

Keywords

Navigation