Residual projection for quantile regression in vertically partitioned big data | Data Mining and Knowledge Discovery


Abstract

Standard regression techniques model only the mean of the response variable. Quantile regression (QR) is more informative in that it describes the relationship between the response variable and the covariates at different quantiles. It is particularly useful for non-normally distributed data with skewness or heterogeneity, which arise routinely in many scientific fields, such as economics, finance, public health and biology. Although its theory has been well developed in the literature, its computation on big data still faces multiple challenges, especially for vertically stored big data in modern distributed environments, where communication efficiency and security are usually the primary considerations. While the popular alternating direction method of multipliers (ADMM) provides a general computational solution, its slow convergence becomes a bottleneck when communication cost dominates local computation cost, as in Internet of Things (IoT) networks. Motivated by the residual projection technique, in this paper we propose an iterative parallel framework, PIQR, that converges faster and uses a more secure data transmission scheme, and we establish its convergence property. This framework is further extended to composite quantile regression (CQR), a modified QR technique that improves estimation efficiency at extreme quantiles. Simulation studies show that both the ADMM-based method and PIQR achieve favorable estimation accuracy in distributed environments. While PIQR is inferior to the ADMM-based method in local computation, it requires far fewer iterations to converge, and hence significantly improves overall computational efficiency when communication cost is the dominating factor. Moreover, PIQR transmits between machines only data involving residual information, and so better prevents leakage of important data information than the ADMM-based method.
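To make the objects named in the abstract concrete, the following is a minimal sketch of the standard QR check (pinball) loss and of how fitted values decompose when covariates are split by column across machines. It is not the PIQR algorithm, whose update rules are not given in this preview. The function names (check_loss, total_loss, vertical_fitted), the toy data and the coefficient values are all assumptions made for illustration.

```python
# Illustrative sketch only: standard check loss plus a column-wise (vertical)
# split of a linear predictor. This is NOT the paper's PIQR algorithm.

def check_loss(u, tau):
    """Check (pinball) loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(y, fitted, tau):
    """Sum of check losses over the residuals y_i - fitted_i."""
    return sum(check_loss(yi - fi, tau) for yi, fi in zip(y, fitted))


def vertical_fitted(blocks, coefs):
    """Fitted values when the covariate columns are split across machines.

    blocks[m] holds the rows of the columns stored on (hypothetical)
    machine m, and coefs[m] the matching coefficients. The full linear
    predictor is the sum of the per-machine partial fits.
    """
    n = len(blocks[0])
    fitted = [0.0] * n
    for X_m, beta_m in zip(blocks, coefs):
        for i, row in enumerate(X_m):
            fitted[i] += sum(x * b for x, b in zip(row, beta_m))
    return fitted


# Toy data, made up for illustration: two machines, one column each.
y = [1.0, 2.0, 4.0, 8.0]
blocks = [
    [[1.0], [1.0], [1.0], [1.0]],  # machine 1: intercept column
    [[0.0], [1.0], [2.0], [3.0]],  # machine 2: a single covariate
]
coefs = [[1.0], [2.0]]             # arbitrary trial coefficients

fitted = vertical_fitted(blocks, coefs)
residuals = [yi - fi for yi, fi in zip(y, fitted)]
loss_median = total_loss(y, fitted, tau=0.5)  # loss at the median
loss_upper = total_loss(y, fitted, tau=0.9)   # loss at the 0.9 quantile
```

The asymmetry of the loss is what targets a quantile rather than the mean: for tau above 0.5, positive residuals are penalized more heavily than negative ones. In the vertical setting each machine only needs its own columns to form its partial fit, which is why quantities built from residuals are natural candidates for what gets communicated.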

Funding

Nan Lin’s work is supported by the NVIDIA GPU grant program. Ye Fan’s work is supported by the Initial Scientific Research Fund of Young Teachers in Capital University of Economics and Business [Grant No. XRZ2022062], and partly supported by the Special Fund for Basic Scientific Research of Beijing Municipal Colleges in Capital University of Economics and Business [Grant No. QNTD202207]. Jr-Shin Li’s work is supported by the Air Force Office of Scientific Research under the award FA9550-21-1-0335.

Author information

Corresponding author

Correspondence to Nan Lin.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this work.

Additional information

Responsible editor: Aristides Gionis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fan, Y., Li, JS. & Lin, N. Residual projection for quantile regression in vertically partitioned big data. Data Min Knowl Disc 37, 710–735 (2023). https://doi.org/10.1007/s10618-022-00914-4

  • DOI: https://doi.org/10.1007/s10618-022-00914-4
