PCA for large data sets with parallel data summarization

Ordonez, Carlos; Mohanam, Naveen; Garcia-Alvarado, Carlos

doi:10.1007/s10619-013-7134-6

PCA for large data sets with parallel data summarization

Published: 10 September 2013

Volume 32, pages 377–403, (2014)
Cite this article

Distributed and Parallel Databases Aims and scope Submit manuscript

Carlos Ordonez¹,
Naveen Mohanam¹ &
Carlos Garcia-Alvarado¹

845 Accesses
Explore all metrics

Abstract

Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well known model for dimensionality reduction in statistical analysis, which requires a demanding number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm that can compute PCA in one pass on a large data set based on summarization matrices. We also study how to integrate our algorithm with a DBMS; our solution is based on a combination of parallel data set summarization via user-defined aggregations and calling the MKL parallel variant of the LAPACK library to solve Singular Value Decomposition (SVD) in RAM. Our algorithm is theoretically shown to achieve linear speedup, linear scalability on data size, quadratic time on dimensionality (but in RAM), spending most of the time on data set summarization, despite the fact that SVD has cubic time complexity on dimensionality. Experiments with large data sets on multicore CPUs show that our solution is much faster than the R statistical package as well as solving PCA with SQL queries. Benchmarking on multicore CPUs and a parallel DBMS running on multiple nodes confirms linear speedup and linear scalability.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Institutional subscriptions

References

Blakeley, J.A., Rao, V., Kunen, I., Prout, A., Henaire, M., Kleinerman, C.: .NET database programmability and extensibility in Microsoft SQL server. In: Proc. ACM SIGMOD Conference, pp. 1087–1098 (2008)
Google Scholar
Brown, G.P.: Overview of SciDB: large scale array storage, processing and analysis. In: Proc. ACM SIGMOD Conference (2010)
Google Scholar
Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: Parallel tiled QR factorization for multicore architectures. Concurr. Comput. 20(13), 1573–1590 (2008)
Article Google Scholar
Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G.R., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. In: Proc. NIPS Conference, pp. 281–288 (2006)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Article Google Scholar
Demmel, J.W.: Applied Numerical Linear Algebra, 1st edn. SIAM, Philadelphia (1997)
Book MATH Google Scholar
Dongarra, J., Duff, I.S., Sorensen, D.C., van der Vost, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998)
Book MATH Google Scholar
Eddelbuettel, D.: Benchmarking single-and multi-core BLAS implementations and GPUs for use with R. Mathematica (2010)
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book, 1st edn. Prentice Hall, New York (2001)
Google Scholar
Halko, N.H., Martinsson, P.-G., Shkolnisky, Y., Tygert, M.: An algorithm for the principal component analysis of large data sets. SIAM J. Sci. Comput. 33(5), 2580–2594 (2011)
Article MATH MathSciNet Google Scholar
Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)
Google Scholar
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning, 1st edn. Springer, New York (2001)
Book MATH Google Scholar
Hellerstein, J., Re, C., Schoppmann, F., Wang, D.Z., et al.: The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endow. 5(12), 1700–1711 (2012)
Google Scholar
Kanth, K.V.R., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD ’98), pp. 166–176. ACM, New York (1998)
Chapter Google Scholar
Krycha, K., Wexler, J.: Sas and teradata in-database model development using the business example of churn modeling. In: Proc. SAS Global Forum (2013). Paper id 087
Google Scholar
Kyriacou, C., Evripidou, P., Trancoso, P.: Data-driven multithreading using conventional microprocessors. IEEE Trans. Parallel Distrib. Syst. 17(10), 1176–1188 (2006)
Article Google Scholar
Lahabar, S., Narayanan, P.J.: Singular value decomposition on GPU using CUDA. In: Proc. of IEEE International Symposium on Parallel & Distributed Processing, pp. 1–10 (2009)
Google Scholar
Luo, C., Thakkar, H., Wang, H., Zaniolo, C.: A native extension of SQL for mining data streams. In: Proc. ACM SIGMOD Conference, pp. 873–875 (2005)
Google Scholar
Navas, M., Ordonez, C.: Efficient computation of PCA with SVD in SQL. In: KDD Workshop on Data Mining Using Tensors and Matrices (2009), Article No. 5
Google Scholar
Ordonez, C.: Statistical model computation with UDFs. IEEE Trans. Knowl. Data Eng. 22(12), 1752–1765 (2010)
Article Google Scholar
Ordonez, C., Cereghini, P.: SQLEM: fast clustering in SQL using the EM algorithm. In: Proc. ACM SIGMOD Conference, pp. 559–570 (2000)
Google Scholar
Ordonez, C., Mohanam, N., Garcia-Alvarado, C., Tosic, P.T., Martinez, E.: Fast PCA computation in a DBMS with aggregate UDFs and LAPACK. In: Proc. ACM CIKM Conference (2012)
Google Scholar
Ries, F., DeMarco, T., Guerrieri, R.: Triangular matrix inversion on heterogeneous multicore systems. IEEE Trans. Parallel Distrib. Syst. 23(1), 177–184 (2012)
Article Google Scholar
Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)
Article Google Scholar

Download references

Acknowledgements

This work was supported by NSF grant IIS 0914861.

Author information

Authors and Affiliations

Department of Computer Science, University of Houston, Houston, TX, 77204, USA
Carlos Ordonez, Naveen Mohanam & Carlos Garcia-Alvarado

Authors

Carlos Ordonez
View author publications
You can also search for this author in PubMed Google Scholar
Naveen Mohanam
View author publications
You can also search for this author in PubMed Google Scholar
Carlos Garcia-Alvarado
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Carlos Ordonez.

Additional information

Communicated by Feifei Li and Suman Nath.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ordonez, C., Mohanam, N. & Garcia-Alvarado, C. PCA for large data sets with parallel data summarization. Distrib Parallel Databases 32, 377–403 (2014). https://doi.org/10.1007/s10619-013-7134-6

Download citation

Published: 10 September 2013
Issue Date: September 2014
DOI: https://doi.org/10.1007/s10619-013-7134-6

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Institutional subscriptions

PCA for large data sets with parallel data summarization

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Scalable Machine Learning on Popular Analytic Languages with Parallel Data Summarization

Scalable Machine Learning in the R Language Using a Summarization Matrix

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

PCA for large data sets with parallel data summarization

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Scalable Machine Learning on Popular Analytic Languages with Parallel Data Summarization

Scalable Machine Learning in the R Language Using a Summarization Matrix

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation