Abstract
Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well known model for dimensionality reduction in statistical analysis, which requires a demanding number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm that can compute PCA in one pass on a large data set based on summarization matrices. We also study how to integrate our algorithm with a DBMS; our solution is based on a combination of parallel data set summarization via user-defined aggregations and calling the MKL parallel variant of the LAPACK library to solve Singular Value Decomposition (SVD) in RAM. Our algorithm is theoretically shown to achieve linear speedup, linear scalability on data size, quadratic time on dimensionality (but in RAM), spending most of the time on data set summarization, despite the fact that SVD has cubic time complexity on dimensionality. Experiments with large data sets on multicore CPUs show that our solution is much faster than the R statistical package as well as solving PCA with SQL queries. Benchmarking on multicore CPUs and a parallel DBMS running on multiple nodes confirms linear speedup and linear scalability.





Similar content being viewed by others
References
Blakeley, J.A., Rao, V., Kunen, I., Prout, A., Henaire, M., Kleinerman, C.: .NET database programmability and extensibility in Microsoft SQL server. In: Proc. ACM SIGMOD Conference, pp. 1087–1098 (2008)
Brown, G.P.: Overview of SciDB: large scale array storage, processing and analysis. In: Proc. ACM SIGMOD Conference (2010)
Buttari, A., Langou, J., Kurzak, J., Dongarra, J.: Parallel tiled QR factorization for multicore architectures. Concurr. Comput. 20(13), 1573–1590 (2008)
Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G.R., Ng, A.Y., Olukotun, K.: Map-reduce for machine learning on multicore. In: Proc. NIPS Conference, pp. 281–288 (2006)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Demmel, J.W.: Applied Numerical Linear Algebra, 1st edn. SIAM, Philadelphia (1997)
Dongarra, J., Duff, I.S., Sorensen, D.C., van der Vost, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998)
Eddelbuettel, D.: Benchmarking single-and multi-core BLAS implementations and GPUs for use with R. Mathematica (2010)
Garcia-Molina, H., Ullman, J.D., Widom, J.: Database Systems: The Complete Book, 1st edn. Prentice Hall, New York (2001)
Halko, N.H., Martinsson, P.-G., Shkolnisky, Y., Tygert, M.: An algorithm for the principal component analysis of large data sets. SIAM J. Sci. Comput. 33(5), 2580–2594 (2011)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning, 1st edn. Springer, New York (2001)
Hellerstein, J., Re, C., Schoppmann, F., Wang, D.Z., et al.: The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endow. 5(12), 1700–1711 (2012)
Kanth, K.V.R., Agrawal, D., Singh, A.: Dimensionality reduction for similarity searching in dynamic databases. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD ’98), pp. 166–176. ACM, New York (1998)
Krycha, K., Wexler, J.: Sas and teradata in-database model development using the business example of churn modeling. In: Proc. SAS Global Forum (2013). Paper id 087
Kyriacou, C., Evripidou, P., Trancoso, P.: Data-driven multithreading using conventional microprocessors. IEEE Trans. Parallel Distrib. Syst. 17(10), 1176–1188 (2006)
Lahabar, S., Narayanan, P.J.: Singular value decomposition on GPU using CUDA. In: Proc. of IEEE International Symposium on Parallel & Distributed Processing, pp. 1–10 (2009)
Luo, C., Thakkar, H., Wang, H., Zaniolo, C.: A native extension of SQL for mining data streams. In: Proc. ACM SIGMOD Conference, pp. 873–875 (2005)
Navas, M., Ordonez, C.: Efficient computation of PCA with SVD in SQL. In: KDD Workshop on Data Mining Using Tensors and Matrices (2009), Article No. 5
Ordonez, C.: Statistical model computation with UDFs. IEEE Trans. Knowl. Data Eng. 22(12), 1752–1765 (2010)
Ordonez, C., Cereghini, P.: SQLEM: fast clustering in SQL using the EM algorithm. In: Proc. ACM SIGMOD Conference, pp. 559–570 (2000)
Ordonez, C., Mohanam, N., Garcia-Alvarado, C., Tosic, P.T., Martinez, E.: Fast PCA computation in a DBMS with aggregate UDFs and LAPACK. In: Proc. ACM CIKM Conference (2012)
Ries, F., DeMarco, T., Guerrieri, R.: Triangular matrix inversion on heterogeneous multicore systems. IEEE Trans. Parallel Distrib. Syst. 23(1), 177–184 (2012)
Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)
Acknowledgements
This work was supported by NSF grant IIS 0914861.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Feifei Li and Suman Nath.
Rights and permissions
About this article
Cite this article
Ordonez, C., Mohanam, N. & Garcia-Alvarado, C. PCA for large data sets with parallel data summarization. Distrib Parallel Databases 32, 377–403 (2014). https://doi.org/10.1007/s10619-013-7134-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10619-013-7134-6