Abstract
Fueled by advances in computer technology and online business, data collection is rapidly accelerating, as well as the importance of its analysis — data mining. Increasing database sizes strain the scalability of many data mining algorithms. Data clustering is one of the fundamental techniques in data mining solutions. The many clustering algorithms developed face new challenges with growing data sets. Algorithms with quadratic or higher computational complexity, such as agglomerative algorithms, drop out quickly. More efficient algorithms, such as K-Means EM with linear cost per iteration, still need work to scale up to large data sets. This paper shows that many parameter estimation algorithms, including K-Means, K-Harmonic Means and EM, can be recast without approximation in terms of Sufficient Statistics, yielding an superior speed-up efficiency. Estimates using today’s workstations and local area network technology suggest efficient speed-up to several hundred computers, leading to effective scale-up for clustering hundreds of gigabytes of data. Implementation of parallel clustering has been done in a parallel programming language, ZPL. Experimental results show above 90% utilization.
Chapter PDF
Similar content being viewed by others
Keywords
References
Apostolakis, J., Coddington, P., and Marinari, E. “New SIMD algorithms for cluster labeling on parallel computers,” Intl J. of Modern Physics C, 4(4):749–763, Aug. 1993.
Baillie, C.F., Coddington, P.D. “Cluster identification algorithms for spin models,” Concurrency:Practice and Experience,3(2):129–144, April 1991. Caltech Report C3P-855.
Bradley, P., Fayyad, U. M., and Reina, C.A., “Scaling EM Clustering to Large Databases,” MS Technical Report, 1998.
Bradley, P., Fayyad, U. M., C.A., “Refining Initial Points for KM Clustering”, MS Technical Report MSR-TR-98-36, May 1998.
Dempster, A. P., Laird, N.M., and Rubin, D.B., “Miximum Likelyhood from Incomplete Data via the EM Algorithm,” J. of the Royal Stat. Soc., Series B, 39(1):1–38, 1977.
T. J. Fountain, “Parallel Computing: Principles and Practice,” Cambridge Univ Pr 1994.
Forman, G., Zhang, B., “Linear speedup for a parallel non-approximate recasting of centerbased clustering algorithms, including K-Means, K-Harmonic Means, and EM”, To appear in DPKD 2000, Boston.
Gersho & Gray, “Vector Quantization and Signal Compression,” KAP, 1992
Vipin Kumar, “troduction to Parallel Computing,” Addison-Wesley, 1994.
Sanpawat Kantabutra and Alva L. Couch, “Parallel K-means Clustering Algorithm on NOWs,” NECTEC Technical journal (http://www.nectec.or.th)
MacQueen, J., “Some methods for classification and analysis of multivariate observations,” Pp. 281–297, Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, Vol. 1. University of California Press, Berkeley. xvii + 666 p, 1967.
McKenzie, P. and Alder, M., “Initializing the EM Algorithm for Use in Gaussian Mixture Modeling,” The Univ. of W. Australia, Center for Information Processing Systems, Manuscript.
McLachlan, G., Krishnan, T., “The EM algorithm and extensions,” John Wiley & Sons.
P. Resnick and H. Varian, “Recommender Systems,” in CACM March 1997.
Snyder, L., “A Programmer’s Guide to ZPL (Scientific and Engineering Computation Series)”, MIT Press; ISBN: 0262692171, 1999.
Selim, S.Z. and Ismail, M.A., “K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality,” IEEE Trans. On PAMI-6, #1, 1984.
Zhang, B., Hsu, M., Dayal, U., “K-Harmonic Means-A Data Clustering Algorithm,” Hewllet-Packard Research Laboratory Technical Report HPL-1999-124.
Zhang, T., Ramakrishnan, R., and Livny, M., “BIRCH: an efficient data clustering method for very large databases,” ACM SIGMOD Record, Vol. 25, No. 2 (June 1996), Pages 103-114 in: SIGMOD ’96.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, B., Hsu, M., Forman, G. (2000). Accurate Recasting of Parameter Estimation Algorithms Using Sufficient Statistics for Efficient Parallel Speed-Up. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science(), vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_24
Download citation
DOI: https://doi.org/10.1007/3-540-45372-5_24
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive