Abstract
Clustering algorithms group similar data points in a dataset in an unsupervised manner. Batch clustering algorithms achieve relatively high accuracy, but they cannot efficiently reuse earlier clustering results as the data changes. Because they require the whole dataset for every computation, they waste resources and incur a high time cost. In contrast, incremental clustering updates only the affected part of a model when new data arrive, so the whole dataset need not be reclustered each time. This property suits streaming data processing, but it reduces the accuracy of the algorithms and cannot satisfy the low-latency requirement of real-time data processing. To address this problem, this paper proposes a novel unified batch and streaming clustering model (UBSCM) based on streaming computation, which includes a streaming cluster feature updating mechanism (SCFUM). The Flink framework is used to implement a new streaming KMeans algorithm based on UBSCM (KMeansUBSP). Experiments on real-world datasets show that the new streaming KMeans algorithm clusters batch and streaming data effectively in a unified manner.
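To make the idea of incremental updating concrete, the following is a minimal sketch of a generic incremental KMeans step in Python. It is not the paper's SCFUM or KMeansUBSP, whose details are not given in the abstract. It assumes each cluster is summarized by a point count and a per-dimension sum, so that a new point changes only the one cluster it joins. The names `ClusterFeature`, `absorb`, `nearest`, and `update` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Running summary of one cluster (hypothetical structure, not from the paper)."""
    count: int               # number of points absorbed so far
    linear_sum: List[float]  # per-dimension sum of those points

    def centroid(self) -> List[float]:
        # The centroid is recovered from the summary, not from stored points.
        return [s / self.count for s in self.linear_sum]

    def absorb(self, point: Sequence[float]) -> None:
        # Only this cluster's summary changes when a point arrives.
        self.count += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x


def nearest(features: List[ClusterFeature], point: Sequence[float]) -> int:
    """Return the index of the cluster whose centroid is closest to the point."""
    def sq_dist(cf: ClusterFeature) -> float:
        return sum((c - x) ** 2 for c, x in zip(cf.centroid(), point))
    return min(range(len(features)), key=lambda j: sq_dist(features[j]))


def update(features: List[ClusterFeature], point: Sequence[float]) -> int:
    """Assign one arriving point and update only the affected cluster."""
    j = nearest(features, point)
    features[j].absorb(point)
    return j
```

Under these assumptions, each arriving point costs one distance computation per cluster and one summary update, regardless of how many points have already been processed. A batch algorithm would instead revisit the full dataset.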
Acknowledgements
This work is partly supported by the National Natural Science Foundation of China under Grant No. 61672159, No. 61672158, No. 62002063 and No. 61300104, the Fujian Collaborative Innovation Center for Big Data Applications in Governments, the Fujian Industry-Academy Cooperation Project under Grant No. 2017H6008 and No. 2018H6010, the Natural Science Foundation of Fujian Province under Grant No. 2019J01835 and No. 2020J01230054, and the Haixi Government Big Data Application Cooperative Innovation Center.
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, H., Guo, K., Chen, Y., Guo, H. (2021). The Design and Implementation of KMeans Based on Unified Batch and Streaming Processing. In: Sun, Y., Liu, D., Liao, H., Fan, H., Gao, L. (eds) Computer Supported Cooperative Work and Social Computing. ChineseCSCW 2020. Communications in Computer and Information Science, vol 1330. Springer, Singapore. https://doi.org/10.1007/978-981-16-2540-4_47
DOI: https://doi.org/10.1007/978-981-16-2540-4_47
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-2539-8
Online ISBN: 978-981-16-2540-4
eBook Packages: Computer Science, Computer Science (R0)