The Design and Implementation of KMeans Based on Unified Batch and Streaming Processing

  • Conference paper
Computer Supported Cooperative Work and Social Computing (ChineseCSCW 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1330)

Abstract

Clustering algorithms group similar data points in a dataset in an unsupervised manner. Batch clustering algorithms achieve relatively high accuracy, but they cannot use dynamic clustering results efficiently, because every computation requires the whole dataset, which wastes resources and incurs a high time cost. In contrast, incremental clustering updates only the part of a model that changes when new data arrive, so the whole dataset need not be reclustered each time. This property suits streaming data processing, but it reduces the accuracy of the algorithms and cannot satisfy the low-latency requirement of real-time data processing. To address this problem, the paper proposes a novel unified batch and streaming clustering model (UBSCM) based on streaming computation, which includes a streaming cluster feature updating mechanism (SCFUM). A new streaming KMeans algorithm based on UBSCM (KMeansUBSP) is implemented on the Flink framework. Experiments on real-world datasets show that the new streaming KMeans algorithm is effective in clustering batch and streaming data in a unified manner.
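
The incremental idea in the abstract, updating only the affected part of a model when a new point arrives, can be illustrated with a minimal Python sketch. It is not the paper's SCFUM or KMeansUBSP, whose details are not given on this page. It assumes a generic online k-means update in which each cluster is summarized by a point count and a per-dimension sum, and the names `ClusterFeature`, `absorb`, and `update` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Running summary of one cluster (assumed form: count plus per-dimension sum)."""
    count: int
    linear_sum: List[float]

    @property
    def centroid(self) -> List[float]:
        # The centroid is recovered from the summary; no raw points are stored.
        return [s / self.count for s in self.linear_sum]

    def absorb(self, point: Sequence[float]) -> None:
        # Fold one new point into the summary in O(d) time.
        self.count += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update(features: List[ClusterFeature], point: Sequence[float]) -> int:
    """Assign the point to its nearest centroid and update only that cluster.

    Returns the index of the cluster that absorbed the point.
    """
    nearest = min(
        range(len(features)),
        key=lambda i: squared_distance(features[i].centroid, point),
    )
    features[nearest].absorb(point)
    return nearest


# Two clusters, each seeded with a single point (illustrative data only).
features = [ClusterFeature(1, [0.0, 0.0]), ClusterFeature(1, [10.0, 10.0])]
for p in [(1.0, 1.0), (9.0, 11.0), (2.0, 0.0)]:
    update(features, p)
```

Each arriving point touches one cluster summary, so earlier data never need to be revisited. The cost is that an early assignment is never corrected, which is one way to see the accuracy loss the abstract attributes to incremental clustering.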

Acknowledgements

This work is partly supported by the National Natural Science Foundation of China under Grant No. 61672159, No. 61672158, No. 62002063 and No. 61300104, the Fujian Collaborative Innovation Center for Big Data Applications in Governments, the Fujian Industry-Academy Cooperation Project under Grant No. 2017H6008 and No. 2018H6010, the Natural Science Foundation of Fujian Province under Grant No. 2019J01835 and No. 2020J01230054, and the Haixi Government Big Data Application Cooperative Innovation Center.

Author information

Corresponding author

Correspondence to Yuzhong Chen.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chen, H., Guo, K., Chen, Y., Guo, H. (2021). The Design and Implementation of KMeans Based on Unified Batch and Streaming Processing. In: Sun, Y., Liu, D., Liao, H., Fan, H., Gao, L. (eds) Computer Supported Cooperative Work and Social Computing. ChineseCSCW 2020. Communications in Computer and Information Science, vol 1330. Springer, Singapore. https://doi.org/10.1007/978-981-16-2540-4_47

  • DOI: https://doi.org/10.1007/978-981-16-2540-4_47

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-2539-8

  • Online ISBN: 978-981-16-2540-4

  • eBook Packages: Computer Science, Computer Science (R0)
