Abstract
Nowadays, with the advance of technology, many applications generate huge amounts of data streams at very high speed. Examples include network traffic, web click streams, video surveillance, and sensor networks. Data stream mining has become a hot research topic. Its goal is to extract hidden knowledge/patterns from continuous data streams. Unlike traditional data mining where the dataset is static and can be repeatedly read many times, data stream mining algorithms face many challenges and have to satisfy constraints such as bounded memory, single-pass, real-time response, and concept-drift detection. This paper presents a comprehensive survey of the state-of-the-art data stream mining algorithms with a focus on clustering and classification because of their ubiquitous usage. It identifies mining constraints, proposes a general model for data stream mining, and depicts the relationship between traditional data mining and data stream mining. Furthermore, it analyzes the advantages as well as limitations of data stream algorithms and suggests potential areas for future research.








Similar content being viewed by others
References
Adomavicius G, Bockstedt J (2008) C-trend: temporal cluster graphs for identifying and visualizing trends in multiattribute transactional data. IEEE Trans Knowl Data Eng 20:721–735
Aggarwal CC (2009) Data streams: an overview and scientific applications. Springer, Berlin, pp 377–397
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, vol 29, pp 81–92. VLDB Endowment
Aggarwal CC, Han J, Wang J, Yu PS (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the 13th international conference on very large data bases, vol 30, pp 852–863, 1316763. VLDB Endowment
Aggarwal CC, Han J, Wang J, Yu PS (2006) A framework for on-demand classification of evolving data streams. IEEE Trans Knowl Data Eng 18(5):577–589
Aggarwal CC, Philip SY (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196
Aggarwal CC, Subbian K (2012) Event detection in social streams. In: Proceedings of the SIAM international conference on data mining, pp 624–635
Aggarwal CC, Yu P (2005) Online analysis of community evolution in data streams. In: Proceedings of the SIAM international conference on data mining, pp 56–67
Aksoy C, Can F, Kocberber S (2012) Novelty detection for topic tracking. J Am Soc Inf Sci Technol 63(4):777–795
Alahakoon D, Halgamuge SK, Srinivasan B (2000) Dynamic self-organizing maps with controlled growth for knowledge discovery. IEEE Trans Neural Netw 11(3):601–614
Alexander H, Er H, Daniel AK (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceeding of the 1998 international conference knowledge discovery and data mining, pp 58–65
Allan J, Papka R, Lavrenko V (1998) On-line new event detection and tracking. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 37–45. ACM
Amigo E, Gonzalo J, Artiles J (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12(4):461–486
Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD international conference on management of data, vol 28, pp 49–60, 304187. ACM
Beckmann N, Kriegel H-P, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles, vol 19. ACM
Bifet A (2010) Adaptive stream mining: pattern learning and mining from evolving data streams. In: Proceeding of the 2010 conference on adaptive stream mining: pattern learning and mining from evolving data streams. IOS Press
Bifet A, Holmes G, Kirkby R (2012) Moa: massive online analysis. J Mach Learn Res 11:1601–1604
Bohm C, Kailing K, Kriegel H-P, Kroger P (2004) Density connected clustering with local subspace preferences. In: Proceedings of the 4th IEEE international conference on data mining, pp 27–34, 1033433. IEEE Computer Society
Can F (1993) Incremental clustering for dynamic information processing. ACM Trans Inf Syst (TOIS) 11(2):143–164
Can F, Kocberber S, Baglioglu O, Kardas S, Ocalan HC, Uyar E (2010) New event detection and topic tracking in turkish. J Am Soc Inf Sci Technol 61(4):802–819
Can F, Ozkarahan EA (1990) Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Trans Database Syst (TODS) 15(4):483–517
Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining, pp 328–339
Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 554–560. ACM
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 133–142, 1281210. ACM
Chi Y, Song X, Zhou D, Hino K, Tseng B (2007) Evolutionary spectral clustering by incorporating temporal smoothness. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 153–162. ACM
Chi Y, Song X, Zhou D, Hino K, Tseng B (2009) On evolutionary spectral clustering. ACM Trans Knowl Discov Data 3(4):1–30
Chien S, Immorlica N (2005) Semantic similarity between search engine queries using temporal correlation. In: Proceedings of the 14th international conference on World Wide Web, pp 2–11, 1060752. ACM
Chow TWS, Sitao W (2004) An online cellular probabilistic self-organizing map for static and dynamic data sets. IEEE Trans Circuits Syst I Regul Pap 51(4):732–747
Chris G, Jiaweihan Y, Jianpei Z, Xifeng Yan Y, Philip SY (2003) Mining frequent patterns in data streams at multiple time granularities. Next Gener Data Min 212:191–212
Ciaccia P, Patella M, Zezula P (1997) M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd international conference on very large data bases, pP 426–435, 671005. Morgan Kaufmann Publishers Inc
Cui W, Liu S, Tan L, Shi C, Song Y, Gao Z, Qu H, Tong X (2011) Textflow: Towards better understanding of evolving topics in text. IEEE Trans Vis Comput Graph 17(12):2412–2421
Dang X, Lee V, Ng W, Ciptadi A, Ong K (2009) An EM-based algorithm for clustering data streams in sliding windows, vol 5463 of Lecture notes in computer science, pp 230–235. Springer, Berlin
de Leeuw J (2005) Applications of convex analysis to multidimensional scaling. North Holland Publishing Company, Amsterdam, pp 133–146
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38
Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, pp 71–80, 347107. ACM
Dork M, Gruen D, Williamson C, Carpendale S (2010) A visual backchannel for large-scale events. IEEE Trans Vis Comput Graph 16(6):1129–1138
Fung BC, Wang K, Yu PS (2005) Top-down specialization for information and privacy preservation. In: 21st international conference on data engineering, 2005. ICDE 2005. Proceedings, pp 205–216. IEEE
Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Rec 34(2):18–26
Gama J (2013) Data stream mining: the bounded rationality. Informatica (Slovenia) 37(1):21–25
Gama J, Gaber MM (2007) Learning from data streams—processing techniques in sensor networks. Springer, Berlin
Gama J, Pinto C (2006) Discretization from data streams: applications to histograms and data mining. In: Proceedings of the 2006 ACM symposium on applied computing, pp 662–667. ACM
Gama J, Rodrigues P (2009) An overview on mining data streams, vol 206 of Studies in computational intelligence, pp 29–45. Springer, Berlin
Gama J, Sebastião R, Rodrigues P (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346
Guha S, Kim C, Shim K (2004) Xwave: optimal and approximate extended wavelets. In: Proceedings of the thirtieth international conference on very large data bases, vol 30, pp 288–299, 1316716. VLDB Endowment
Guha S, Meyerson A, Mishra N, Motwani R, O’Callaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng 15:515–528
Guha S, Mishra N, Motwani R, O’Callaghan L (2000) Clustering data streams. In: Proceedings of the 41st annual symposium on foundations of computer science, pp 359–366
Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data, pp 73–84, 276312. ACM
Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: Proceedings of the 15th international conference on data engineering, p 512, 847264. IEEE Computer Society
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 47–57
Hahsler M, Dunham M (2011) Temporal structure learning for clustering massive data streams in real-time. In: SIAM conference on data mining, pp 664–675. SIAM
Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco
He Q, Chang K, Lim E-P, Zhang J (2007) Bursty feature representation for clustering text streams. In: SIAM international conference on data mining
Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 97–106, 502529. ACM
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv (CSUR) 31(3):264–323
Karypis G, Eui-Hong H, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8):68–75
Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis, vol 39. Wiley, New York
Kelly MG, Hand DJ, Adams NM (1999) The impact of changing populations on classifier performance. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp 367–371, 312285. ACM
Kleinberg J (2003) Bursty and hierarchical structure in streams. Data Min Knowl Discov 7(4):373–397
Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480
Kranen P, Assent I, Baldauf C, Seidl T (2011) The clustree: indexing micro-clusters for anytime stream mining. Knowl Inf Syst 29(2):249–272
Kranen P, Günnemann S, Fries S, Seidl T (2010) MC-tree: improving Bayesian anytime classification, vol 6187 of Lecture notes in computer science, pp 252–269. Springer, Berlin
Kremer H, Kranen P, Jansen T, Seidl T, Bifet A, Holmes G, Pfahringer B (2011) An effective evaluation measure for clustering on evolving data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 868–876, 2020555. ACM
Kriegel H-P, Kröger P, Ntoutsi I, Zimek A (2011) Density based subspace clustering over dynamic data, vol 6809 of Lecture notes in computer science, pp 387–404. Springer, Berlin
Lam W, Meng H, Wong K, Yen J (2001) Using contextual analysis for news event detection. Int J Intell Syst 16(4):525–546
Last M (2002) Online classification of nonstationary data streams. Intell Data Anal 6(2):129–147
Lei Y, Huan L (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In: The 20th international conference on machine learning, pp 856–863
Leite D, Costa P, Gomide F (2010) Evolving granular neural network for semi-supervised data stream classification. In: The 2010 international joint conference on neural networks (IJCNN), pp 1–8. IEEE
Leite DF, Costa P, Gomide F (2009) Evolving granular classification neural networks. In: International joint conference on neural networks, 2009. IJCNN 2009, pp 1736–1743
Lühr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl Eng 68(1):1–27
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Martin E, Hans-Peter K, Jörg S, Xiaowei X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, pp 226–231. AAAI Press
Muthukrishnan SM (2003) Data streams: algorithms and applications. Now Publishers Inc
Narasimhamurthy A, Kuncheva LI (2007) A framework for generating data to simulate changing environments. In: Proceedings of the 25th IASTED international multi-conference: artificial intelligence and applications, vol 549, p 389. ACTA Press
Nguyen H-L, Woon Y-K, Ng W-K, Wan L (2012) Heterogeneous ensemble for feature drifts in data streams. In: Proceedings of the 16th Pacific-Asia conference on advances in knowledge discovery and data mining, vol 2, pp 1–12, 2342648. Springer
O’Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th international conference on data engineering, pp 685–694
Oza NC (2005) Online bagging and boosting. In: 2005 IEEE international conference on systems, man and cybernetics, vol 3, pp 2340–2345. IEEE
Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Rec 33(1):32–37
Park NH, Lee WS (2007) Cell trees: an adaptive synopsis structure for clustering multi-dimensional on-line data streams. Data Knowl Eng 63(2):528–549
Rai P, Daum H, Venkatasubramanian S (2009) Streamed learning: one-pass svms. In: Proceedings of the 21st international jont conference on artificial intelligence, pp 1211–1216, 1661639. Morgan Kaufmann Publishers Inc
Rodrigues P, Gama J, Pedroso J (2006) Odac: hierarchical clustering of time series data streams. In: SIAM international conference on data mining, Bethesda, Maryland, USA, pp 499–503
Sakaki T, Okazaki M, Matsuo Y (2010) Earthquake shakes twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World wide web, pp 851–860. ACM
Sattar H, Ying Y, Zahra M, Mohammadreza K (2009) Adapted one-vs-all decision trees for data stream classification. IEEE Trans Knowl Data Eng 21:624–637
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv (CSUR) 34(1):1–47
Seidl T, Assent I, Kranen P, Krieger R, Herrmann J (2009) Indexing density models for incremental learning and anytime classification on data streams. In: Proceedings of the 12th international conference on extending database technology: advances in database technology, pp 311–322, 1516397. ACM
Sheikholeslami G, Chatterjee S, Zhang A (2000) Wavecluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J 8(3–4):289–304
Silva JA, Faria ER, Barros RC, Hruschka ER, Carvalho ACd, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv (CSUR) 46(1):13
Smith T, Alahakoon D (2009) Growing self-organizing map for online continuous clustering, vol 204 of Studies in computational intelligence, pp 49–83. Springer, Berlin
Sow D, Biem A, Blount M, Ebling M, Verscheure O (2010) Body sensor data processing using stream computing. In: Proceedings of the international conference on multimedia information retrieval, pp 449–458, 1743465. ACM
Spiliopoulou M, Ntoutsi I, Theodoridis Y, Schult R (2006) Monic: modeling and monitoring cluster transitions. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 706–711. ACM
Street WN, Kim Y (2001) A streaming ensemble algorithm (sea) for large-scale classification. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 377–382, 502568. ACM
Sun J, Faloutsos C, Papadimitriou S, Yu PS (2007) Graphscope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, pp 687–696. ACM
Tasoulis DK, Ross G, Adams NM (2007) Visualising the cluster structure of data streams. In: Proceedings of the 7th international conference on intelligent data analysis, pp 81–92, 1771633. Springer
Teh Y, Jordan M, Beal M, Blei D (2006) Hierarchical dirichlet processes. J Am Stat Assoc 101(476):1566–1581
Tsang IW, Kocsor A, Kwok JT (2007) Simpler core vector machines with enclosing balls. In: Proceedings of the 24th international conference on machine learning, pp 911–918, 1273611. ACM
Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering, vol 4632 of Lecture notes in computer science, pp 605–615. Springer, Berlin
van Rijsbergen C (1979) Information retrieval, 2nd edn. Butterworths, Massachusetts
Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data (TKDD) 3(3):1–28
Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 226–235, 956778. ACM
Wang H, Yu PS, Han J (2005) Mining data streams. Springer, US, pp 777–792
Wang W, Yang J, Muntz R (1997) Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the international conference on very large data bases, pp 186–195
Yang Y, Pierce T, Carbonell J (1998) A study of retrospective and on-line event detection. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, pp 28–36. ACM
Yang Y, Zhang J, Carbonell J, Jin C (2002) Topic-conditioned novelty detection. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, pp 688–693. ACM
Zeng H-J, He Q-C, Chen Z, Ma W-Y, Ma J (2004) Learning to cluster web search results. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, pp 210–217, 1009030. ACM
Zhang P, Gao B, Zhu X, Guo L (2011) Enabling fast lazy learning for data streams. In: Proceedings of the 11th IEEE international conference on data mining (ICDM-11), pp 932–941. IEEE
Zhang P, Li J, Wang P, Gao BJ, Zhu X, Guo L (2011) Enabling fast prediction for ensemble models on data streams. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 177–185, 2020442. ACM
Zhang P, Zhu X, Shi Y, Guo L, Wu X (2011) Robust ensemble learning for mining noisy data streams. Decis Support Syst 50(2):469–479
Zhang P, Zhu X, Shi Y, Wu X (2009) An aggregate ensemble for mining concept drifting data streams with noise, vol 5476 of Lecture notes in computer science, pp 1021–1029. Springer, Berlin
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, pp 103–114, 233324. ACM
Zhong S (2005) Efficient streaming text clustering. Neural Netw 18(5):790–798
Zhou A, Cao F, Qian W, Jin C (2008) Tracking clusters in evolving data streams over sliding windows. Knowl Inf Syst 15(2):181–214
Acknowledgments
This work is partially supported by the Wattalyzer grant project of I2R, Singapore with Grant reference: NRF2012EWT-EIRP002-044.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Nguyen, HL., Woon, YK. & Ng, WK. A survey on data stream clustering and classification. Knowl Inf Syst 45, 535–569 (2015). https://doi.org/10.1007/s10115-014-0808-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-014-0808-1