Abstract
This paper studies the data stream clustering problem for text and categorical data domains. While clustering has recently been studied for numeric data streams, text and categorical data present different challenges because of the large, unordered nature of the corresponding attributes. We therefore propose algorithms for clustering text and categorical data streams. Our condensation-based approach summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct clusters for different input parameters, which provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets and show its effectiveness relative to the baseline OSKM algorithm for stream clustering.
Cite this article
Aggarwal, C.C., Yu, P.S. On clustering massive text and categorical data streams. Knowl Inf Syst 24, 171–196 (2010). https://doi.org/10.1007/s10115-009-0241-z
DOI: https://doi.org/10.1007/s10115-009-0241-z