Abstract
We propose a mining framework that identifies useful patterns through incremental data clustering. Given the popularity of Web news services, we focus on news stream mining. News articles are retrieved from Web news services and processed by data mining tools to produce higher-level knowledge, which is stored in a content description database. Rather than interacting with a Web news service directly, an information delivery agent can answer a user request by exploiting the knowledge in this database. A key challenge in news repository management is the high rate of document insertion. To address this problem, we present an incremental hierarchical document clustering algorithm based on a neighborhood search. The novelty of the proposed algorithm is its ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally. In addition, to overcome the lack of topical relations in conceptual ontologies, we propose a topic ontology learning framework that uses the resulting document hierarchy. Experimental results demonstrate that the proposed clustering algorithm produces high-quality clusters and that the topic ontology provides interpretations of news topics at different levels of abstraction.
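The general idea of maintaining clusters incrementally through a neighborhood search can be illustrated with a minimal sketch. This is not the chapter's algorithm: it is flat rather than hierarchical, and the class name `IncrementalClusterer`, the bag-of-words representation, the cosine similarity measure, and the `threshold` value are all assumptions chosen for illustration. Each arriving document is compared only against existing documents, and cluster labels are updated in place instead of reclustering the whole repository.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (dict-like)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical flat clusterer that updates labels as documents arrive."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off for neighbors
        self.vectors = []           # one term-frequency vector per document
        self.labels = []            # cluster id per document
        self._next_id = 0

    def insert(self, text):
        vec = Counter(text.lower().split())
        # Neighborhood search: a linear scan here; a real system would
        # likely use an index to avoid comparing against every document.
        neighbors = [
            i for i, other in enumerate(self.vectors)
            if cosine(vec, other) >= self.threshold
        ]
        if not neighbors:
            # No sufficiently similar document: start a new cluster.
            label = self._next_id
            self._next_id += 1
        else:
            # Merge every cluster touched by the neighborhood into one.
            touched = {self.labels[i] for i in neighbors}
            label = min(touched)
            self.labels = [label if l in touched else l for l in self.labels]
        self.vectors.append(vec)
        self.labels.append(label)
        return label


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.insert("earthquake hits city rescue teams deployed")
second = clusterer.insert("rescue teams search city after earthquake")
third = clusterer.insert("stock market rallies on tech earnings")
```

In this usage, the first two articles share four of their six terms and fall into the same cluster, while the third shares none and starts a new one. Only the clusters touched by the new document's neighborhood are modified on each insertion.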
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chung, S., McLeod, D. (2005). Dynamic Pattern Mining: An Incremental Data Clustering Approach. In: Spaccapietra, S., et al. Journal on Data Semantics II. Lecture Notes in Computer Science, vol 3360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30567-5_4
DOI: https://doi.org/10.1007/978-3-540-30567-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24208-6
Online ISBN: 978-3-540-30567-5
eBook Packages: Computer Science, Computer Science (R0)