Abstract
Extraction of sequences of events from news and other documents based on the publication times of these documents has been shown to be extremely effective in tracking past events. This paper addresses the issue of constructing an optimal information preserving decomposition of the time period associated with a given document set, i.e., a decomposition with the smallest number of subintervals, subject to no loss of information. We introduce the notion of the compressed interval decomposition, where each subinterval consists of consecutive time points having identical information content. We define optimality, and show that any optimal information preserving decomposition of the time period is a refinement of the compressed interval decomposition. We define several special classes of measure functions (functions that measure the prevalence of keywords in the document set and assign them numeric values), based on their effect on the information computed as document sets are combined. We give algorithms, appropriate for different classes of measure functions, for computing an optimal information preserving decomposition of a given document set. We studied the effectiveness of these algorithms by computing several compressed interval and information preserving decompositions for a subset of the Reuters–21578 document set. The experiments support the obvious conclusion that the temporal information gleaned from a document set is strongly dependent on the measure function used and on other user-defined parameters.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
Available at http://kdd.ics.uci.edu.
Each \( \Pi_C^{w_p} \) is a partition of T D into subintervals. The product of a set of partitions is the coarsest partition that is a refinement of each of the individual partitions, and is unique (Rosen, 2003).
Available at http://ldc.upenn.edu.
References
Agrawal, R., Psaila, G., Wimmers, E., and Zait, M. 1995. Querying shapes of histories. In Proc. of 21st International Conference on Very Large Databases.
Allan, J., Carbonell, J.G., Doddington, G., Yamron, J., and Yang, Y. 1998. Topic detection and tracking pilot study: Final report. In Proc. DARPA Broadcast News Transcription and Understanding Workshop.
Allan, J., Lavrenko, V., Malin, D., and Swan, R. 2000. Detections, bounds, and timelines: UMass and TDT-3. In Proc. 3rd Topic Detection and Tracking Workshop.
Berndt. D.J. and Clifford, J. 1996. Finding patterns in time Series: A dynamic programming approach. In U. M. Fayyad et al. Advances in Knowledge Discovery and Data Mining, AAAI Press.
Chundi, P. and Rosenkrantz, D.J. 2004. On lossy time decompositions of time stamped documents. In Proc. of the 2004 ACM International Conference on Information and Knowledge Management, pp. 437–445.
Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C. 2001. Introduction to Algorithms, 2nd Edition. MIT Press.
Dougherty, J. Kohavi, R., and Sahami, M. 1995. Supervised and unsupervised discretization of continuous features. In Proc. of the 12th International Conference on Machine Learning, pp. 194–202.
Kleinberg, J. 2002. Bursty and hierarchical structure in streams. In Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp 91–101.
Lent, B., Agrawal, R., and Srikant, R. 1997. Discovering trends in text databases. In Proc. 3rd Intl. Conf. Knowledge Discovery and Data Mining (KDD), pp. 227–230.
Motoyoshi, M., Miura, T., and Watanabe, K. 2002. Mining temporal classes from time series data. In Proc. 11th Intl. Conf. on Information and Knowledge Management, pp. 493–498.
Rosen, K. 2003. Discrete mathematics and its applications. 5th edition. McGraw Hill College Division.
Roy, S., Gevry, D., and Pottenger, W. M. 2002. Methodologies for trend detection in textual data mining. In Proc. Textmine 02 Workshop, SIAM Intl. Conf. on Data Mining.
Chakravarti, S., Sarawagi, S., and Dom, B. 1998. Mining surprising patterns using temporal description length, In Proc. of the 24th Int'l Conference on Very Large Databases (VLDB).
Swan, R. and Allan, J. 1999. Extracting significant time varying features from text. In Proc. 8th Intl. Conf. on Information and Knowledge Management, pp. 38–45.
Swan, R. and Allan, J. 2000. Automatic generation of overview timelines. In Proc. 23rd Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp.49–56.
Swan, R. and Jensen, D. 2000. Timemines: Constructing timelines with statistical models of word usage. In Proc. KDD 2000 Workshop on Text Mining.
Acknowledgments
We thank the anonymous referees for their detailed and helpful comments which greatly improved the readability of the paper.
This publication was made possible by NIH grant number P20 RR16469 from the INBRE program of the National Center for Research Resources. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIH.
Author information
Authors and Affiliations
Corresponding author
Additional information
A preliminary version of this paper was presented at the 2004 SIAM International Conference on Data Mining.
†Research supported by NSF Grant CCR–0105536.
Rights and permissions
About this article
Cite this article
Chundi, P., Rosenkrantz, D.J. Information Preserving Time Decompositions of Time Stamped Documents* . Data Min Knowl Disc 13, 41–65 (2006). https://doi.org/10.1007/s10618-005-0035-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-005-0035-1