Abstract
Mining streaming data has been an active research area to address requirements of applications, such as financial marketing, telecommunication, network monitoring, and so on. A popular technique for mining these continuous and fast-arriving data streams is decision trees. The accuracy of decision trees can deteriorate if the distribution of values in the stream changes over time. In this paper, we propose an approach based on decision trees that can detect distribution changes and re-align the decision tree quickly to reflect the change. The technique exploits a set of synopses on the leaf nodes, which are also used to prune the decision tree. Experimental results demonstrate that the proposed approach can detect the distribution changes in real-time with high accuracy, and re-aligning a decision tree can improve its performance in clustering the subsequent data stream tuples.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Aggarwal, C.: A framework for diagnosing changes in evolving data streams. In: Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 575–586 (2003)
Aggarwal, C., Han, J., Wang, J., Yu, P.: A framework for clustering evolving data streams. In: Proc. 29th Int. Conf. on Very Large Data Bases, pp. 81–92 (2003)
Chakravarti, I., Laha, R., Roy, J.: Handbook of Methods of Applied Statistics. John Wiley and Sons, Chichester (1967)
Charikar, M., Chen, K., Motwani, R.: Incremental clustering and dynamic information retrieval. In: Proc. ACM Symp. on Theory of Computing, pp. 626–635 (1997)
Charikar, M., O’Callaghan, L., Panigrahy, R.: Better streaming algorithms for clustering problems. In: Proc. ACM Symp. on Theory of Computing, pp. 30–39 (2003)
Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 71–80 (2000)
Fan, W., Huang, Y., Yu, P.: Decision tree evolution using limited number of labeled data items from drifting data streams. In: Proc. 2004 IEEE Int. Conf. on Data Mining, pp. 379–382 (2004)
Fredman, M.: Two applications of a probabilistic search technique: Sorting x + y and building balanced search tree. In: Proc. ACM Symp. on Theory of Computing, pp. 240–244 (1975)
Gaber, M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: A review. ACM SIGMOD Record 34(2), 18–26 (2005)
Gama, J., Medas, P., Rodrigues, P.: Learning decision trees from dynamic data streams. In: Proc. 2005 ACM Symp. on Applied Computing, pp. 573–577 (2005)
Gama, J., Rocha, R., Medas, P.: Accurate decision tree for mining high-speed data streams. In: Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 523–528 (2003)
Guha, S., Meyerson, A., Mishra, N., Motwani, R.: Clustering data streams: Theory and practice. IEEE Trans. Knowledge and Data Eng. 15(3), 515–528 (2003)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 18–30 (1963)
Hulten, G., Spencer, L., Domingos, P.: Mining time-chaning data streams. In: Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 97–106 (2001)
Jin, R., Aggrawal, G.: Efficient decision tree constructions on streaming data. In: Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 571–576 (2003)
Kaufman, L., Rousseeuw, P.: Finding groups in data: An introduction to cluster analysis. Addison-Wesley, Reading (1990)
Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Proc. 30th Int. Conf. on Very Large Data Bases, pp. 180–191 (2004)
Knuth, D.: Optimum binary search trees. Acta Informatica 1, 14–25 (1971)
Knuth, D.: The art of computer programming 3: Sorting and searching. Addison-Wesley, Reading (1973)
Babock, B., et al.: Models and issues in data stream systems. In: Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, pp. 1–16 (2002)
Abadi, D., et al.: The design of the borealis stream processing engine. In: Proc. 2nd Biennial Conf. on Innovative Data Systems Research (2005)
Li, J., et al.: Semantics and evaluation techniques for window aggregates in data streams. In: Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 311–322 (2005)
Chen, M., et al.: Path-based failure and evolution management. In: 1st Symposium on Network Systems Design and Implementation, pp. 309–322 (2004)
Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble classifiers. In: Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 226–235 (2003)
Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1, 80–83 (1945)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Tao, Y., Özsu, M.T. (2010). Efficient Decision Tree Re-alignment for Clustering Time-Changing Data Streams. In: Sachs, K., Petrov, I., Guerrero, P. (eds) From Active Data Management to Event-Based Systems and More. Lecture Notes in Computer Science, vol 6462. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17226-7_2
Download citation
DOI: https://doi.org/10.1007/978-3-642-17226-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17225-0
Online ISBN: 978-3-642-17226-7
eBook Packages: Computer ScienceComputer Science (R0)