Abstract
Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. We present a method that can semi-automatically enhance a wide class of existing learning algorithms so they can learn from such high-speed data streams in real time. The method works by sampling just enough data from the data stream to make each decision required by the learning process. The method is applicable to essentially any induction algorithm based on discrete search. In this chapter, we illustrate the use of our method by applying it to what is perhaps the most widely used form of data mining: decision tree induction.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees (Wadsworth, Belmont, 1984)
J. Catlett, Megainduction: machine learning on very large databases. PhD thesis, Basser Department of Computer Science, University of Sydney, Sydney, Australia (1991)
C. Domingo, R. Gavalda, O. Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Min. Knowl. Discov. 6, 131–152 (2002)
P. Domingos, G. Hulten, A general method for scaling up machine learning algorithms and its application to clustering, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 106–113
P. Domingos, G. Hulten, Learning from infinite data in finite time, in Advances in Neural Information Processing Systems, vol. 14, ed. by T.G. Dietterich, S. Becker, Z. Ghahramani (MIT Press, Cambridge, 2002), pp. 673–680
Y. Freund, Self bounding learning algorithms, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI (Morgan Kaufmann, San Mateo, 1998)
V. Ganti, J. Gehrke, R. Ramakrishnan, DEMON: mining and monitoring evolving data, in Proceedings of the Sixteenth International Conference on Data Engineering, San Diego, CA (2000), pp. 439–448
J. Gehrke, V. Ganti, R. Ramakrishnan, W.-L. Loh, BOAT: optimistic decision tree construction, in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, PA (ACM, New York, 1999), pp. 169–180
J. Gratch, Sequential inductive learning, in Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR (AAAI Press, Menlo Park, 1996), pp. 779–786
R. Greiner, PALO: a probabilistic hill-climbing algorithm. Artif. Intell. 84, 177–208 (1996)
W. Hoeffding, Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)
G. Hulten, P. Domingos, Mining complex models from arbitrarily large databases in constant time, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada (ACM Press, New York, 2002), pp. 525–531
G. Hulten, P. Domingos, VFML—a toolkit for mining high-speed time-changing data streams (2003). http://www.cs.washington.edu/dm/vfml/
G. Hulten, P. Domingos, Y. Abe, Mining massive relational databases, in IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, Acapulco, Mexico (2003)
G. Hulten, L. Spencer, P. Domingos, Mining time-changing data streams, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA (ACM, New York, 2001), pp. 97–106
O. Maron, A. Moore, Hoeffding races: accelerating model selection search for classification and function approximation, in Advances in Neural Information Processing Systems 6, ed. by J.D. Cowan, G. Tesauro, J. Alspector (Morgan Kaufmann, San Mateo, 1994)
M. Mehta, A. Agrawal, J. Rissanen, SLIQ: a fast scalable classifier for data mining, in Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France (Springer, Berlin, 1996), pp. 18–32
R. Musick, J. Catlett, S. Russell, Decision theoretic subsampling for induction on large databases, in Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA (Morgan Kaufmann, San Mateo, 1993), pp. 212–219
J.R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, 1993)
T. Scheffer, S. Wrobel, Incremental maximization of non-instance-averaging utility functions with applications to knowledge discovery problems, in Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA (Morgan Kaufmann, San Mateo, 2001), pp. 481–488
J.C. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India (Morgan Kaufmann, San Mateo, 1996), pp. 544–555
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies. Technical report NC-TR-96-053, Department of Computer Science, Royal Holloway, University of London, Egham, UK (1996)
P.E. Utgoff, An improved algorithm for incremental induction of decision trees, in Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ (Morgan Kaufmann, San Mateo, 1994), pp. 318–325
A. Wald, Sequential Analysis (Wiley, New York, 1947)
A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, H. Levy, Organization-based analysis of Web-object sharing and caching, in Proceedings of the Second USENIX Conference on Internet Technologies and Systems, Boulder, CO (1999), pp. 25–36
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Hulten, G., Domingos, P. (2016). Mining Decision Trees from Streams. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds) Data Stream Management. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28608-0_9
Download citation
DOI: https://doi.org/10.1007/978-3-540-28608-0_9
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28607-3
Online ISBN: 978-3-540-28608-0
eBook Packages: Computer ScienceComputer Science (R0)