Abstract
Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentiality allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series.
In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Agrawal R, Psaila G, Wimmers EL, Zait M (1995) Querying shapes of histories. In: Proceedings of the 21st International conference on very large databases, Zurich, Switzerland, Sept 11–15, pp 502–514
Andre-Jonsson H, Badal D (1997) Using signature files for querying time-series data (1997) In: Proceedings of principles of data mining and knowledge discovery, 1st European symposium, Trondheim, Norway. June 24–27, pp 211–220
Androulakis IP (2005) New approaches for representing, analyzing and visualizing complex kinetic mechanisms. In: Proceedings of the 15th European symposium on computer aided process engineering, Barcelona, Spain. May 29–June 1
Apostolico A, Bock ME, Lonardi S (2002) Monotony of surprise in large-scale quest for unusual words. In: Proceedings of the 6th International conference on research in computational molecular biology, Washington, DC, April 18–21, pp 22–31
Bagnall AJ, Janakec G (2004) Clustering time series from arma models with clipped data. In: Proceedings of the 10th ACM SIGKDD International conference on knowledge discovery and data mining, Seattle, WA. August 22–25, pp 49–58
Bakalov P, Hadjieleftherious M, Tsotras VJ (2005) Time relaxed spatiotemporal trajectory. In: Proceedings of the ACM international symposium on advances in geographic information systems, Bremen, Germany. November 4–5
Bastogne T, Noura H, Richard A, Hittinger JM (2002) Application of subspace methods to the identification of a winding process. In: Proceedings of the 4th European control conference, Brussels, Belgium
Berndt D, Clifford J (1994) Using dynamic time warping to find patterns in time series, The workshop on knowledge discovery in databases, the 12th International conference on artificial intelligence, Seattle, WA, pp 229–248
Celly B, Zordan VB (2004) Animated people textures. In: Proceedings of the 17th international conference on computer animation and social agents, Geneva, Switzerland, July 7–9
Chan K, Fu AW (1999) Efficient time series matching by wavelets. In: Proceedings of the 15th IEEE International conference on data engineering, Sydney, Australia, March 23–26, pp 126–133
Chen JS, Moon YS, Yeung HW (2005) Palmprint authentication using time series. In: Proceedings of the 5th international conference on audio- and video-based biometric person authentication, Hilton Rye Town, NY, July 20–22
Chiu B, Keogh E, Lonardi S (2003) Probabilistic discovery of time series motifs. In: Proceedings of the 9th ACM SIGKDD International conference on knowledge discovery and data mining, Washington DC, USA. August 24–27, pp 493–498
Dasgupta D, Forrest S (1999) Novelty detection in time series data using ideas from immunology. In: Proceedings of the 8th International conference on intelligent systems, Denver, CO, June 24–26
Daw CS, Finney CEA, Tracy ER (2001) Symbolic analysis of experimental data. Review of Scientific Instruments 74:915–930
Ding C, He X, Zha H, Simon H (2002) Adaptive dimension reduction for clustering high dimensional data. In: Proceedings of the 2nd IEEE International conference on data mining. Maebashi, Japan, December 9–12, pp 147–154
Duchene F, Garbay C (2005) Apprentissage de Motifs Temporels, Multidimensionnels et Heterogenes: Application a la Telesurveillance Medicale. In: Proceedings of conference francophone sur l’apprentissage Automatique. Nice, France. May 31–June 3
Duchene F, Garbay C, Rialle V (2004) Mining heterogeneous multivariate time-series for learning meaningful patterns: application to home health telecare. Research Report 1070-I, Institue de Informatique et Mathematiques Appliquees de Grenoble, Grenoble, France
Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press
Faloutsos C, Ranganathan M, Manolopulos Y (1994) Fast subsequence matching in time-series databases. SIGMOD Record, vol 23, pp 419–429
Fayyad U, Reina C, Bradley P (1998) Initialization of iterative refinement clustering algorithms. In: Proceedings of the 4th International conference on knowledge discovery and data mining, New York, NY, August 27–31, pp 194–198
Ferreira PG, Azevedo P, Silva C, Brito R (2006) Mining approximate motifs in time series. In: Proceedings of the 9th international conference on discovery science, Barcelona, Spain, October 7–10
Gavrilov M, Anguelov D, Indyk P, Motwani R (2000) Mining the stock market: which measure is best? In: Proceedings of the 6th ACM International conference on knowledge discovery and data mining, Boston, MA, August 20–23, pp 487–496
Geurts P (2001) Pattern extraction for time series classification. In: Proceedings of the 5th European conference on principles of data mining and knowledge discovery, Freiburg, Germany, pp 115–127
Gionis A, Mannila H (2003) Finding recurrent sources in sequences. In: Proceedings of the 7th International conference on research in computational molecular biology. Berlin, Germany, pp 123–130
Hellerstein JM, Papadimitriou CH, Koutsoupias E (1997) Towards an analysis of indexing schemes. In: Proceedings of the 16th ACM symposium on principles of database systems, Tucson, AZ, May 12–14, pp 249–256
Huang YW, Yu PS (1999) Adaptive query processing for time-series data. In: Proceedings of the 5th ACM SIGKDD International conference on knowledge discovery and data mining, San Diego, CA, Aug 15–18, pp 282–286
Hugueney B (2006) Adaptive segmentation-based symbolic representation of time series for better modeling and lower bounding distance measures. In: Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases, Berlin, Germany, September 18–22, pp 545–552
Kalpakis K, Gada D, Puttagunta V (2001) Distance measures for effective clustering of arima time-series. In: Proceedings of the 2001 IEEE International conference on data mining, San Jose, CA, November 29-December 2, pp 273–280
Keogh E, Chakrabarti K, Pazzani M (2001a) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of ACM SIGMOD conference on management of data, Santa Barbara, May 21–24, pp 151–162
Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001b) Dimensionality reduction for fast similarity search in large time series databases. J Knowledge Inform Syst. 3:263–286
Keogh E, Kasetty S (2002) On the need for time series data mining benchmarks: a survey and empirical demonstration. In: Proceedings of the 8th ACM SIGKDD International conference on knowledge discovery and data mining, Edmonton, Alberta, Canada, July 23–26, pp 102–111
Keogh E, Lin J, Fu AW (2005) HOT SAX: efficiently finding the most unusual time series subsequence. In: Proceedings of the 5th IEEE international conference on data mining, Houston, TX, November 27–30, pp 226–233
Keogh E, Lonardi S, Chiu B (2002) Finding surprising patterns in a time series database in linear time and space. In: Proceedings of the 8th ACM SIGKDD International conference on knowledge discovery and data mining, Edmonton, Alberta, Canada, July 23–26, pp 550–556
Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle. August 22–25, pp 206–215
Keogh E, Pazzani M (1998) An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: Proceedings of the 4th International conference on knowledge discovery and data mining, New York, NY, August 27–31, pp 239–241
Kumar N, Lolla N, Keogh E, Lonardi S, Ratanamahatana CA, Wei L (2005) Time series bitmaps: a practical visualization tool for working with large time series databases. In: Proceedings of the 2005 SIAM international conference on data mining, Newport Beach, CA, April 21–23, pp 531–535
Larsen RJ, Marx ML (1986) An introduction to mathematical statistics and its applications, 2nd edn. Prentice Hall, Englewood, Cliffs, NJ
Lin J, Keogh E (2006) Group SAX: extending the notion of contrast sets to time series and multimedia data. In: Proceedings of the 10th european conference on principles and practice of knowledge discovery in databases. Berlin, Germany, September 18–22, pp 284–296
Lin J, Keogh E, Lonardi S (2005) Visualizing and discovering non-trivial patterns in large time series databases. Inform Visual 4:61–82
Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms, Workshop on Research Issues in Data Mining and Knowledge Discovery, the 8th ACM SIGMOD, San Diego, CA
Lin J, Keogh E, Lonardi S, Lankford JP, Nystrom DM (2004) Visually mining and monitoring massive time series. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, August 22–25, pp 460–469
Lin J, Keogh E, Patel P, Lonardi S (2002) Finding motifs in time series, the 2nd Workshop on Temporal Data Mining, the 8th ACM International conference on knowledge discovery and data mining, Edmonton, Alberta, Canada, pp 53–68
Lkhagva B, Suzuki Y, Kawagoe K (2006) New time series data representation ESAX for financial applications. In: Proceedings of the 22nd international conference on data engineering workshops, Atlanta, GA, April 3–8, pp 115
Lonardi S (2001) Global detectors of unusual words: design implementation and applications to pattern discovery in biosequences. Department of Computer Sciences, Purdue University
McGovern A, Kruger A, Rosendahl D, Droegemeier K (2006) Open problem: dynamic relational models for improved hazardous weather prediction. In: Proceedings of ICML workshop on open problems in statistical relational learning, Pittsburgh, PA, June 29
Mörchen F, Ultsch A (2005) Optimizing time series discretization for knowledge discovery. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining, Chicago, IL, August 21–24, pp 660–665
Murakami K, Yano Y, Doki S, Okuma S (2004) Behavior extraction from a series of observed robot motion. In: Proceedings of JSME conference on robotics and mechatronics. Nagoya, Japan, June
NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/
Ohsaki M, Sato Y, Yokoi H, Yamaguchi T (2003) A rule discovery support system for sequential medical data, in the case study of a chronic hepatitis dataset. Discovery challenge workshop, the 14th European conference on machine learning/the 7th european conference on principles and practice of knowledge discovery in databases, Cavtat-Dubrovnik, Croatia
Pouget F, Urvoy-Keller G, Dacier M (2006) Time signature to detect multi-headed stealthy attack tools. In: Proceedings of the 18th annual first conference, Baltimore, MD, June 25–30
Ratanamahatana CA, Keogh E, Bagnall AJ, Lonardi S (2005) A novel bit level time series representation with implications for similarity search and clustering. In: Proceedings of advances in knowledge discovery and data mining, 9th Pacific-Asia conference, Hanoi Vietnam, May 18–20, pp 771–777
Reinert G, Schbath S, Waterman MS (2000) Probabilistic and statistical properties of words: an overview. J Comput Biol 7:1–46
Roddick, JF, Hornsby K, Spiliopoulou M (2001) An updated bibliography of temporal, spatial and spatio-temporal data mining research. Post-workshop proceedings of the international workshop on temporal, spatial and spatio-temporal data mining, Springer, Berlin, pp 147–163
Shahabi C, Tian X, Zhao W (2000) TSA-tree: a wavelet-based approach to improve the efficiency of multi-level surprise and trend queries. In: Proceedings of the 12th international conference on scientific and statistical database management, Berlin, Germany, July 26–28, pp 55–68
Silvent A, Carbay C, Carry, PY, Dojat M (2003) Data information and knowledge for medical scenario construction. In: Proceedings of the intelligent data analysis in medicine and pharmacology workshop, Protaras, Cyprus, October 19–22
Silvent A, Dojat M, Garbay C (2004) Multi-level temporal abstraction for medical scenario construction. Int J Adapt Control Signal Process
Staden R (1989) Methods for discovering novel motifs in nucleic acid sequences. Comput Appl Biosci 5:293–298
Tanaka Y, Uehara K (2003) Discover motifs in multi dimensional time-series using the principal component analysis and the MDL principle. In: Proceedings of the 3rd international conference on machine learning and data mining in pattern recognition, Leipzig, Germany, July 5–7, pp 252–265
Tanaka Y, Uehara K (2004) Motif discovery algorithm from motion data. In: Proceedings of the 18th annual conference of the japanese society for artificial intelligence, Kanazawa, Japan, June 2–4
Tompa M, Buhler J (2001) Finding motifs using random projections. In: Proceedings of the 5th International conference on computational molecular biology, Montreal, Canada, April 22–25, pp 67–74
Vlachos M, Kollios G, Gunopulos G (2002) Discovering similar multidimensional trajectories. In: Proceedings of the 18th international conference on data engineering, San Jose, CA, February 26-Mar 1, pp 673–684
Wei L, Keogh E, Xi X (2006) SAXually explicit images: finding unusual shapes. In: Proceedings of the 2006 IEEE international conference on data mining, Hong Kong, December 18–22
Yi BK, Faloutsos C (2000) Fast time sequence indexing for arbitrary lp norms. In: Proceedings of the 26th international conference on very large databases, Cairo, Egypt, September 10–14, pp 385–394
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Johannes Gehrke.
Rights and permissions
About this article
Cite this article
Lin, J., Keogh, E., Wei, L. et al. Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Disc 15, 107–144 (2007). https://doi.org/10.1007/s10618-007-0064-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-007-0064-z