Abstract
Time series anomaly detection is one of the most active areas of research in data mining, with dozens of new approaches been suggested each year. In spite of all these creative solutions proposed for this problem, recent empirical evidence suggests that the time series discord, a relatively simple twenty-year old distance-based technique, remains among the state-of-art techniques. While there are many algorithms for computing the time series discords, they all have limitations. First, they are limited to the batch case, whereas the online case is more actionable. Second, these algorithms exhibit poor scalability beyond tens of thousands of datapoints. In this work we introduce DAMP, a novel algorithm that addresses both these issues. DAMP computes exact left-discords on fast arriving streams, at up to 300,000 Hz using a commodity desktop. This allows us to find time series discords in datasets with trillions of datapoints for the first time. We will demonstrate the utility of our algorithm with the most ambitious set of time series anomaly detection experiments ever conducted. We will further show that our speedup improvements can be applied in the multidimensional case.
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
Note that some papers misattribute the success of their anomaly detection to the Matrix Profile or to HOTSAX, but these are simple different algorithms to compute time series discords.
This observation is true for heartbeats, gaits, machine cycles etc. One exception is for events tied to a cultural calendar. For example, for taxi demand or electrical power demand, the most similar day to any given day, is not the previous day, but the same day one week earlier.
Actually, using exactly one heartbeat (or pattern more generally), may make the downstream algorithms brittle to the choice of the starting point of the heartbeat. To bypass this issue, we always extract two consecutive beats.
Here we explain “non-trivial anomaly detector”. Simple rule-based conditionals such as: “if the time series ever reports a value that is higher than any value you have seen before, then flag anomaly” could be used as an anomaly detector, and could be instantaneously instantiated. By non-trivial we mean any TSAD algorithm that examines each subsequence for any information about shape, autocorrelation, Markov properties etc., and compares this information (in the most general sense), to a model gleaned from training data. The reader will appreciate that this includes essentially all proposed anomaly detectors in the literature.
References
Aubet F-X, Zügner D, Gasthaus J (2021) Monte Carlo EM for deep time series anomaly detection. arXiv:2112.14436 [cs, stat]
Audibert J, Marti S, Guyard F, Zuluaga MA (2021) From univariate to multivariate time series anomaly detection with non-local information. In: Lemaire V, Malinowski S, Bagnall A et al (eds) Advanced analytics and learning on temporal data. Springer International Publishing, Cham, pp 186–194
Boniol P, Linardi M, Roncallo F et al (2021a) Unsupervised and scalable subsequence anomaly detection in large data series. VLDB J 30:909–931. https://doi.org/10.1007/s00778-021-00655-8
Boniol P, Paparrizos J, Palpanas T, Franklin MJ (2021b) SAND: streaming subsequence anomaly detection. Proc VLDB Endow 14:1717–1729
Case Western Reserve University Bearing Data Center (2021) Available: https://csegroups.case.edu/bearingdatacenter/home. Accessed: Nov. 15, 2021
CNC Crashes. Video. (15 Feb 2018). from https://youtu.be/t2tBtZCa7j4?t=205. Retrieved December 20, 2021
Daigavane A, Wagstaff KL, Doran G et al (2022) Unsupervised detection of Saturn magnetic field boundary crossings from plasma spectrometer data. Comput Geosci 161:105040
DAMP (2022) https://sites.google.com/view/discord-aware-matrix-profile
Dau HA, Bagnall A, Kamgar K et al (2019) The UCR time series archive. IEEE/CAA J Autom Sin 6:1293–1305. https://doi.org/10.1109/JAS.2019.1911747
Doshi K, Abudalou S, Yilmaz Y (2022) TiSAT: time series anomaly transformer. arXiv:2203.05167 [cs, eess, stat]
Higham NJ (2002) Accuracy and stability of numerical algorithms, 2 edn. ISBN: 978-0-89871-521-7
Hundman K, Constantinou V, Laporte C et al (2018) Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, London United Kingdom, pp 387–395
Imani S, Madrid F, Ding W et al (2020) Introducing time series snippets: a new primitive for summarizing long time series. Data Min Knowl Disc 34:1713–1743. https://doi.org/10.1007/s10618-020-00702-y
Keogh E (2021) Irrational exuberance why we should not believe 95% of papers on time series anomaly detection. In: 7th SIGKDD workshop on mining and learning from time series at SIGKDD 2021. Workshop Keynote https://www.youtube.com/watch?v=Vg1p3DouX8w&t=324s
Khansa HE, Gervet C and Brouillet A (2012) Prominent discord discovery with matrix profile: application to climate data insight. In: 10th international conference of advanced computer science & information technology (ACSIT 2022) May 21~22, 2022, Zurich, Switzerland
Kirti R, Karadi R (2012) Cardiac tamponade: atypical presentations after cardiac surgery. Acute Med 11:93–96
Mueen A, Zhu Y, Yeh M et al (2017) The fastest similarity search algorithm for time series subsequences under euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.htmlAccessed 24 Janurary, 2022
Nakamura T, Imamura M, Mercer R, Keogh E (2020) Merlin: parameter-free discovery of arbitrary length anomalies in massive time series archives. In: 2020 IEEE international conference on data mining (ICDM). IEEE, Sorrento, Italy, pp 1190–1195
National Weather Service. January 24, 2019 heavy rain and flooding. From https://www.weather.gov/aly/24Jan19HeavyRainFlood. Retrieved May 1 2022
Neupane D, Seok J (2020) Bearing fault detection and diagnosis using case western reserve university dataset with deep learning approaches: A review. IEEE Access 8:93155–93178. https://doi.org/10.1109/ACCESS.2020.2990528
Nilsson F (2022) Joint human-machine exploration of industrial time series using the matrix profile. In: Halmstad university, school of information technology, Halmstad embedded and intelligent systems research (EIS), CAISR—center for applied intelligent systems research
Palpanas T (2022) Personal communication June 4th 2022
Paparrizos J, Kang Y, Boniol P et al (2022) TSB-UAD: An end-to-end benchmark suite for univariate time-series anomaly detection. In: Proceedings of the VLDB endowment (PVLDB) journal
Park D, Hoshi Y, Kemp CC (2018) A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder. IEEE Robot Autom Lett 3:1544–1551. https://doi.org/10.1109/LRA.2018.2801475
Park JY, Wilson E, Parker A, Nagy Z (2020) The good, the bad, and the ugly: data-driven load profile discord identification in a large building portfolio. Energy Build 215:109892
Silive.com. Wild storm pelts Staten Island with giant hail—‘threat of tornado has passed’ from https://www.silive.com/news/2019/05/nws-issues-tornado-warning-for-staten-island.html. Retrieved May 1 2022
Su Y, Zhao Y, Niu C et al (2019) Robust anomaly detection for multivariate time series through stochastic recurrent neural network pp 2828–2837
Thill M, Konen W, Bäck T (2020) Time series encodings with temporal convolutional networks. Springer, Cham, pp 161–173
Truong HT, Ta BP, Le QA et al (2022) Light-weight federated learning-based anomaly detection for time-series data in industrial control systems. Comput Ind 140:103692. https://doi.org/10.1016/j.compind.2022.103692
Wastewater News. Valentine’s day storm slams California, pushing water agencies to the edge. From www.news.cornell.edu/Chronicle/00/5.18.00/wireless_class.html. Retrieved Dec 1 2021
Wikipedia. Leap year problem. from https://en.wikipedia.org/wiki/Leap_year_problem. Retrieved December 1, 2021
Wu R, Keogh E (2021) Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2021.3112126
Yeh C-CM, Zheng Y, Wang J et al (2021) Error-bounded approximate time series joins using compact dictionary representations of time series. CoRR abs arXiv:2112.12965
Yeh C-CM, Zhu Y, Dau HA et al (2019) Online amnestic dtw to allow real-time golden batch monitoring. pp 2604–2612
Zheng X, Xu N, Trinh L et al (2021) PSML: a multi-scale time-series dataset for machine learning in decarbonized energy grids. arXiv preprint arXiv: 2110.06324
Zhu Y, Yeh C-CM, Zimmerman Z et al (2018) Matrix profile XI: SCRIMP++: time series motif discovery at interactive speeds. In: IEEE pp 837–846
Funding
This research was supported by NSF OIA-1757207, CNS-2008910 and RI-2104537, the French National Research Agency (ANR-19-P3IA-0002), and NSF 2103976, Mitsubishi, Visa and Toyota.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflict of interest.
Ethical approval
This research complies with all ethical guidelines at the three institutions represented.
Human or animal rights
No human subjects were used by the current authors. The ECG data was collected by others over a decade ago, under strict human subject protocols.
Additional information
Communicated by Johannes Fürnkranz.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: tool to set window size
Appendix: tool to set window size
For most of the experiments in this paper, we use human intuition to manually set the value of the parameter m. We followed the well-known “folk wisdom” that was explained and tested in Nakamura et al. (2020). The idea is simply this. Plot a subset of the data, estimate the length of a single period “by-eye” and set m to a value a little less than a period. Here we used m = 90% of estimated period length. In (Nakamura et al. 2020) the authors show that in most cases, the anomalies returned doing this, would not change even if you set m to twice or half this value (They were using the full MP, not the left MP, but we found similar results with DAMP).
However, for the experiment in Table 9, we want to completely exclude any human intervention or tuning. Therefore, we used the following four lines of Matlab code to automatically calculate the window size for the 250 KDD Cup 2021 datasets.
The window size is obtained by finding the peak of autocorrelation in the range of 10–1000 (the value of parameter m is limited to the range of 10–1000). To avoid the ‘findpeaks’ function returning a null value (very unlikely, but possible), we set the default value of m to 1000.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lu, Y., Wu, R., Mueen, A. et al. DAMP: accurate time series anomaly detection on trillions of datapoints and ultra-fast arriving data streams. Data Min Knowl Disc 37, 627–669 (2023). https://doi.org/10.1007/s10618-022-00911-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-022-00911-7