Abstract
Naïve extensions of uni-variate prediction techniques lead to an unwelcome increase in the cost of multi-variate model learning and a significant deterioration in model performance. In this paper, we first argue that (a) one can learn a more accurate forecasting model by leveraging temporal alignments among variates to quantify the importance of the recorded variates with respect to a target variate. We further argue that (b) for this purpose we need to quantify temporal correlation, not in terms of series similarity, but in terms of temporal alignments of the key “events” impacting these series. Finally, we argue that (c) while learning a temporal model using recurrence-based techniques (such as RNNs and LSTMs, even when leveraging attention strategies) is difficult and costly, we can achieve better performance by coupling simpler CNNs with an adaptive variate selection strategy. Relying on these arguments, we propose the Selego framework (Selego is a word of Latin origin meaning “selection”) for variate selection, and we experimentally evaluate the performance of the proposed approach on various forecasting models, such as LSTMs, RNNs, and CNNs, for different top-X% variate selections and different forecasting leads (how far into the future we predict) on multiple real-world datasets. Experiments show that the proposed framework can offer significant (\(90-98\%\)) drops in the number of recorded variates needed to train predictive models, while simultaneously boosting accuracy.
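To make the selection idea concrete, the following is a minimal, illustrative sketch, not the framework's actual feature extraction or alignment machinery: each candidate variate is scored by how well its detected “events” align temporally with the target's events, and only the top-X% are retained for model training. The peak-based event detector and the tolerance-windowed matching score below are placeholder choices made purely for illustration.

```python
# Illustrative sketch only: variate selection by temporal alignment of events.
# The event detector (local peaks) and the alignment score (windowed matching)
# are stand-ins, not the paper's actual feature/alignment computation.
import numpy as np
from scipy.signal import find_peaks

def events(series: np.ndarray) -> np.ndarray:
    """Detect candidate 'events' as salient local peaks of the z-normalized series."""
    z = (series - series.mean()) / (series.std() + 1e-9)
    peaks, _ = find_peaks(np.abs(z), height=1.0)
    return peaks

def alignment_score(target_ev: np.ndarray, cand_ev: np.ndarray, tol: int = 3) -> float:
    """Fraction of target events matched by some candidate event within +/- tol steps."""
    if len(target_ev) == 0 or len(cand_ev) == 0:
        return 0.0
    matched = sum(np.min(np.abs(cand_ev - t)) <= tol for t in target_ev)
    return matched / len(target_ev)

def select_top_variates(X: np.ndarray, y: np.ndarray, top_pct: float = 0.05) -> np.ndarray:
    """Rank the columns of X (one variate per column) by event alignment
    with the target series y and keep the top `top_pct` fraction."""
    target_ev = events(y)
    scores = [alignment_score(target_ev, events(X[:, j])) for j in range(X.shape[1])]
    k = max(1, int(top_pct * X.shape[1]))
    return np.argsort(scores)[::-1][:k]
```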






Notes
While the terms “variate” and “feature” are often used interchangeably, in this paper, we make a clear distinction: A “variate” is an input time series describing a time-varying property of the system being observed, whereas a “feature” is a temporal pattern extracted from a given time series and can be used to characterize that series.
Without loss of generality, in the experiments reported in Sect. 3, we consider target sets each with a single variate (i.e., \(|{\mathbb {Y}}|\) = 1).
Our source code and the public data sets used in these experiments are available.
Results presented in this paper were obtained using the NSF testbed “Chameleon: A Large-Scale Re-configurable Experimental Environment for Cloud Research”.
Since the components of the FRESH feature vector are potentially of very different scales, each component has been re-scaled to between 0 and 1 to prevent large-valued components from having undue bias in the final ranking (a minimal rescaling sketch follows these notes).
We report the best model performance across 200 epochs.
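The sketch below illustrates the per-component rescaling mentioned in the notes, assuming `F` is an (n_variates × n_components) matrix of FRESH feature vectors, one row per variate; the variable names and layout are illustrative.

```python
# Minimal sketch of the [0, 1] rescaling described in the notes above.
import numpy as np

def minmax_rescale(F: np.ndarray) -> np.ndarray:
    """Rescale each column (feature-vector component) of F to the [0, 1] range."""
    lo, hi = F.min(axis=0), F.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (F - lo) / span
```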
Acknowledgements
This work is partially supported by NSF#1827757 “Building Doctor’s Medicine Cabinet (BDMC): Data-Driven Services for High Performance and Sustainable Buildings”, NSF#1610282 “DataStorm: A Data Enabled System for End-to-End Disaster Planning and Response”, NSF#1633381 “BIGDATA: Discovering Context-Sensitive Impact in Complex Systems”, NSF#1909555 “pCAR: Discovering and Leveraging Plausibly Causal (p-causal) Relationships to Understand Complex Dynamic Systems”, and DOE grant “Securing Grid-interactive Efficient Buildings (GEB) through Cyber Defense and Resilient System (CYDRES)”. Part of the research was carried out using the Chameleon testbed supported by the NSF.
Responsible editor: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann.
Appendix—sample series and feature distributions
Fig. 7: a The target variable NDX (the NASDAQ index); b the six series best aligned with it (note that alignment of series does not necessarily imply that the series are globally similar; it only means that they show evidence of the same underlying events); c a poorly aligned series; d–k temporal distributions (time and length) of the features identified in these series (here the X-axis denotes time and the Y-axis the length of the feature identified at a particular point in time)
Fig. 8: a The target variable AAPL (the symbol for the Apple stock); b the six series best aligned with it (note that alignment of series does not necessarily imply that the series are globally similar; it only means that they show evidence of the same underlying events); c a poorly aligned series; d–k temporal distributions (time and length) of the features identified in these series (here the X-axis denotes time and the Y-axis the length of the feature identified at a particular point in time)
Fig. 9: a The target variable fuel consumption; b the six series best aligned with it (note that alignment of series does not necessarily imply that the series are globally similar; it only means that they show evidence of the same underlying events); c a poorly aligned series; d–k temporal distributions (time and length) of the features identified in these series (here the X-axis denotes time and the Y-axis the length of the feature identified at a particular point in time)
Figures 7 through 9 provide examples of target variables and the best series aligned with them based on feature distributions, along with a sample poorly aligned series. In order to better visualize the feature alignments, consecutive series (e.g., consecutive days in NASDAQ) have been concatenated, and the number of feature layers considered in these charts has been raised relative to the number of layers considered in the experiments. As we see in these figures, temporal alignment of variates does not mean that they must look similar: instead, alignment only means that the two series show evidence of being impacted by the same underlying events. In Fig. 9b, for example, we see six variates that, together, predict the fuel consumption series in Fig. 9a well. We also see that these series used for model training are temporally aligned with the target series but are not necessarily similar to it.
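As a hedged sketch of how panels d–k could be produced, each detected feature can be drawn as a point at (time of occurrence, feature length); the `features` list of (time, length) pairs is assumed to come from the multi-scale feature detector, whose construction is not shown here.

```python
# Illustrative only: scatter each detected feature at (time, length),
# matching the axes described in the captions of Figs. 7-9.
import matplotlib.pyplot as plt

def plot_feature_distribution(features, title=""):
    """Scatter plot of feature occurrence times (X) vs. feature lengths (Y)."""
    if features:
        times, lengths = zip(*features)
        plt.scatter(times, lengths, s=10)
    plt.xlabel("time")
    plt.ylabel("feature length")
    plt.title(title)
    plt.show()
```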
Cite this article
Tiwaskar, M., Garg, Y., Li, X. et al. Selego: robust variate selection for accurate time series forecasting. Data Min Knowl Disc 35, 2141–2167 (2021). https://doi.org/10.1007/s10618-021-00777-1