1 Introduction

Online learning (OL) is an important field of machine learning research that allows supervised learning to be conducted on data streams [9, 30]. Learning from data streams can be challenging, particularly in environments that are non-stationary in nature, which can cause concept drift [9]. Concept drift occurs when the underlying concept changes over time, causing changes to the distribution of data, and requires predictive models to be updated or discarded to maintain effective predictions. To build accurate models, many real-world applications require large amounts of training data, which is often limited if concept drifts occur [24].

Transfer learning (TL) is another prominent field of machine learning research, which allows models to be learnt in domains where training data is readily available, and used to build more effective predictors in domains where it is limited [24]. TL has typically been conducted offline, limiting its use in real-world online environments [36]. It may be desirable to use on-device learning to personalise the functionalities of user-facing applications, where a rich history of data may not be available locally due to memory limitations, and drifts may be encountered frequently. Predictive performance could be enhanced by using TL in an online setting, where knowledge learnt from other data streams aids the target predictor.

The Online Transfer Learning (OTL) framework, developed by Zhao et al. [36], was proposed to enable TL to be used within an online setting. Current versions of OTL assume the source is in an offline environment [10, 12, 32], ignoring the possibility of concept drift occurring in a source domain.

In this paper, we propose the Bi-directional Online Transfer Learning (BOTL) framework, which considers both source and target domains to be online. This has three benefits over existing approaches. Firstly, individual concepts are learnt in a source domain, using concept drift detection strategies, and transferred to other domains to improve their predictive performance [9]. Secondly, as new concepts are encountered in a source domain, additional knowledge of the new concept is transferred to other domains. Thirdly, knowledge can be transferred bi-directionally, enabling more effective predictions to be made in both source and target domains. Specifically, we:

  • Introduce the BOTL framework, enabling each domain to benefit from online TL in a regression setting,

  • Consider the theoretical loss of BOTL, showing that predictions made by BOTL are no worse than those of the underlying concept drift detection algorithm,

  • Introduce a novel drift detector, AWPro, which has combined benefits of RePro [34] and ADWIN [3], and

  • Show that the performance of BOTL exceeds that of an existing state-of-the-art online transfer learning technique and of existing concept drift detection algorithms with no knowledge transfer, using a variety of datasets.

We evaluate BOTL in a regression setting using two synthetic datasets and one real-world dataset containing both sudden and gradual drifts. We use BOTL in conjunction with three concept drift detection strategies to identify the underlying drifts occurring locally in each domain, namely RePro [34] and ADWIN [3], and a novel drift detection algorithm, Adaptive Windowing with Proactive drift detection (AWPro). AWPro combines key characteristics exhibited by ADWIN and RePro that are beneficial to the BOTL framework when used within applications that have computational and communication limitations.

We compare BOTL with a state-of-the-art online TL framework, the Generalised Online Transfer Learning (GOTL) framework, which assumes the source is offline [12].

The remainder of this paper is organised as follows. Section 2 outlines related work. Section 3 formulates the setting in which BOTL is used. Section 4 presents the proposed framework, and the theoretical loss of BOTL is presented in Section 5. Section 6 specifies the datasets used to investigate the applicability of underlying concept drift detection strategies, and evaluate the BOTL framework. Section 7 outlines the three concept drift detection strategies and discusses their limitations for use in BOTL. Section 8 presents empirical results of BOTL using a variety of datasets, highlighting the beneficial characteristics of concept drift detection strategies, robustness to noise, and applicability to real-world data. Finally, Section 9 concludes the paper.

2 Related work

Online TL combines OL and TL. The aim of TL is to use knowledge learnt for a predictive task in one domain, referred to as the source, to improve the effectiveness of predictions in another domain, referred to as the target [23]. There are three distinct types of TL: inductive, transductive, and unsupervised [1, 5, 24]. Inductive TL is used when source and target predictive tasks are different. Knowledge is transferred from the source to induce a supervised predictive function in the target [5]. Typically, large amounts of labelled target data are required to create a mapping between domains [24]. Unsupervised TL is applied in a similar way, but to unsupervised learning tasks, such as clustering [24]. Transductive TL is used when the source and target predictive tasks are the same, transferring knowledge to improve the predictive performance in a target domain where no labelled data is available [1]. TL can be further categorised as homogeneous, where the domains of source and target are the same, or heterogeneous, where they differ [36]. In this paper, we consider a homogeneous setting, and use inductive TL to improve the predictive performances within both source and target domains.

It is desirable for many modern applications, such as smart home heating systems, to predict future events from historical data. However, applications are often limited by memory constraints, preventing a complete history of data being retained [11]. Additionally, due to the dynamic and non-stationary environment of data streams, the underlying concept may evolve or drift over time [18]. Concept drift is a change in the distribution of the observed data, or a change in the mapping between observations and response variables [15]. If the underlying concept changes, the previously built model may no longer make effective predictions, requiring the model to be modified or re-learnt [9].

To maintain effective predictions, concept drift detection algorithms are frequently used in OL. Concept drift detection algorithms typically use a sliding window to maintain a subset of recent instances, usually used to update or rebuild the predictive model. Strategies to update a model include ensemble learning approaches, where the window of recent instances is used to create a new model and combined with previously learnt models to improve the predictive performance. Model predictions are aggregated; for example, Dynamic Weighted Majority (DWM) uses a mean weighted by the models’ estimated performance [19].
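To illustrate this style of aggregation, the following sketch computes a performance-weighted mean over a set of base regressors (a minimal illustration in the spirit of DWM's weighted aggregation, not its exact weight-update rule; scikit-learn-style models with a .predict method are assumed):

```python
import numpy as np

def weighted_mean_predict(models, weights, x):
    """Aggregate base model predictions using a performance-weighted mean.

    `models` are fitted regressors exposing .predict(); `weights` hold each
    model's estimated performance on recent data (illustrative bookkeeping).
    """
    preds = np.array([m.predict(np.atleast_2d(x))[0] for m in models])
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w, preds) / w.sum())  # weighted mean prediction
```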

Alternatively, concept drift detection algorithms such as ADWIN [3] use the window of data to detect concept drifts, and once a drift has been detected, a new predictive model can be learnt that represents the current concept independently of previous concepts. A challenge associated with these concept drift detection strategies is that every time a concept is encountered, a new model must be learnt, and data must be collected to build each model, even if that concept has previously been encountered. RePro [33] uses an approach similar to ADWIN, but retains a history of concepts and concept transitions to prevent learning new models for recurring concepts [34]. This prevents the need to collect new data each time a recurring concept is encountered; however, data must still be collected to build models for new concepts. For many real-world applications, particularly those that are user facing, knowledge obtained from other data streams could enhance predictions when new concepts are encountered through the use of online TL.

Existing online TL frameworks aim to transfer knowledge learnt from an offline source to an online target for classification tasks. OTL [36] combines the offline source model with the online target model using a weighting mechanism that is updated with respect to the performance of the source and target models on a sliding window of data in the target domain. Other online TL frameworks [10, 17, 31] use similar strategies to combine transferred models specifically for classification tasks, and cannot easily be adapted to regressive settings. GOTL [12] extends the OTL weighting mechanism such that online TL can be used for both classification and regression. The weighting mechanism used by GOTL incrementally updates in steps to obtain weightings for source and target models. If the step size, Δ, used to modify the weights is small enough, the ensemble of source and target models approximates the optimal weight combination [11]. However, if the step size is too small, it may take substantial time for the weights to update to their desired values, making predictions unreliable during this period.
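The incremental weighting mechanism can be sketched as follows (a hedged simplification of GOTL's update for a single source and target model; the exact rule in [12] differs in detail):

```python
def gotl_weight_step(w_src, f_src, f_tgt, x, y, delta=0.01):
    """Nudge the source/target weighting by a fixed step size, delta,
    towards whichever model predicted instance (x, y) more accurately."""
    err_src = (f_src(x) - y) ** 2
    err_tgt = (f_tgt(x) - y) ** 2
    if err_src < err_tgt:
        w_src = min(1.0, w_src + delta)  # favour the source model
    elif err_tgt < err_src:
        w_src = max(0.0, w_src - delta)  # favour the target model
    return w_src  # prediction: w_src * f_src(x) + (1 - w_src) * f_tgt(x)
```

A small Δ lets the ensemble approach the optimal weighting, at the cost of a long adjustment period after each drift, as noted above.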

The field of online TL relates to Online Multi-task Learning (OMTL) [22, 25, 26], and Multistream Regression (MSR) [14]. MSR can be seen as a special case, where the source and target data streams are drawn from the same underlying distribution, and all concepts encountered in the target domain have previously been encountered in the source [6]. This means the models transferred from the source can be used to make predictions in the target without requiring a target learner. This is unrealistic for many real-world applications as although source and target domains may be similar, it is unlikely the data streams are drawn from the same distribution. The goal of OMTL is to minimise the cumulative global loss across all domains [21], whereas online TL aims to minimise the predictive losses within each individual domain. Considering loss in this way is beneficial when applied to tasks such as application personalisation, where each domain represents a different user, and prediction errors should be minimised for that specific individual.

Although online TL has been actively studied [10, 12, 31, 32, 35, 36], existing approaches assume the source is offline. We propose BOTL, which considers both source and target in online environments, as might be expected in real-world applications such as smart home heating, or vehicle personalisation such as Adaptive Cruise Control (ACC).

3 Problem formulation

Let domain D consist of a feature space χ, where \(x_{t} \in \mathbb {R}^{m}\) is the instance observed at time t such that \(x_{t} = \{x_{t_{1}},\dots ,x_{t_{m}}\} \in \upchi \). Given domain D, a task consists of the target response variable, \(y \in Y\), where \(y \in \mathbb {R}\), and a regression function, \(f:\upchi \rightarrow Y\), which is learnt to map observed data to the target concept [24]. The knowledge learnt in a source domain, DS, can be transferred to the target domain, DT, and used to enhance predictions [30].

Online TL aims to learn the target predictive function, fT, that effectively predicts the response variable, \(y_{t}^{T} \in Y^{T}\), for each instance, \(x_{t}^{T} \in {{\upchi }^{T}\!}\), observed in the target data stream, such that \({\hat {y}_{t}^{\prime {{T}}_{i}}} = {f^{T}_{i}\!}(x_{t}^{T})\). Model transfer is used to enhance the target predictor by combining knowledge learnt in the local domain with knowledge learnt from other domains. For example, if we consider the scenario of application personalisation, where each domain represents an individual user, each instance, xt, may describe the user’s current environmental setting. If we wish to personalise application functionality by predicting some unknown value, yt, we may be able to utilise knowledge learnt from another user, \({f^{S}_{j}}\), to enhance the predictive performance of the target learner. Identifying concept drift in the source, S, allows models to be transferred, \({f^{S}_{j}}\) where \(j=1{\dots } k\), for each of the k concepts encountered in S.

BOTL aims to minimise the predictive error in the target domain by combining knowledge learnt from the target data stream with models previously learnt in a source domain. Focusing on minimising the loss with respect to the local, or target, domain makes BOTL highly applicable to the task of application personalisation, where predictions are made to benefit a specific individual. To achieve this, if we have a source domain, DS, that has previously learnt models \({f^{S}_{1}\!},\dots ,{f^{S}_{j}}\), and a target domain, DT, that has previously learnt models \({f^{T}_{1}\!},\dots ,{f^{T}_{i}\!}\), at time t, then models \({f^{S}_{1}\!},\dots ,{f^{S}_{j}}\), should be made available to the target domain such that the target learner can benefit from the knowledge learnt in the source domain, DS. As both domains are online, and knowledge transfer is bi-directional, the models \({f^{T}_{1}\!},\dots ,{f^{T}_{i}\!}\) should also be made available to the source domain, DS, such that the source learner can benefit from the knowledge learnt in the target domain, DT.

In this paper, the source and target domains are considered to be homogeneous, such that they share the same underlying feature space, χS = χT, and YS = YT. Although the domains are homogeneous, the underlying concepts to be learnt within source and target domains may not be equivalent; therefore, models from a source domain may not be relevant to the current target concept. BOTL provides a mechanism to combine models and maximise the impact of transferred models on the target. In presenting BOTL, we use the notation detailed in Table 1.

Table 1 Notation

4 Bi-directional Online Transfer Learning

To utilise knowledge of distinct concepts, BOTL hinges upon a sliding window-based concept drift detection algorithm. BOTL uses drift detection strategies that employ batch learners to create base models, \(f_{i}\), from a window of data within a domain. With small windows, batch learners are susceptible to overfitting; however, larger window sizes can cause a reduction in sensitivity to gradual drifts [9]. Alternatively, incremental (or online) learners can be used. However, during periods of gradual drift, data belonging to a new concept may be used to incrementally update the base learner, preventing a drift from being detected. This is problematic for the BOTL framework as a pair of consecutive concepts present in one domain may not exist in another domain, meaning transferred models may be less effective than if they were learnt using data from individual concepts. In this paper, we use three concept drift detection algorithms, RePro [34], ADWIN [3], and a novel drift detection algorithm, AWPro, detailed in Section 7.

Although BOTL uses knowledge learnt from other domains to improve the predictive performance of the target learner, concept drift detection is conducted solely using the locally learnt model. Conducting drift detection independently of any knowledge transfer is necessary as the use of transferred knowledge may enhance the predictive performance across the current window of target data, hindering drift detection.

A common challenge encountered by TL frameworks is negative transfer [24], which occurs when an ineffective model is transferred between domains. To address this, BOTL adopts the notion of model stability, introduced by Yang et al. [33], to determine if a locally learnt model should be transferred to other domains. Yang et al. deem a model to be stable if it has been learnt across 2Wmax instances; however, this does not guarantee that the model is able to make good predictions in the local domain. Therefore, BOTL only considers a model to be stable if it has been used to make predictions across 2Wmax instances without a drift being detected. Unstable models are not transferred, preventing them from negatively impacting the target predictor. Defining model stability in this way prevents BOTL from transferring models that have been learnt from short, noisy periods of data, for example, during drifting periods as one concept changes to another. Once a model is considered to be stable, it is transferred to other domains to aid their respective predictors, as shown in Algorithm 1. This means that the models transferred by BOTL are limited to those that have successfully learnt a concept in their local domain.

Algorithm 1
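A minimal sketch of the stability criterion above is given below (the ModelState bookkeeping and the send_to_peers callback are illustrative assumptions, not the exact mechanics of Algorithm 1):

```python
from dataclasses import dataclass

@dataclass
class ModelState:
    n_predicted: int = 0       # instances predicted since the model was learnt
    drift_seen: bool = False   # has a drift been detected in that period?
    transferred: bool = False  # has the model already been sent to peers?

def is_stable(state: ModelState, w_max: int) -> bool:
    """Stable: used for predictions across 2*W_max instances, drift-free."""
    return not state.drift_seen and state.n_predicted >= 2 * w_max

def maybe_transfer(model, state: ModelState, w_max: int, send_to_peers) -> None:
    # Transfer a newly stable model to all other domains exactly once.
    if not state.transferred and is_stable(state, w_max):
        send_to_peers(model)
        state.transferred = True
```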

Knowledge transfer is achieved in BOTL by communicating models across domains. When model \(f^{S}_{j}\) is received from a source domain, it is added to the set of transferred models, M, and combined with the target predictor, \(f^{T}_{i}\), to enhance the overall predictive performance. Our instantiation of BOTL uses an Ordinary Least Squares (OLS) regressor as a meta-learner to combine the available models such that the squared error of the predicted values, \(\hat {y}\), across W is minimised. Other regression meta-learners that are less prone to overfitting on small windows of data, such as Ridge Regression [4], could be used in place of OLS. In this paper, we have chosen to use OLS as the meta-learner as it does not require additional parameters, which would have to be determined from domain expertise or parameter tuning prior to learning in each data stream.

Each transferred model, \({{f^{S}_{j}} \in {M}}\), and the current target model, \(f^{T}_{i}\), are used to generate a new window of data. Each sample \(x_{t}^{\prime }\) in the newly generated window of data is of the form \(x_{t}^{\prime }\! =\{{\hat {y}_{t}^{\prime {{S}}_{1}}},\dots ,{\hat {y}_{t}^{\prime {{S}}_{k}}},{\hat {y}_{t}^{\prime {{T}}_{i}}}\}\), where \({\hat {y}_{t}^{\prime {{S}}_{j}}}\) for all \(j = 1,\dots ,k\) is the predicted value of source model \({f^{S}_{j}}\) on instance xt from the original window of target data, and \({\hat {y}_{t}^{\prime {{T}}_{i}}}\) is the predicted value of the locally learnt target model, \(f^{T}_{i}\), learnt using the underlying concept drift detection algorithm for the current concept, ci. This window of model predictions is used by the OLS meta-learner to obtain the overarching predictive function:

$$ \begin{array}{@{}rcl@{}} {\hat{y}_{t}} & = &{F^{{M}}}(x_{t}^{\prime}) \\ & = &w_{0}+\left( \sum\limits_{j=1}^{k}w_{j}{f^{S}_{j}}(x_{t})\right) +w_{(k+1)}{f^{T}_{i}\!}(x_{t}). \end{array} $$
(1)

As the OLS meta-learner is prone to overfitting when the window size is small and the number of base learners is large, the BOTL framework only uses the current target model, \(f^{T}_{i}\), as input to the meta-learner. Other historical models learnt within the data stream are excluded from the meta-learning process as the underlying concept drift detection strategy deems the current target model, \({f^{T}_{i}\!}\), to be the most relevant with respect to the current concept.
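The meta-learning step can be sketched as follows (a minimal least-squares fit mirroring Eq. 1, assuming scikit-learn-style base models passed in a fixed order, source models first; the paper's implementation details may differ):

```python
import numpy as np

def fit_ols_meta_learner(models, X_window, y_window):
    """Fit the OLS meta-learner F^M over the current window W.

    Each meta-instance x'_t holds the predictions of the transferred source
    models and the current target model on the original instance x_t.
    """
    preds = np.column_stack([m.predict(X_window) for m in models])
    A = np.column_stack([np.ones(len(X_window)), preds])  # bias term w_0
    w, *_ = np.linalg.lstsq(A, y_window, rcond=None)      # minimise squared error
    return w

def meta_predict(models, w, x):
    # Evaluate Eq. 1: w_0 plus the weighted base model predictions.
    preds = np.array([m.predict(np.atleast_2d(x))[0] for m in models])
    return float(w[0] + preds @ w[1:])
```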

4.1 Bi-directional transfer

BOTL considers the scenario where all domains are online; therefore, distinctions between source and target can be disregarded. In this paper, BOTL conducts peer-to-peer model transfer, allowing knowledge transfer to enhance the predictive performances of all domains. When a newly learnt model is stable, it is transferred to all other domains in the framework, and each domain updates its model set, M, when a concept drift is encountered.

Real-world applications, such as smart home heating system personalisation, may be comprised of a large number of domains, rapidly increasing the number of models to be transferred as the number of domains grows. Such applications can suffer in predictive performance due to the curse of dimensionality, where the number of input features to the OLS meta-learner becomes large in comparison with the window size [8]. To combat this, we introduce culling to BOTL, referred to as BOTL-C.

4.2 Model culling

Culling transferred models from the model set, M, helps prevent the OLS meta-learner overfitting when a large number of models have been transferred and only a small window of data is available. This could be achieved by limiting the maximum number of models used by the meta-learner. However, due to the dynamic nature of the online environment, the maximum number of models that will prevent the meta-learner overfitting cannot be known in advance. Therefore, a conservative estimate would have to be made, requiring additional domain expertise. Using a conservative estimate may prevent beneficial transferred knowledge from being used to aid the target predictor.

Alternatively, transferred models can be evaluated on the current window of data in order to discard transferred models that are considered to be the least beneficial to the target learner. We achieve this by introducing two variants of BOTL-C. Firstly, BOTL-C.I reduces the number of models available to the OLS meta-learner by temporarily removing transferred models from the model set, M, when their R2 performance across the current window of data drops below a threshold, λcperf. These models can be considered to be the least beneficial to the target learner as they achieve poor predictive performance on the current window of data. Culled models are re-added to M when a concept drift is encountered to enhance predictions of future concepts in the target domain. Although this method of culling is naïve, it can reduce the impact of negative transfer.

In scenarios with high volumes of model transfer, BOTL-C.I requires a high λcperf to sufficiently reduce the number of models to prevent the OLS meta-learner overfitting. This can be detrimental as a high proportion of the transferred models containing useful information are culled and no longer available to enhance the predictive performance of the target learner. To overcome this, BOTL-C.II, outlined in Algorithm 2, evaluates transferred models based on both performance and diversity, metrics commonly used in ensemble pruning [37]. Initially, BOTL-C.II reduces the impact of negative transfer by culling models that achieve an R2 performance less than λcperf on W. A low λcperf value is preferred, ensuring transferred models containing some useful information are retained. Using a low threshold may not sufficiently reduce the model set, M, to prevent overfitting; therefore, a second round of culling is performed based on model diversity. BOTL-C.II measures the diversity between transferred models using Mutual Information (MI). MI allows models that obtain similar predictions on the current window of data to be identified [7]. If two transferred models have a high MI, using both models in the meta-learning process will provide little benefit to the target learner, as a high MI indicates the predictions of the two models on the current window of data are highly correlated. Therefore, no additional knowledge is provided to the target learner by keeping both in the model set. If two transferred models have an MI greater than λcMI, BOTL-C.II culls the model that performs worse. This enables redundant models to be removed from the model set, helping to prevent the OLS meta-learner overfitting. A high λcMI should be selected as the window of locally available data is often small; therefore, if a complex concept is to be learnt, the target learner may benefit from utilising knowledge transferred from similar concepts. However, if this threshold is too high, the model set, M, will not be reduced sufficiently to prevent overfitting.

Algorithm 2
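The two-stage culling of BOTL-C.II described above can be sketched as follows (a hedged approximation: scikit-learn's R2 scoring and k-NN-based MI estimator stand in for the paper's exact performance and Mutual Information computations, and λcperf and λcMI appear as perf_thresh and mi_thresh):

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.feature_selection import mutual_info_regression

def cull_models(models, X_window, y_window, perf_thresh, mi_thresh):
    """Drop poor performers, then drop the weaker of any pair of models
    whose predictions on the window are largely redundant (high MI)."""
    preds = {m: m.predict(X_window) for m in models}
    scores = {m: r2_score(y_window, preds[m]) for m in models}
    kept = [m for m in models if scores[m] >= perf_thresh]  # performance cull
    kept.sort(key=lambda m: scores[m], reverse=True)        # best models first
    survivors = []
    for m in kept:  # diversity cull: keep the better model of redundant pairs
        redundant = any(
            mutual_info_regression(preds[m].reshape(-1, 1), preds[s])[0]
            > mi_thresh
            for s in survivors)
        if not redundant:
            survivors.append(m)
    return survivors
```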

Culling thresholds, λcperf and λcMI, could be updated as the data stream progresses using cross-validation, allowing alternative culling parameter values to be compared during the meta-learning process. However, due to the online nature of the data streams, instances within a window cannot be considered independent; therefore, the i.i.d. assumption cannot be made [13], as three consecutive instances in the window, xt− 1, xt, and xt+ 1, are likely to be dependent. Therefore, any validation set created from the window has some dependence on the training set. This can cause cross-validation to provide an overestimate of the performance of culling parameters. Additionally, using cross-validation for this purpose would require pk models to be trained and validated every time the meta-learner is updated, where p is the number of culling parameter values compared, and k is the number of folds. As the BOTL framework is to be used in domains with concept drifts, the meta-learner must be updated regularly. Therefore, the use of cross-validation would significantly increase the computation and storage requirements of the BOTL framework, while overestimating the performance of the culling parameters considered. This would limit the use of BOTL in applications that require on-device learning, or have limited computational resources. Within this paper, naïve culling approaches are used where the values of culling parameters are chosen in advance, removing the need for cross-validation.

4.3 Initialisation

For any underlying concept drift detection algorithm, an initial window of data, W, is required to create the first predictive model, \({f^{T}_{1}\!}\). Prior to obtaining this data, no predictions can be made as no local knowledge has been learnt. BOTL allows models transferred from other domains to be used to make predictions during this period. Models transferred are initially weighted equally to obtain:

$$ {\hat{y}_{t}} = \frac{1}{|{M}|}\sum\limits_{j=1}^{|{M}|}{f^{S}_{j}}(x_{t}). $$
(2)

Before the first target model, \({{f^{T}_{1}\!}}\), has been learnt, and while only a small amount of data has been observed, the OLS regressor can create a model, FM, using only source models, \({f^{S}_{j}}\). This approach is prone to overfitting due to the small amount of data available but may be preferred over making no predictions or using Eq. 2 over the entire initial window of data.

The BOTL-C variants help reduce overfitting within this initial period; however, as the amount of available data is small, all transferred models may have R2 performances below the culling threshold. In this scenario, both BOTL-C variants select the best k transferred models, where k < |W|, regardless of the culling threshold. In this paper, we select k = 3.
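As a sketch, initialisation might proceed as follows (the best-k fallback mirrors the behaviour described above; the model and data handling are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import r2_score

def initial_predict(transferred, x):
    """Equal-weight average of transferred models' predictions (Eq. 2),
    used before any local model or window of data is available."""
    return float(np.mean([m.predict(np.atleast_2d(x))[0] for m in transferred]))

def best_k_models(transferred, X_window, y_window, k=3):
    """Early-stream fallback for the BOTL-C variants: keep the k
    best-performing transferred models regardless of the culling threshold."""
    ranked = sorted(transferred,
                    key=lambda m: r2_score(y_window, m.predict(X_window)),
                    reverse=True)
    return ranked[:k]
```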

5 BOTL loss

Theorem 1

BOTL has a squared loss less than or equal to the model learnt locally using a concept drift detection algorithm with no knowledge transfer:

$$ {\mathscr{L}({{f^{T}_{i}\!}})} \geq {\mathscr{L}({F^{M}})}, $$
(3)

where \({{{\mathscr{L}}({{f^{T}_{i}\!}})}}\) denotes the squared loss of the local model, \(f^{T}_{i}\), created using a concept drift detection algorithm, and \({{\mathscr{L}}({F^{M}})}\) is the squared loss of the OLS meta-learner, FM, created using the set of k models transferred from the source, \(\{{f^{S}_{1}\!},\dots ,{f^{S}_{k}\!}\}\) and the current target model, \(f^{T}_{i}\).

Proof

We measure loss over the local window of data, W, using the mean squared error of predictions:

$$ {\mathscr{L}({\cdot})} = \frac{1}{|{W}|}\sum\limits_{t=1}^{|{W}|}\left( {y_{t}}-{\hat{y}_{t}}\right)^{2}, $$
(4)

where yt is the response variable for instance xt, and \({\hat {y}_{t}}\) is the predicted value. If no transfer is used, the local model, \({f^{T}_{i}\!}\), is used to predict \({\hat {y}_{t}}\) for each instance xt such that \({\hat {y}_{t}} = f^{T}_{i}(x_{t})\).

BOTL uses the set of models, M, to obtain predictions \({\hat {y}_{t}^{\prime {{T}}_{i}}}\) and all \({\hat {y}_{t}^{\prime {{S}}_{j}}}\) for instance xt, using the locally learnt model, \({f^{T}_{i}\!}\), and each of the k transferred models, \({{f^{S}_{j}}\in \{{f^{S}_{1}\!},\dots ,{f^{S}_{k}\!}\}}\), respectively.

Predictions are used to create a meta-instance, \(x_{t}^{\prime }\), which the OLS meta-learner, FM, uses to obtain an overarching prediction:

$$ \begin{array}{@{}rcl@{}} {\hat{y}_{t}}&=& {F^{{M}}}\left( x_{t}^{\prime}\right)\\ & =& {F^{{M}}}\left( \langle{f^{S}_{1}\!}(x_{t}),\dots,{f^{S}_{k}\!}(x_{t}),{f^{T}_{i}\!}(x_{t})\rangle\right)\\ & =& {F^{{M}}}\left( \langle{\hat{y}_{t}^{\prime{{S}}_{1}}},\dots, {\hat{y}_{t}^{\prime{{S}}_{k}}},{\hat{y}_{t}^{\prime{{T}}_{i}}}\rangle\right), \end{array} $$
(5)

where

$$ {F^{{M}}}\left( x_{t}^{\prime}\right) = w_{0} + \sum\limits_{j=1}^{j=k}w_{j}{\hat{y}_{t}^{\prime{{S}}_{j}}} + w_{(k+1)}{\hat{y}_{t}^{\prime{{T}}_{i}}}. $$
(6)

Weights \(w_{0},\dots ,w_{(k+1)}\) are assigned to each prediction, \({\hat{y}_{t}^{\prime n}}\), for each model n in M, where |M| = (k + 1), to obtain an ensemble prediction, \({\hat {y}_{t}}\), for instance xt by solving the optimisation problem that minimises the squared error of FM:

$$ \min\limits_{w_{0},\dots,w_{(k+1)}}\sum\limits_{t=1}^{|{W}|} \left( {y_{t}} - \left( w_{0}+\sum\limits_{j=1}^{j=k}w_{j} {\hat{y}_{t}^{\prime{{S}}_{j}}} +w_{(k+1)}{\hat{y}_{t}^{\prime{{T}}_{i}}}\right)\right)^{2}. $$
(7)

FM is used to make predictions, \({\hat {y}_{t}}\), for instance xt, using Eq. 6. Using Eq. 4, we can rewrite the loss of FM as:

$$ {\mathscr{L}({F^{M}})}=\frac{1}{|{W}|}\sum\limits_{t=1}^{|{W}|}\left( {y_{t}}- \left( w_{0}+\sum\limits_{j=1}^{j=k}w_{j}{\hat{y}_{t}^{\prime{{S}}_{j}}} +w_{(k+1)}{\hat{y}_{t}^{\prime{{T}}_{i}}}\right)\right)^{2}. $$
(8)

If we constrain the optimisation problem in Eq. 7 by fixing the weights such that the weight associated with the locally learnt model, \({f^{T}_{i}\!}\), is 1, while all others are 0, we obtain the constrained meta-model:

$$ {F^{M}}^{\ast}\left( x_{t}^{\prime}\right) = \left( 0+\sum\limits_{j=1}^{j=k}0{\hat{y}_{t}^{\prime{{S}}_{j}}} +1{\hat{y}_{t}^{\prime{{T}}_{i}}}\right), $$
(9)

giving the loss function:

$$ {\mathscr{L}({{F^{M}}^{\ast}})} = \frac{1}{|{W}|}\sum\limits_{t=1}^{|{W}|}\left( {y_{t}}- {\hat{y}_{t}^{\prime{{T}}_{i}}}\right)^{2}, $$
(10)

equivalent to only using the locally learnt model, \({{\mathscr{L}}({{F^{M}}^{\ast }})} = {{\mathscr{L}}({{f^{T}_{i}\!}})}\). As the fixed weights of \({F^{M}}^{\ast}\) are a feasible solution to the convex optimisation problem in Eq. 7, its minimiser, FM, achieves a loss that is no greater:

$$ {\mathscr{L}({{F^{M}}^{\ast}})} \geq {\mathscr{L}({F^{M}})}. $$
(11)

Finally, as the constrained meta-model, \({F^{M}}^{\ast}\), in Eq. 9 is equivalent to using only the locally learnt model, \({f^{T}_{i}\!}\), the loss of BOTL is less than or equal to the loss of the locally learnt model. □

6 Experimental set-up

Many benchmark datasets have been created to evaluate concept drift detection algorithms [16, 27, 28]; however, most are categorically labelled. In order to evaluate BOTL in a regression setting, we present a modification to the benchmark drifting hyperplane dataset [20]. Additionally, a simulation of a smart home heating system was created using data from a UK weather station to derive desired heating temperatures for a user. The use of such data enables BOTL to be evaluated on data streams containing drifts that are typical within real-world environments. Finally, we evaluate the performance of BOTL using a following distance dataset, created from vehicular data, and used to predict the Time To Collision (TTC). An overview of the dataset characteristics is shown in Table 2.

Table 2 Dataset characteristics

6.1 Drifting hyperplane

For this benchmark data generator, an instance at time t, xt, is a vector, \(x_{t} = \{ {x_{t_{1}}}, {x_{t_{2}}},\dots ,{x_{t_{n}}}\}\), containing n randomly generated, uniformly distributed variables, \({x_{t_{i}}}\in [0,1]\). For each instance, xt, a response variable, yt ∈ [0,1], is created using the function \({y_{t}} = ({x_{t_{p}}}+{x_{t_{q}}}+{x_{t_{r}}})/3\), where p, q, and r reference three of the n variables of instance xt. This function represents the underlying concept, ca, to be learnt and predicted. Concept drifts are introduced by modifying which features are used to create y. For example, an alternative concept, cb, may be represented by function \({y_{t}} = ({x_{t_{u}}}+{x_{t_{v}}}+{x_{t_{w}}})/3\), where {p,q,r}≠{u,v,w} such that ca≠cb. We introduce uniform noise, ± 0.05, by modifying yt for each instance xt with probability 0.2.

A variety of drift types have been synthesised in this generator including sudden drift, gradual drift and recurring drifts. A sudden drift from concept ca to concept cb is encountered immediately between time steps t and t + 1 by changing the underlying function used to create yt and yt+ 1. A gradual drift from concept ca to cb occurs between time steps t and t + m, where m instances of data are observed during the drift. Instances of data created between t and t + m use one of the underlying concept functions to determine their response variable. The probability of an instance belonging to concept ca decreases proportionally to the number of instances seen after time t while the probability of it belonging to cb increases as we approach t + m. Recurring drifts are created by introducing a concept cc that reuses the underlying function defined by a previous concept, ca, such that we achieve conceptual equivalence, cc = ca. Datasets generated in this way, containing uniform noise, are denoted by SuddenA and GradualA for sudden and gradual drifting data streams respectively.
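A generator for the sudden-drift variant might look as follows (a minimal sketch: the concept feature triples and concept length are illustrative parameters, while the ±0.05 uniform noise applied with probability 0.2 matches the description above; gradual and recurring drifts would extend this with probabilistic concept mixing and repeated triples):

```python
import numpy as np

def drifting_hyperplane(n_features=10, concepts=((0, 1, 2), (3, 4, 5)),
                        concept_len=500, noise_prob=0.2, seed=0):
    """Yield (x, y) pairs with a sudden drift at each concept boundary,
    where concept (p, q, r) defines y = (x_p + x_q + x_r) / 3."""
    rng = np.random.default_rng(seed)
    for p, q, r in concepts:
        for _ in range(concept_len):
            x = rng.uniform(0.0, 1.0, n_features)
            y = (x[p] + x[q] + x[r]) / 3.0
            if rng.random() < noise_prob:        # uniform noise, +/- 0.05
                y += rng.uniform(-0.05, 0.05)
            yield x, y
```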

We also create variations of the drifting hyperplane datasets that introduce problems that may be encountered when using BOTL in real-world environments. The first variation simulates sensor failure. In this scenario, a feature, i, is set to 0 from time t for the remainder of the data stream with probability 0.001, such that \({x_{t_{i}}} = 0\). In the scenario where feature i is used to create the response variable y, we modify two other features, j and k, such that \({x_{t_{j}}} = {x_{t_{i}}}/4\) and \({x_{t_{k}}} = 3{x_{t_{i}}}/4\), where \({x_{t_{i}}}\) is the value feature i would have taken had it not failed. This ensures that the underlying concept can still be learnt from the data. We denote datasets generated in this way as SuddenB and GradualB for sudden and gradual drifting data streams.

The second variation simulates intermittent sensor failure, where once feature i has been selected to fail, its value at each time step t is set to 0 with probability 0.3, such that \({x_{t_{i}}} = 0\). Datasets generated using this scenario are denoted as SuddenC and GradualC for sudden and gradual drifting data streams respectively.

The third variation emulates the deterioration of a sensor. Sensor deterioration is captured by including noise that depends on the time step t, such that \({x_{t_{i}}} = {x_{t_{i}}} \pm (0.2(t/|\upchi |))\), where 0.2 is the maximum amount of noise added to feature \({x_{t_{i}}}\) and |χ| is the number of instances in the dataset. This means that as the data stream progresses, more noise is added to an individual feature, simulating the gradual deterioration in accuracy of a sensor over time. Additionally, the probability of a sensor deteriorating increases as the data stream progresses, such that the probability of a feature being selected for deterioration at time t is 0.001(t/|χ|). Datasets generated in this way are denoted as SuddenD and GradualD for sudden and gradual drifting data streams respectively.

6.2 Heating simulation

A simulation of a smart home heating system was created, deriving the desired room temperature of a user. Heating temperatures were derived using weather data collected from a weather station in Birmingham, UK, from 2014 to 2016. This dataset contained rainfall, temperature and sunrise patterns, which were combined with a schedule, obtained from sampling an individual’s pattern of life, to determine when the heating system should be engaged. The schedule was synthesised to vary the desired heating temperature based on time of day, day of week and external weather conditions, creating complex concepts. To create multiple domains, weather data was sampled from overlapping time periods and used as input to the synthesised schedule to determine the desired heating temperatures. Due to the dependencies on weather data, each stream was subject to large amounts of noise. Concept drifts were introduced manually by changing the schedule; however, drifts also occurred naturally due to changing weather conditions. By sampling weather data from overlapping time periods, and due to seasonality, data streams follow similar trends, ensuring predictive performance can benefit from knowledge transfer. By using complex concepts, dependent on noisy data, the evaluation of BOTL on this data is more indicative of what is achievable when used in real-world environments.

6.3 Following distance

This dataset uses a vehicle’s following distance and speed to calculate TTC when following another vehicle. Vehicle telemetry data such as speed, gear position, brake pressure, throttle position and indicator status, alongside sensory data that infer external conditions, such as temperature, headlight status and windscreen wiper status, were recorded at a sample rate of 1 Hz. Additionally, some signals such as vehicle speed, brake pressure and throttle position were averaged over a window of 5 seconds to capture a recent history of vehicle state. Vehicle telemetry and environmental data can be used to predict TTC and used to personalise vehicle functionalities such as ACC by identifying the preferred following distance, reflecting current driving conditions. Data was collected from 4 drivers for 17 journeys which varied in duration, collection time and route. Each journey is considered to be an independent domain and BOTL enables knowledge to be learnt and transferred across journeys and between drivers. Each data stream is subject to concept drifts that occur naturally due to changes in the surrounding environment such as road types and traffic conditions.

7 Drift detection strategies

BOTL relies on a concept drift detection algorithm, local to each domain. Although any sliding window-based drift detection strategy can be used, it is desirable for the chosen strategy to learn as few models as possible to represent concepts in the data stream. Limiting the number of models learnt in each domain reduces the number of input features to the OLS meta-learner, helping to prevent overfitting caused by the curse of dimensionality [8]. Additionally, reducing the number of models needing to be transferred across domains reduces the communication and computational overhead of combining knowledge, which may impact the feasibility of using BOTL in real-world applications that require on-device learning.

Drift detection strategies that employ single model-based approaches, instead of ensemble techniques, reduce the number of models used to represent a single concept. If an ensemble-based drift detector was used, such as DWM [19], or Adaptive Windowing Online Ensemble (AWOE) [29], the knowledge learnt to represent a single concept may be encompassed across multiple models in the ensemble. Therefore, all models, and their ensemble weights, would need to be transferred across domains. Additionally, the number of models learnt in each domain can be reduced by allowing the reuse of previously learnt models when concepts reoccur [33]. To achieve this, a history of models can be retained to prevent redundant models being learnt locally, and transferred across domains.

We consider three concept drift detection techniques to underpin BOTL. Firstly, we adapt RePro [34] to a regression setting; secondly, we apply ADWIN [3]; and thirdly, we propose a new concept drift detection algorithm, Adaptive Windowing with Proactive drift detection (AWPro), which combines elements of RePro and ADWIN. Each of these drift detection strategies is dependent on user-defined parameters, which require domain expertise to select appropriate values. Within this section, we discuss the impact of user-defined parameters on each concept drift detection strategy, and how this affects their applicability as the drift detection algorithm within domains for BOTL. To highlight this, we evaluate RePro, ADWIN and AWPro on synthetic drifting hyperplane datasets, generated with uniform noise, containing sudden and gradual drifts, simulated smart home heating data, and real-world vehicle following distance data.

7.1 RePro

We consider an adaptation of RePro [34] for regression as an underlying drift detector. Although RePro requires domain expertise to select appropriate parameter values, including window size, Wmax, drift threshold, λd, and loss threshold, λl, it encapsulates key characteristics that allow few models to be learnt in each domain. RePro is a sliding window-based detection algorithm that learns a single model for the current concept [33]. Additionally, RePro prioritises the reuse of existing models over learning new models by retaining a history of previously learnt models, HT, and concept transitions, TMT, to proactively determine which concept is likely to occur next [34].

RePro was initially developed specifically for classification tasks; therefore, modifications are required for regression settings, shown in Algorithm 3. The original RePro algorithm detects drifts by measuring the target model's classification accuracy across the sliding window, W. When the classification accuracy drops below an error threshold, a drift is detected [34]. If the window is full, |W| = Wmax, but the classification accuracy does not drop below the error threshold, the sliding window is maintained by discarding one incorrectly classified instance, and all subsequent correctly classified instances. To apply RePro to regression, the sliding window must be maintained; however, the notion of a correctly classified instance must be altered as small inaccuracies are inevitable in regression settings due to noise. To overcome this, ε-insensitivity can be used, allowing for a small margin of error between the prediction and response variable. To maintain a sliding window (lines 11–14), we introduce a loss threshold, λl, that allows instance x(t−|W|) to be discarded from the window if the predicted value, \({\hat {y}_{(t-|{W}|)}^{\prime }}\), satisfies:

$$ |{\hat{y}_{(t-|{W}|)}^{\prime}}-{y_{(t-|{W}|)}}| \leq {\lambda_{l}}. $$
Algorithm 3

The R2 performance of the target model, \({f^{T}_{i}\!}\), across W is used to detect drifts (line 8). A drift is said to have occurred when the performance of the target model drops below a predefined drift threshold, λd, akin to observing the classification accuracy dropping below an error threshold.
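One step of this regression adaptation can be sketched as follows (a hedged simplification of Algorithm 3, covering only the window maintenance and R2 drift check, and omitting the model-reuse logic of Algorithm 4; a scikit-learn-style model is assumed):

```python
from collections import deque
from sklearn.metrics import r2_score

def repro_step(model, window, x, y, w_max, lambda_d, lambda_l):
    """Append (x, y) to the sliding window (a deque), flag a drift when the
    model's R^2 over the window drops below lambda_d, and otherwise shrink
    a full window by discarding old, accurately predicted instances."""
    window.append((x, y))
    xs = [xi for xi, _ in window]
    ys = [yi for _, yi in window]
    drift = len(window) >= w_max and r2_score(ys, model.predict(xs)) < lambda_d
    while not drift and len(window) > w_max:
        x_old, y_old = window[0]
        if abs(model.predict([x_old])[0] - y_old) <= lambda_l:
            window.popleft()  # epsilon-insensitive: prediction was close enough
        else:
            break
    return drift
```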


The original formulation of RePro used the notion of a stable learning size, specifying how much data is required to learn a stable model. This was necessary for the simulated classification tasks presented by Yang et al. [34] as small window sizes were required to allow drifts to be detected quickly. However, this meant that insufficient instances were available in the window to learn a model that adequately represented the current concept [33]. Yang et al. suggest a stable learning size of 3Wmax [34]. As real-world environments are often considerably more noisy than simulated or synthetic environments, using a small window size can cause drifts to be falsely detected; therefore, a larger window size is necessary [3]. Increasing the window size also increases the stable learning size; however, if the stable learning size is increased, the data used to create a model may encapsulate multiple underlying concepts. To overcome this challenge, our adaptation of RePro for regression defines a stable model to be one that is learnt from Wmax instances and is used to make predictions over 2Wmax instances without a drift being detected.

To proactively determine future concepts, RePro maintains a transition matrix, TMT, to determine the likelihood of encountering a recurring concept. To prevent the reuse of unstable models that make poor predictions, only those that are considered to be stable are added to the transition matrix. If the transition matrix indicates that it is equally likely that two or more concepts may be encountered next, RePro evaluates the performance of each model on the current window of data and selects the model with the highest accuracy. If the transition matrix does not indicate a likely successor concept, each historical model is considered for reuse. A new model is only learnt when all historical models perform worse than the drift threshold λd, as shown in Algorithm 4.

Algorithm 4

7.1.1 Parameter selection

The characteristics of RePro as a drift detection strategy are desirable for BOTL; however, the selection of parameter values, Wmax, λd and λl, may not be intuitive. Given an online data stream, trade-offs must be considered for each parameter. For example, selecting a large window size, Wmax, allows more data to be retained to build local target models, increasing their accuracy and stability [2]. However, a window size that is too large may cause RePro to react slowly to concept drifts, and may retain data from multiple concepts, preventing a model from being created to represent each concept independently. Alternatively, using a small window size may allow RePro to react quickly to drifts as a smaller window size may encapsulate a sample of data that is more representative of the current distribution of the data stream. However, selecting a window size that is too small may prevent a representative sample being retained to build a model that effectively represents the current concept, reducing the overall performance of RePro.

The drift and loss thresholds, λd and λl, determine RePro’s sensitivity to concept drift. Small drift thresholds and large loss thresholds decrease RePro’s sensitivity to concept drifts as small λd values allow the performance of a model to greatly decrease before a drift is detected, while large λl values allow instances to be removed from the sliding window while a model’s predictive error is high.

Large λd and small λl values increase RePro's sensitivity to drifts. As RePro becomes more sensitive to concept drifts, it also becomes more likely that the window of data, W, used to build a model for the newly encountered concept contains instances belonging to the previous concept. Models built using data belonging to both the previous and new concepts exhibit high predictive errors. As RePro monitors the model performance to detect drifts, this can cause RePro to repeatedly detect drifts and create unstable models immediately after a concept drift, and during periods of gradual drift. The repeated creation of unstable models increases computation; however, these models are not added to the transition matrix, TMT, or the model history, HT, and therefore do not greatly impact the overarching performance of RePro across the data stream and do not impact the communicational overhead of knowledge transfer. Due to this, selecting values for λd that are too large, and values for λl that are too small, may prevent RePro from creating stable models that can be added to the model history, as drifts are falsely detected in the presence of noise.

7.2 ADWIN

ADWIN, presented by Bifet et al. [3], detects drifts by monitoring changes in the distribution of a data stream. For use in a regression setting, the distribution of predictive error is monitored across a sliding window. Instead of using a fixed length sliding window, the size of the window is determined according to the rate of change observed in the online data stream [3].

ADWIN operates on the principle that if two large enough sub-windows have distinct enough means, the expected values within each sub-window will differ [3, 9]. A drift is said to be detected when:

$$ |{{{\hat{\mu}}_{{W}_{0}}}}-{{{\hat{\mu}}_{{W}_{1}}}}| \geq \varepsilon_{\text{cut}}, $$
(12)

where \({{\hat {\mu }}_{{W}_{0}}}\) and \({{\hat {\mu }}_{{W}_{1}}}\) are the means of sub-windows W0 and W1, and εcut is defined by the Hoeffding bound:

$$ \varepsilon_{\text{cut}} = \sqrt{\frac{1}{2m}\cdot\ln\frac{4|{W}|}{{\delta}}}, $$
(13)

where m is the harmonic mean of the sub-windows, \(m = \frac {2}{1/|{W}_{0}|+1/|{W}_{1}|}\), and δ is a confidence value, defined by the user, which determines the sensitivity of drift detection [3, 9].
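A direct, if naive, rendering of this test is sketched below (real ADWIN implementations use exponential histograms to avoid checking every split; here the window is assumed to hold the model's absolute prediction errors):

```python
import math

def eps_cut(n0, n1, n, delta):
    """Hoeffding-bound threshold of Eq. 13 for sub-windows of sizes n0, n1."""
    m = 2.0 / (1.0 / n0 + 1.0 / n1)  # harmonic mean of sub-window sizes
    return math.sqrt((1.0 / (2.0 * m)) * math.log(4.0 * n / delta))

def adwin_drift(window, delta):
    """Check every split point; a drift is detected when the sub-window
    means differ by at least eps_cut (Eq. 12). Returns the cut index."""
    n = len(window)
    for i in range(1, n):
        w0, w1 = window[:i], window[i:]
        mu0, mu1 = sum(w0) / len(w0), sum(w1) / len(w1)
        if abs(mu0 - mu1) >= eps_cut(len(w0), len(w1), n, delta):
            return i  # keep W = W1 = window[i:]
    return None
```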

ADWIN can be used by BOTL, as presented in Algorithm 5, to detect drifts within the data stream by monitoring the distribution of the predictive error of the locally learnt model, \({f^{T}_{i}\!}\):

$$ |{f^{T}_{i}\!}(x_{t})-{y_{t}}|. $$
Algorithm 5

Once a drift is detected, a new model is learnt locally, \({f^{T}_{i+1}\!}\), using the second sub-window such that W = W1 (lines 9–15). Monitoring the distribution of predictive error allows drifts to be detected rapidly. However, if two consecutive concepts are dissimilar, drifts are frequently detected when only a small number of instances from the new concept have been observed; therefore, the data contained in the sliding window, W, after a drift is detected is unlikely to be representative of the new underlying concept. To address this, our implementation of ADWIN for BOTL only creates a new model, \({f^{T}_{i+1}\!}\), once Wmin instances belonging to the new concept have been observed, such that |W| = Wmin (lines 12–15). Until sufficient data has been observed to build a model that adequately represents the new concept, the previously learnt model, \({f^{T}_{i}\!}\), must continue to be used to make predictions.

7.2.1 Parameter selection

Using ADWIN in this way requires two user-defined parameters, the minimum window size, Wmin, and the confidence value, δ. Similarly to RePro, domain expertise is required to select these parameter values.

As ADWIN uses a dynamic sliding window, the minimum window size parameter, Wmin, does not directly impact ADWIN’s ability to detect concept drifts. Instead Wmin is only used to determine how much data should be retained to build a model that adequately represents the current concept [2]. Similar to RePro’s Wmax parameter, large values of Wmin allow more data to be made available to the local target learner, \({f^{T}_{i}\!}\), creating an accurate and stable model [2]. However, if Wmin is too large, the data contained within the window may encapsulate data from multiple concepts, preventing individual models being learnt for each concept. Small values of Wmin ensure the data used to build a model is representative of the current distribution of data.

ADWIN detects drifts by monitoring the distribution of predictive error; therefore, Wmin can indirectly affect ADWIN's ability to correctly detect concept drifts and its overarching performance. If Wmin is too small, it may cause a model to be learnt that overfits and has high predictive error. As ADWIN monitors the change in distribution of predictive error, using a model that initially has a high predictive error may prevent or delay the detection of a concept drift as no significant change in the distribution of predictive error is observed during periods of drift. However, large Wmin values increase the number of instances observed before a new model can be learnt, prolonging the use of the previously learnt model, decreasing the overarching predictive performance of ADWIN across the data stream.

The confidence value, δ, is used to determine ADWIN's sensitivity to concept drifts through the εcut threshold (Eqs. 12 and 13). High values of δ increase drift sensitivity; however, in noisy data streams, this can cause drifts to be falsely detected due to the increased variability of sub-window means, \({{{\hat {\mu }}_{{W}_{0}}}}\) and \({{{\hat {\mu }}_{{W}_{1}}}}\). To overcome this, lower values of δ can be chosen; however, this may prevent drifts from being detected in domains containing similar consecutive concepts, or slow gradual drifts, where the sub-window means do not change greatly.

7.3 AWPro

The use of RePro as a concept drift detection algorithm can be computationally demanding due to the creation of high volumes of unstable models, caused by its inability to detect the precise point of drift within the sliding window. ADWIN allows this point to be identified by splitting the sliding window into two sub-windows where the first sub-window contains instances belonging to the old concept, which can be discarded, while the second sub-window contains instances belonging to the new concept. However, if the number of remaining instances in the second sub-window is small, ADWIN must wait until sufficient instances have been observed before a new model can be learnt, negatively impacting the performance of ADWIN. Additionally, ADWIN does not reuse previously learnt models; therefore, models must be re-learnt for recurring concepts. This increases the number of models transferred between domains when ADWIN is used as the underlying concept drift detection algorithm for BOTL, and prevents previously learnt models from being used to make predictions when few instances from a recurring concept have been observed.

To reduce computation from creating unstable models, while also preventing duplicate models being learnt for recurring concepts, we introduce an alternative concept drift detection strategy, AWPro, presented in Algorithm 6, which combines desirable characteristics from ADWIN and RePro that better suit the BOTL framework.

Algorithm 6

AWPro uses ADWIN to monitor the change in distribution of predictive error, allowing the drift detection strategy to partition instances belonging to different concepts within a dynamic sliding window. Concept drifts are identified using Eqs. 12 and 13, which use a confidence value, δ, to determine the sensitivity to changes in the distribution of the predictive error (line 10). Once a model is learnt, RePro is used to identify stable models, which are retained in the model history, HT, and the transition between concepts is added to the transition matrix, TMT (lines 25–26). A stable model is one that is used to make predictions over 2Wmin instances without a drift being detected.

When a concept drift is encountered, AWPro drops the first sub-window of instances, W0, such that all instances in the window belong to the new concept, W = W1. If the remaining data in the window is less than \(\frac {1}{2}{W_{{min}}}\) instances, a temporary model is created (lines 13–15). Although these temporary models are akin to unstable models learnt using RePro, the window used to build these models only contains instances belonging to the new concept. Having few instances available increases the likelihood of learning a model that is not representative of the entire concept; however, this may be preferable to ADWIN’s approach of continuing to use a model that represents the previous concept, or RePro’s approach where a model may be learnt from data belonging to both concepts. Once a temporary model has been learnt, incoming instances continue to be added to the window.

If \(\frac {1}{2}{W_{{min}}}\) or more instances have been observed after a drift, the proactive nature of RePro is used by AWPro to determine if an existing model can be reused to represent the current concept (lines 16–20). This allows AWPro to identify an existing model, using Algorithm 7, that has already been learnt to be used for predictions prior to a full window of instances being observed. AWPro uses a recurrence threshold, λr, to determine whether an existing model can be reused; λr acts in the same way as the drift threshold, λd, defined by RePro, when considering the reuse of existing models. If a model's R2 performance is greater than the recurrence threshold, λr, it is reused, and the process of detecting concept drifts through monitoring changes to the distribution of predictive error is resumed. However, if no such model exists, the use of the temporary model continues until Wmin instances of the new concept have been observed.

Finally, if Wmin instances have been observed after a concept drift, and a temporary model is still being used to make predictions, AWPro uses the transition matrix and historical models to identify existing models that could be reused now that a more representative sample of data is contained within the window. If no existing model exceeds the recurrence threshold, λr, a new model is learnt using the Wmin most recently observed instances.
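The post-drift behaviour of AWPro can be sketched as follows (a hedged condensation of Algorithms 6 and 7: learn and r2_on_window are assumed callbacks for building a model from a window and scoring a model's R2 on it, and the returned tag is illustrative bookkeeping):

```python
def awpro_after_drift(window, history, w_min, lambda_r, learn, r2_on_window):
    """Decide what to do with the post-drift window W = W1: build a
    temporary model, reuse a historical model, or learn a new model."""
    n = len(window)
    if n < w_min // 2:
        return learn(window), 'temporary'        # little data: stop-gap model
    reusable = [f for f in history if r2_on_window(f, window) > lambda_r]
    if reusable:                                 # recurring concept detected
        best = max(reusable, key=lambda f: r2_on_window(f, window))
        return best, 'reused'
    if n >= w_min:                               # enough data for a new model
        return learn(window), 'new'
    return None, 'keep-temporary'                # wait for more instances
```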

7.3.1 Parameter selection

AWPro relies on three user-defined parameters: the confidence value, δ, the window size, Wmin, and the recurrence threshold, λr. As AWPro adopts ADWIN’s approach to detecting concept drifts, many of the challenges of parameter selection specified with respect to ADWIN for the confidence value, δ, and the window size, Wmin, are also applicable to AWPro. However, instead of waiting for Wmin instances to be observed after a drift is detected, AWPro uses the recurrence threshold, λr, to determine if an existing model can be reused.

Parameter λr affects AWPro in two ways. If large values are chosen for λr, the likelihood of reusing a historical model decreases, as a historical model must exhibit low predictive error in order to be selected for reuse. Therefore, high λr values will increase the number of models learnt by AWPro. To increase the reuse of models in the presence of recurring concepts, smaller values should be chosen for λr. However, if λr is too small, an existing model may be reused for a concept that has not previously been encountered, lowering the overarching predictive performance of AWPro. This may hinder the detection of concept drifts as the predictive error across the sliding window of data may initially be high; therefore, identifying concept drifts through monitoring changes to the distribution of predictive error becomes challenging.

7.4 Impact of parameter values

In order to investigate how BOTL is impacted by the parameters defined by each drift detection strategy, we consider the performance of the underlying drift detectors in addition to the number of both stable and unstable models created. Parameter values should be chosen with the aim of maximising the performance of the underlying drift detector while reducing the number of stable and unstable models, thereby minimising unnecessary computation and reducing communication overheads.

The first parameter considered was the loss threshold, λl, used by RePro, which determines how close a prediction must be to the response variable for the corresponding instance to be discarded from the sliding window. As the hyperplane datasets are synthetic, with response variables in the range [0,1], we used λl = 0.01, allowing for a 1% error in predictions. The heating simulation and following distance datasets do not have a definitive range for their respective response variables; therefore, a percentage error could not be used. As these are examples of BOTL being used by a user-facing application, λl was selected by considering errors that would not be noticeable to a user. For the heating simulation datasets, we used λl = 0.5, as a prediction error of 0.5 °C would not be detectable by an individual. Similarly, we used λl = 0.1 for the following distance datasets, allowing for a predictive error of 0.1 s. These loss thresholds are used by RePro throughout the remainder of this paper.
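For reference, these thresholds can be collected in a single configuration; the dictionary below simply restates the values above, and the key names are our own.

```python
# Loss thresholds, λl, for RePro (values from the text; naming is ours).
LOSS_THRESHOLDS = {
    "hyperplane": 0.01,           # responses in [0, 1] -> 1% error tolerance
    "heating_simulation": 0.5,    # a 0.5 °C error is imperceptible to a user
    "following_distance": 0.1,    # a 0.1 s error in the predicted following distance
}
```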

Drift sensitivities and window sizes must also be chosen for RePro, ADWIN and AWPro. To consider how predictive performance and the number of models created are affected by parameter values, we varied both drift sensitivity and window size for each drift detection strategy.

Figure 1 displays results of varying drift sensitivities for each drift detection strategy. These results used a fixed window size for each dataset type, representative of the results presented in Fig. 2, which were obtained by varying window size. Figure 1 uses a window size of 30 instances for hyperplane datasets, 480 instances for heating simulation datasets and 90 instances for following distance datasets.

Fig. 1

Performance and number of stable models created by RePro, ADWIN and AWPro with varying drift sensitivities (drift threshold, λd, confidence value, δ, and confidence value, δ, respectively). Annotated with the total number of models learnt and percent that are considered stable. Window sizes of 30 are used for Sudden and Gradual drifting hyperplane datasets, 480 for heating simulation datasets (capturing instances across a period of 10 days), and 90 for following distance datasets

Fig. 2

Performance and number of stable models created by RePro, ADWIN and AWPro with varying window sizes, Wmax for RePro and Wmin for ADWIN and AWPro. Annotated with the total number of models learnt and percent that are considered stable. Drift sensitivities of λd = 0.6, δ = 0.02 and δ = 0.02 have been used for RePro, ADWIN and AWPro respectively across all datasets

Figure 1 indicates that higher drift sensitivity values typically yielded higher performance across all drift detectors; however, the number of unstable models was also larger. Lowering the drift sensitivity introduced a slight decrease in performance but significantly reduced the number of unstable models, particularly in the case of RePro, shown in Fig. 1a; a trade-off between performance and the number of models created is therefore necessary.

If we are concerned solely with the performance of the underlying drift detector, then RePro obtained the best performance overall. RePro, shown in Fig. 1a, outperformed ADWIN, Fig. 1b, and AWPro, Fig. 1c, due to its drift detection mechanism. Unlike ADWIN and AWPro, RePro monitors the predictive performance of the current model, and detects drifts when its performance drops below the drift threshold, λd. This means poorly performing models are replaced as new instances of data are observed, until a model that achieves a performance greater than λd is learnt. ADWIN and AWPro only monitor the distribution of predictive error, regardless of how poorly the model performs; therefore, a poorly performing model will only be replaced when a change in the distribution of predictive error is observed.
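To make the distinction between the two detection mechanisms concrete, the sketch below contrasts the two triggers: RePro-style monitoring of the current model's performance against λd, and an ADWIN-style test for a change in the mean of the error stream. The ADWIN check is heavily simplified (a single midpoint cut with a Hoeffding-style bound, whereas real ADWIN tests every cut point and uses δ′ = δ/n), so it illustrates the mechanism rather than reproducing the algorithm.

```python
import numpy as np

def repro_style_drift(current_model_r2, lambda_d):
    """RePro-style trigger (simplified): drift whenever the current model's
    performance falls below the drift threshold, λd."""
    return current_model_r2 < lambda_d

def adwin_style_drift(errors, delta):
    """ADWIN-style trigger (simplified): drift if two halves of the error
    window have significantly different means."""
    n = len(errors)
    if n < 4:
        return False
    w0, w1 = np.asarray(errors[:n // 2]), np.asarray(errors[n // 2:])
    m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))           # harmonic mean of sub-window sizes
    eps_cut = np.sqrt(np.log(4.0 / delta) / (2.0 * m))  # Hoeffding-style bound
    return abs(w0.mean() - w1.mean()) >= eps_cut
```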

Although RePro uses a sliding window to capture the most recent instances, it does not detect the precise point at which a drift occurs within the window. This means RePro frequently builds models from windows containing instances belonging to both the previous and the new concept, causing unstable models to be learnt. As RePro monitors the performance of each model, unstable models are often created in quick succession during gradual drifts, or immediately after sudden drifts. This is highlighted in the annotations in Fig. 1, which show the percentage of models that were considered stable and useful for knowledge transfer. RePro created significantly more unstable models, wasting computation, which may make it infeasible in environments with limited computational resources. ADWIN and AWPro may be more applicable in such environments as they allow the precise point of drift to be identified using a dynamic sliding window, thereby reducing the number of unstable models.

Although ADWIN created fewer unstable models, it cannot reuse existing models in the presence of recurring concepts; therefore, a larger number of stable models were learnt. This is detrimental to BOTL as it increases the communication required for knowledge transfer. Additionally, BOTL relies upon a meta-learner to combine the knowledge transferred across domains. Transferring redundant or duplicate models across domains can negatively impact the overarching performance of BOTL, as it increases the number of input features to the meta-learner, increasing the likelihood of overfitting caused by the curse of dimensionality [8].

AWPro combines ADWIN’s drift detection strategy with RePro’s ability to prioritise the reuse of existing models. This combats communication overheads and reduces the risk of overfitting introduced by transferring redundant models. Although the percentage of models that were considered stable was higher on average for ADWIN, AWPro created both fewer stable and unstable models due to the prioritisation of reusing existing models when drifts are encountered. This makes AWPro more applicable to environments that require minimised computation and communication overheads.

By considering performance and the number of stable and unstable models learnt, we selected a drift sensitivity value λd = 0.5 for RePro as the performance obtained across all datasets remained high, and the percentage of models learnt that are considered stable increased drastically compared with using drift sensitivity values λd = 0.6 and λd = 0.7.

The percentage of models learnt by ADWIN that were considered stable varied little across drift sensitivity values within the hyperplane datasets; however, a drop in performance was observed when δ > 0.02 and δ > 0.002 for the sudden and gradual drifting datasets respectively. Additionally, an increase in the percentage of models considered stable was observed in the heating simulation datasets at δ = 0.02, while no significant change in performance, or in the number of stable or unstable models, was observable in the following distance datasets. Therefore, a drift sensitivity value of δ = 0.02 was selected for ADWIN.

As AWPro is based upon the drift detection mechanism used by ADWIN, observations in changes to performance with varying drift sensitivities were similar. To enable fair comparisons, we also used δ = 0.02 for AWPro.

In addition to the impact of drift sensitivity, we considered the selection of an appropriate window size. Figure 2 presents the results obtained when varying window size for each drift detection strategy. This parameter determines how much data is made available to learn a new model in the presence of concept drift, and how much data is retained in order to detect drifts. The results presented in Fig. 2 used a fixed drift sensitivity value for each drift detector, representative of the results presented in Fig. 1, and displayed in Table 3, to enable fair comparison.

Table 3 Window size and drift sensitivity parameters used by RePro, ADWIN and AWPro to obtain results presented in Section 8

Across the synthetic sudden and gradual drifting hyperplane datasets, RePro, ADWIN and AWPro (Fig. 2a, b and c respectively) maintained similar ratios of stable to unstable models regardless of window size; however, the performance of each drift detection strategy typically decreased as the window size increased. This occurs because the synthetic data streams contain simple underlying concepts, so effective predictive models can be learnt from little data; increasing the window size merely delayed drift detection. In contrast, a significant increase in performance was observed as the window size increased for the real-world following distance data streams, as the concepts to be learnt are more complex, and therefore require more data to be made available to the target learner in order to build a predictive model that effectively represents the current concept [2]. Figure 2 highlights that performance and the ratio of stable to unstable models are significantly affected by window size. However, all drift detection strategies performed similarly, indicating that the appropriate window size depends on the data stream to be learnt from, rather than on the drift detection strategy. From the results presented in Fig. 2, we selected window sizes of 30 instances for the sudden and gradual drifting hyperplane datasets, 480 instances for the smart home heating simulation datasets, encapsulating 10 days of observations, and 90 instances for the following distance datasets, encapsulating 90 s of observations.

AWPro has an additional parameter, λr, which determines whether a historical model can be reused. As λr functions similarly to RePro's use of λd when identifying recurring concepts, its value was selected based on the analysis of RePro in Figs. 1 and 2, allowing fair comparisons between the drift detection strategies.

To ensure fair comparisons, we selected a single window size per data stream to be used by all concept drift detection strategies, and a single drift sensitivity value per drift detection strategy to be used across different data streams. The parameter values displayed in Table 3 were used to obtain the results presented in Section 8. Overall, parameter values were selected such that the window size is small enough to allow swift drift detection while ensuring sufficient data is retained to build an effective predictive model; this was inferred by considering the percentage of models that were considered stable. Drift sensitivity values were selected that not only prioritised high performance but also took into account communication and computational overheads; this was inferred by considering the number of both stable and unstable models.
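For reproducibility, the selected values can be summarised in one place; the structure below merely restates Table 3 and the discussion above, with field names of our own choosing.

```python
# Parameter values used for Section 8 (restating Table 3; naming is ours).
DRIFT_DETECTOR_PARAMS = {
    "RePro": {"drift_sensitivity": 0.5},    # λd
    "ADWIN": {"drift_sensitivity": 0.02},   # δ
    "AWPro": {"drift_sensitivity": 0.02},   # δ
}
WINDOW_SIZES = {                            # shared by all drift detectors
    "hyperplane": 30,
    "heating_simulation": 480,              # 10 days of observations
    "following_distance": 90,               # 90 s of observations
}
```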

8 Experimental results

We compared BOTL, using RePro, ADWIN and AWPro as the underlying concept drift detectors, against each of the concept drift detection strategies with no knowledge transfer, and an existing state-of-the-art online transfer learning (GOTL) framework [12], using the drifting hyperplane, heating simulation, and following distance datasets. BOTL is model agnostic; however, in order to make comparisons between BOTL and existing techniques, all implementations used ε-insensitive Support Vector Regressors (SVRs) as base learners.

The underlying concept drift detection strategies, RePro, ADWIN and AWPro, are used to determine a baseline performance threshold, obtained when no knowledge is transferred [33]. For each of these drift detection strategies, parameter values were chosen based on the discussion outlined in Section 7.4 such that each drift detection strategy aims to balance the trade-off between performance and computational and communication overheads.

GOTL was designed to learn from an offline source; however, as we are considering the implications of both domains being online, we used the underlying concept drift detection strategies to detect individual concepts in the source domain. This is necessary as many online applications cannot retain an entire history of data, preventing a single model from being learnt across the entire data stream. We used the drift detection strategies to identify the model that had been used in the source for the largest proportion of the data stream, and was therefore considered the most stable. GOTL transferred this model from the source domain to the target to enhance the effectiveness of the target predictor. A small step size, Δ = 0.025, was chosen, as suggested by Grubinger et al. [11], which slowly modified the weights used to combine the source and target models.
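A minimal sketch of this weighting scheme follows. We simplify Grubinger et al.'s update to a single source weight nudged by Δ towards whichever model made the smaller error on the latest instance; the actual GOTL update over a discretised weight space differs in detail, so this only illustrates the step-wise behaviour discussed below.

```python
def gotl_step(w, src_pred, tgt_pred, y, delta=0.025):
    """One GOTL-style step: predict with a convex combination of source and
    target models, then nudge the source weight w by Δ towards whichever
    model erred less. A simplification, not Grubinger et al.'s code."""
    y_hat = w * src_pred + (1.0 - w) * tgt_pred
    if abs(src_pred - y) < abs(tgt_pred - y):
        w = min(1.0, w + delta)   # source was closer: increase its weight
    else:
        w = max(0.0, w - delta)   # target was closer: decrease source weight
    return y_hat, w
```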

When evaluating GOTL, experiments were conducted such that each data stream was paired with every other data stream as source and target domains respectively. Because GOTL transfers only the most stable model, learning in the target domain commenced only once learning in the source domain had completed, so that the most stable source model could be identified and transferred. Additionally, the performance of GOTL presented in this section takes into account the performance of both the source and target domains, as GOTL requires learning in the source, without knowledge transfer, prior to learning in the target, whereas BOTL allows both domains to benefit from knowledge transfer simultaneously.

BOTL combines knowledge via the OLS meta-learner, and therefore requires no additional parameters; however, the BOTL-C culling parameters must be defined. We set λcperf = 0 for BOTL-C.I, thereby discarding models that performed worse than the average predictor (R2 < 0). To ensure BOTL-C.II used a more aggressive approach to model culling, we increased λcperf to 0.2. Additionally, as small window sizes were used to enable swift drift detection, we used λcMI = 0.95 to allow knowledge of similar concepts to be retained by the meta-learner to aid predictions of complex concepts. However, when two models exhibited extremely high mutual information, only one was retained, as little to no additional knowledge would be provided to the meta-learner if both remained in the model set, M.
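The culling described above can be sketched as a two-stage filter; the code below is our own structuring, where `r2` maps each model to its R2 on the local window and `mi` is an assumed callable returning the mutual information between two models' predictions.

```python
from itertools import combinations

def cull_models(models, r2, mi, perf_thresh, mi_thresh=None):
    """Two-stage culling: drop poor performers, then (optionally) drop the
    weaker model of any pair whose predictions are too similar."""
    kept = [m for m in models if r2[m] >= perf_thresh]       # performance cull
    if mi_thresh is not None:                                # diversity cull (BOTL-C.II)
        for a, b in combinations(list(kept), 2):
            if a in kept and b in kept and mi(a, b) > mi_thresh:
                kept.remove(b if r2[a] >= r2[b] else a)      # keep the stronger model
    return kept

# BOTL-C.I:  kept = cull_models(M, r2, mi, perf_thresh=0.0)
# BOTL-C.II: kept = cull_models(M, r2, mi, perf_thresh=0.2, mi_thresh=0.95)
```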

When evaluating BOTL and BOTL-C variants, all data streams for a given experiment were used as source domains, with bi-directional transfer. Repeat experiments were conducted by randomising the order in which domains commenced learning, and the interval between commencements. For the baseline concept drift detection strategies without knowledge transfer, all data streams were learnt from independently.

8.1 Drifting hyperplane

We considered the effectiveness of BOTL on synthetic data created using the drifting hyperplane data generator containing two types of drift: sudden and gradual. We conducted four experiments on each type of drift using the drifting hyperplane data generator, investigating the impact of different types of noise that may be encountered when using BOTL in real-world environments.

Firstly, we used drifting hyperplane datasets containing uniform noise, denoted by datasets SuddenA and GradualA for the sudden drifting, and gradual drifting hyperplane datasets respectively. Secondly, we considered the impact of single sensor failure. Datasets of this nature are denoted as SuddenB and GradualB. Thirdly, we introduced the scenario of intermittent single sensor failure, denoted by SuddenC and GradualC. These datasets allowed us to investigate the use of BOTL within unreliable environments. Finally, we emulated single sensor deterioration by increasing the amount of noise associated with a feature vector throughout the data stream. Datasets containing this variant of sensor failure are denoted by SuddenD and GradualD for sudden and gradual drifting data streams.

For each variant of experiments, six data streams were created for each drift type. Each data stream contained five concepts, each occurring four times, with drifts encountered every 500 time steps. Sudden drifts occurred immediately, while gradual drifts occurred over a period of 100 time steps. Each data stream shared at most three concepts with another domain, ensuring some transferred models were useful to the target learner while others were not. Data streams were separated such that transfer occurred only between domains of the same drift and noise type.
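A sketch of such a generator, under our own assumptions about the concept form (random hyperplanes with responses normalised to [0,1]) and feature count, is given below; the cross-domain sharing of at most three concepts and the noise variants are omitted for brevity.

```python
import numpy as np

def drifting_hyperplane_stream(n_concepts=5, repeats=4, drift_every=500,
                               gradual_len=0, n_features=5, seed=0):
    """Yields (x, y) pairs. Concepts are random hyperplanes; drifts occur
    every `drift_every` steps, suddenly (gradual_len=0) or blended linearly
    over `gradual_len` steps. Feature count and concept form are assumptions."""
    rng = np.random.default_rng(seed)
    concepts = rng.uniform(size=(n_concepts, n_features))
    order = list(range(n_concepts)) * repeats        # each concept occurs four times
    for i, c in enumerate(order):
        w_prev = concepts[order[i - 1]] if i > 0 else concepts[c]
        for t in range(drift_every):
            alpha = min(1.0, t / gradual_len) if gradual_len else 1.0
            w = alpha * concepts[c] + (1.0 - alpha) * w_prev  # gradual blend
            x = rng.uniform(size=n_features)
            y = float(x @ w) / w.sum()               # response normalised to [0, 1]
            yield x, y

# Sudden streams: gradual_len=0; gradual streams: gradual_len=100.
```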

Tables 4 and 5 present the results obtained by the concept drift detection algorithms, with no knowledge transfer, GOTL and BOTL variants, for the sudden and gradual drifting hyperplane datasets respectively. These results indicate that GOTL obtained slightly poorer performances in comparison with the concept drift detectors without knowledge transfer, despite the most stable source model being transferred to the target domain. Although knowledge transfer was not beneficial to GOTL, at least one of the BOTL variants was able to outperform RePro, ADWIN, AWPro and GOTL with statistical t tests achieving p values < 0.01, highlighting the importance of transferring knowledge of multiple concepts bi-directionally.

Table 4 Drifting hyperplanes: average performance (R2, PMCC2, RMSE) and number of models used by the meta learner (|M|) to make predictions using no knowledge transfer, GOTL, BOTL and BOTL-C variants for six sudden drifting domains, where * indicates p < 0.01 in comparison with RePro and GOTL, and italicised values indicate the highest R2 performance
Table 5 Drifting hyperplanes: average performance (R2, PMCC2, RMSE) and number of models used by the meta learner (|M|) to make predictions using no knowledge transfer, GOTL, BOTL and BOTL-C variants for six gradual drifting domains, where * indicates p < 0.01 in comparison with RePro and GOTL, and italicised values indicate the highest R2 performance

The performance increase of BOTL over GOTL (p < 0.01) on datasets containing uniform noise (SuddenA, GradualA), and intermittent sensor failure (SuddenC, GradualC), can be attributed to the availability of all source models in the target domain. Additionally, GOTL’s step-wise weighting mechanism prevents the influence of a model changing drastically over a small period of time. This means a large amount of data must be observed after each drift to converge on an approximation of the optimal weights. To overcome this, a larger step size could be used; however, this may prevent or hinder convergence. BOTL overcomes this by using the OLS meta learner to minimise the squared error of the combined predictor with instantaneous effect.
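A minimal version of this OLS combination step might look as follows; `models` are the transferred (and local) base learners, and the weights are refit on the local window so changes take effect immediately. This is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def ols_meta_predict(models, X_window, y_window, X_new):
    """OLS meta-learner sketch: base-model predictions on the local window
    become features, and least squares finds the combination weights that
    minimise the squared error of the combined predictor."""
    P = np.column_stack([m.predict(X_window) for m in models])  # window predictions
    w, *_ = np.linalg.lstsq(P, y_window, rcond=None)            # OLS weights
    P_new = np.column_stack([m.predict(X_new) for m in models])
    return P_new @ w
```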

The performances of BOTL-C variants were also significantly better than the underlying drift detection algorithms and GOTL on these data streams, with t test p values < 0.01; however, they performed slightly worse than BOTL. Figure 3 highlights the aggressive nature of the culling techniques used by BOTL-C.I and BOTL-C.II on a sudden drifting hyperplane data stream with uniform noise. It shows that BOTL used at least four times more models than the BOTL-C variants, and highlights the correlation between the number of models used and performance. When the number of models used was small, the predictive performance of the BOTL-C variants decreased. This performance decrease can be attributed to the aggressive nature of these culling mechanisms: culling based on model performance alone prohibited the inclusion of a diverse set of models, reducing the overall predictive performance of the meta learner. When BOTL-C variants retained a larger proportion of the transferred models, a performance similar to BOTL was achieved.

Fig. 3

Sudden drifting hyperplanes: R2 performance and number of models used by BOTL and BOTL-C variants using two SuddenA data streams where vertical lines indicate concept drifts

However, BOTL and BOTL-C.I were not able to outperform RePro, ADWIN, AWPro or GOTL in the data streams containing single sensor failure (SuddenB, GradualB) and gradual sensor deterioration (SuddenD, GradualD). Although BOTL and BOTL-C.I obtained significantly lower R2 performances, their PMCC2 performance values were impacted less significantly. This indicates that the poor performance of BOTL and BOTL-C.I on these data streams can be attributed to the transfer of high volumes of models, causing the OLS meta learner to overfit. To overcome this, the performance culling threshold, λcperf, could be increased for BOTL-C.I, further restricting the number of models used as input to the meta learner. The more aggressive nature of the culling technique used by BOTL-C.II meant it outperformed the concept drift detection strategies with no knowledge transfer and GOTL.

Overall, RePro was able to achieve a better performance across all drifting hyperplane data streams compared with ADWIN and AWPro. This was caused by RePro monitoring the performance of a model to detect concept drifts, whereas ADWIN and AWPro monitor the distribution of predictive error. Detecting drifts in this way was beneficial to RePro, as unstable models were repeatedly learnt and discarded until the sliding window of data contained instances that were representative of the current concept. Although the use of RePro as the underlying drift detector outperformed ADWIN and AWPro, the difference in performance across these drift detection strategies was not statistically significant; therefore, the computational overhead of repeatedly learning unstable models may impact RePro’s applicability as the underlying drift detector for BOTL in real-world environments.

8.2 Heating simulation

Lower performances were observed across the heating simulation datasets, as they contain more complex concepts and additional noise in comparison with the drifting hyperplane datasets. The addition of knowledge transfer, using GOTL and BOTL, provided an increase in performance in comparison with using the concept drift detection strategies with no knowledge transfer, as shown in Table 6. GOTL, BOTL and BOTL-C variants, using each drift detector, performed better than the drift detection strategy alone, with GOTL achieving t test p values of p < 0.01 over RePro with no knowledge transfer.

Table 6 Heating simulations: average performance (R2, PMCC2, RMSE) and number of models used by the meta learner (|M|) to predict desired heating temperatures across five domains using no knowledge transfer, GOTL, BOTL and BOTL-C variants, where * indicates p < 0.01 in comparison with the concept drift detection algorithm and GOTL, and italicised values indicate the highest R2 performance

The use of GOTL in this setting highlighted the advantage of knowledge transfer when concepts are more complex, preventing the underlying concept drift detectors from building effective models on the window of available data. The knowledge transferred helped enhance the performance of the target predictor, even when only a single model was transferred using GOTL. In contrast, using GOTL in environments with simple concepts to be learnt, such as those present in the hyperplane datasets, provided little to no benefit. Transferring multiple models provided a significant benefit, as all BOTL variants performed better than GOTL, with t test p values < 0.01, for all concept drift detection strategies.

8.3 Following distance

Finally, we evaluated BOTL on real-world data using the following distance dataset, where the task was to predict TTC. Due to the real-world nature of this data, concept drifts occurred frequently and data streams were noisy.

Table 7 shows the performance of the drift detectors, GOTL and BOTL variants across seven data streams. These results highlight that GOTL was less suitable when the relationships between source and target concepts were unknown. Variants of BOTL and BOTL-C that used RePro or AWPro as drift detectors performed better than their respective baseline drift detector and GOTL, achieving t test p values of p < 0.01. However, the BOTL and BOTL-C.I implementations that used ADWIN as the underlying drift detector were unable to outperform the drift detector with no knowledge transfer, or GOTL. Although these variants of BOTL performed poorly, BOTL-C.II using ADWIN as the underlying drift detector outperformed ADWIN and GOTL, achieving t test p values of p < 0.01. This highlights the importance of preventing the transfer of redundant models when using BOTL in real-world environments, as the large number of transferred models likely caused the OLS meta learner to overfit the local window of data. This observation is also supported by BOTL-C.II achieving the best performance in comparison with the other BOTL variants.

Table 7 Following distances: average performance (R2, PMCC2, RMSE) and number of models used by the meta learner (|M|) to predict TTC across seven domains using no knowledge transfer, GOTL, BOTL and BOTL-C variants, where * indicates p < 0.01 in comparison to the concept drift detection algorithm and GOTL, and italicised values indicate the highest R2 performance

To investigate scalability, Figs. 4 and 5 display the average PMCC2 and R2 performance per domain, respectively, and the number of models used by the OLS meta learner to make predictions as the number of domains in the framework increased. For settings with a small number of domains, BOTL and BOTL-C variants achieved similar PMCC2 and R2 performances, outperforming their respective baseline concept drift detection algorithms. However, as the number of domains expanded, and the number of models transferred increased, the PMCC2 performance of BOTL dropped below that of the concept drift detection algorithm with no knowledge transfer. Although PMCC2 decreased gradually as the number of sources increased, the average R2 performance in Fig. 5 decreased drastically. This difference stems from the nature of the two metrics: PMCC2 ranges over [0,1], so one poorly performing domain does not greatly affect the average across all domains, whereas R2 ranges over \((-\infty, 1]\), so one poorly performing domain can dominate the average. The difference between the metrics, shown in Figs. 4 and 5, indicates that BOTL and BOTL-C.I suffered from the OLS meta learner overfitting the small window of local data when the number of models transferred was large. Culling on the performance of transferred models alone (BOTL-C.I) enabled a larger number of domains to be used in the framework; however, it cannot be considered scalable, as the performance of BOTL-C.I fell below that of the drift detector when more domains were added. BOTL-C.II culled more aggressively, using diversity alongside performance, ensuring enough beneficial knowledge was retained to enhance the target learners' performance while minimising negative transfer and preventing the OLS meta learner from overfitting the small window of locally available data.
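A small numeric illustration of this metric asymmetry, with invented values: one poorly performing domain barely moves the PMCC2 average, but dominates the R2 average because R2 is unbounded below.

```python
import numpy as np

pmcc2 = np.array([0.8, 0.8, 0.8, 0.8, 0.05])   # bounded in [0, 1]
r2    = np.array([0.8, 0.8, 0.8, 0.8, -12.0])  # unbounded below

print(pmcc2.mean())  # 0.65  -> modest drop in the average
print(r2.mean())     # -1.76 -> one bad domain drags the average below zero
```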

Fig. 4

Following distances: PMCC2 performance (with standard error) and number of models used by concept drift detection strategies with no knowledge transfer, BOTL and BOTL-C variants as the number of domains increases

Fig. 5

Following distances: R2 performance (with standard error) and number of models used by the concept drift detection strategies with no knowledge transfer, BOTL and BOTL-C variants as the number of domains increases

Additionally, Figs. 4b and 5b show the PMCC2 and R2 performance when ADWIN was used as the underlying concept drift detector. Compared with Figs. 4a and 5a, and 4c and 5c, which used RePro and AWPro respectively, the PMCC2 and R2 performances obtained by ADWIN reduced quickly, even when a small number of domains were included. This again highlights the importance of selecting a concept drift detection strategy that reuses existing models in the presence of recurring concepts, instead of relearning and transferring duplicate models.

Overall, these results indicate that the ability to consider both source and target domains to be online is beneficial. In doing so, the number of transferred models greatly increases, requiring culling mechanisms, particularly when used in noisy real-world data streams, to retain the benefit of transferring knowledge between domains.

9 Conclusion

Online domains that must learn complex models often have limited data availability, and are hindered by the presence of concept drift. We have presented the BOTL framework, and two BOTL-C variants, that enable knowledge to be transferred across online domains. We enhanced predictive performance by combining knowledge transferred from other online domains using an OLS meta learner, enabling additional knowledge to be used to minimise the error of the overarching prediction.

Using RePro as the underlying concept drift detection strategy ensured effective models were learnt from the available data; however, RePro may not be appropriate for use in applications with computational limitations, due to frequently creating unstable models during, or immediately after, periods of drift. Such applications may need to trade off performance against computation and use a drift detection strategy such as ADWIN or AWPro. ADWIN does not reuse previously learnt models when drifts are detected; therefore, the number of models transferred between domains increases when recurring concepts are encountered. This can degrade the performance of BOTL when the number of domains in the framework, or the volume of models transferred, is large.

Instead, AWPro can be used in settings where recurring concepts are likely to be encountered, or computational and communicational resources are limited.

In this paper, we chose RePro, ADWIN and AWPro as the underlying concept drift detection algorithms. Although each of these requires some domain expertise to identify appropriate parameter values, their ability to retain a history of models to prevent relearning recurring concepts is a more influential factor to consider when selecting an underlying concept drift detector for BOTL. RePro and AWPro helped to reduce the number of models transferred between domains and therefore allowed more domains to be included in the framework before the OLS meta learner suffered from overfitting.

However, in real-world environments with many domains, the number of models transferred may need to be reduced further. BOTL-C variants achieved this using common ensemble pruning strategies. These pruning strategies also required culling parameter values to be specified. To overcome the need to specify these additional parameters, we will investigate the use of task relatedness to identify similar concepts across domains without requiring parameterised thresholds in future work. This will reduce the dependency on domain expertise and will allow BOTL to be used for applications that require scalability to larger numbers of domains.