Abstract
The era of big data provides a platform for high-precision RUL prediction, but effectively extracting key degradation information remains a challenge for existing RUL prediction methods. Existing methods ignore the variability across sensors and degradation moments, assigning equal weights to all of them, which degrades the final prediction accuracy. In addition, convolutional networks lose key information through downsampling operations and suffer from insufficient feature-extraction capability. To address these issues, a two-layer attention mechanism and the Inception module are embedded in a capsule structure (the MAI-Capsule model) for lifetime prediction. The first layer, a channel attention mechanism (CAM), evaluates the influence of the various sensors on the forecast; the second layer adds a time-step attention mechanism (TSAM) to the LSTM network to weigh the contribution of the different moments of the engine's whole life cycle to the prediction while weakening the influence of environmental noise. The Inception module performs multi-scale feature extraction on the weighted data to capture the degradation information as completely as possible. Lastly, a capsule network is employed to capture the important positional information of high- and low-dimensional features, given its capacity to render the overall features of time-series data more effectively. The efficacy of the proposed model is assessed against other approaches on the publicly available C-MAPSS dataset, and the results demonstrate its excellent prediction accuracy.
1 Introduction
The forecast of the remaining useful life is one of the key elements of prognostics and health management (PHM) technology. PHM has changed the contemporary maintenance concept, realizing the transformation from scheduled maintenance to condition-based maintenance, which can comprehensively weigh maintenance costs, resource losses, and production benefits; condition-based maintenance can also address latent safety hazards and has been widely applied in domains such as electronics, the aviation industry, and military applications [1, 2]. Health prediction is one of the core missions of PHM, and its major purpose is to estimate the remaining useful life (RUL) of equipment by revealing its performance degradation pattern from information on the equipment's operating status. Accurate RUL predictions allow a sound maintenance plan and the timely replacement of safety-critical components, thereby avoiding sudden system failures [3].
To meet this need, many advanced methods for remaining useful life prediction have emerged, falling mainly into two directions: mechanism-model-based and data-driven. A model-based forecast requires an accurate construction of the dynamics of the mechanical system (or component) [4, 5]. Common model-based research methods include particle filters [6], the Weibull distribution [7], etc. However, with the rapid development of industry, even industry specialists cannot construct a comprehensive and ideal mechanism model, because highly integrated mechanical structures with increasingly complex operating mechanisms are ever more numerous.
Moreover, the fault types of complex mechanical systems are varied, and a single mechanism model cannot adapt well to complex and variable faults; mechanism-model-based solutions are therefore less flexible [8] and have certain limitations.
In recent years, data-driven schemes have been widely used for lifetime prediction [9]. Data-driven prediction methods have a lower expertise threshold and no longer require knowledge of the detailed working mechanisms of mechanical systems: one only needs to gather relevant data from the system through sensors and use a data-driven algorithm to capture the degradation trend in the data for accurate RUL prediction [10]. Compared with the mechanism-model approach, the machine learning approach abandons specific stochastic processes and establishes a direct relation between degradation data and RUL through algorithms. Commonly used lifespan prediction methods based on shallow machine learning include artificial neural networks (ANNs) [11, 12], support vector machines (SVMs) [13], random forests (RFs) [14], and hidden Markov models [15].
However, industrial systems are becoming increasingly complex, and shallow machine learning methods are ill suited to handling massive amounts of degradation data. Advances in computing have provided a platform for solving big-data problems, and deep learning has developed rapidly and been applied in many fields [16,17,18]. Convolutional neural networks (CNNs) excel at describing the spatial features of sequences, with the advantages of local perception and parameter sharing [19], and have been used for RUL prediction. However, ordinary convolutional networks tend to lose key information through the pooling and downsampling operations in their computation, a problem that the capsule network solves well. Capsule networks [20, 21] have also been used for lifetime prediction in the last two years, achieving good prediction accuracy.

Considering the time dependence of vibration signals, Malhi et al. [22] put forward a competitive-learning-based approach that employs a recurrent neural network (RNN) to capture the long-term degradation information of the machine operating state. However, long-term training of RNNs can produce vanishing or exploding gradients, affecting the online solution capability [23]. As an improvement on the traditional RNN, Yuan et al. [24] designed a network based on Long Short-Term Memory (LSTM) to process temporal data and applied it to the RUL of an aero engine in complex operating environments with high noise and multiple faults. LSTM still has limitations in RUL prediction: the traditional LSTM uses only the features learned at the last time step for the remaining-lifetime prediction [8, 25], yet features learned at other time steps of the engine's full life cycle may also contribute to the final prediction, so appropriate weights should be assigned to the more significant sensors and time steps to better grasp the key information [26]. The attention mechanism has generated widespread interest in RUL prediction [27] in recent years because it can assign weights to features according to their influence on the mechanical degradation trend. Ren et al. [28] put forward an attention-based deep learning system that simulates human attention by building an attention network that assigns corresponding weights to different features, improving accuracy.

Transfer learning has also been applied extensively to lifetime prediction in recent years because of its capacity to handle prediction tasks under variable operating conditions. Zhang et al. [29] proposed a transfer learning algorithm based on a bidirectional LSTM network for lifespan prediction, solving the cross-domain prediction task. Pan et al. [30] applied a meta-transfer learning strategy to lifetime prediction, achieving high accuracy while improving the learning of adaptive hyperparameters in small-sample settings. Mo et al. [31] proposed a model-agnostic meta-learning algorithm to address the difficulty of obtaining full-life-cycle data in practical forecasting tasks; the model constructs a pseudo-meta-task set by discriminating the similarity between cross-domain time series, further improving generalization under small-sample conditions.
Combining the advantages and shortcomings of the above methods, this paper offers a capsule-network-based deep learning approach for remaining useful life prediction, founded on a double-layer attention mechanism and multi-scale feature extraction, with the following main contributions:
1. A channel attention mechanism (CAM) is introduced, assigning corresponding weights to the raw data of the different sensors. A time-step attention mechanism (TSAM) is embedded in the LSTM network to weigh the importance of the various moments in the whole life cycle of the engine while weakening the impact of noise in the raw data.

2. The convolutional layer is replaced with the Inception V1 module to extract multi-scale features as input to the capsule network, which improves the feature-extraction capacity of the model.

3. The capsule layer portrays the overall features of the temporal data more effectively while preventing the loss of degradation information caused by the downsampling and pooling operations of traditional convolutional networks (CNNs), ensuring the integrity of the information and improving the prediction accuracy.

4. Multiple experiments of several types on publicly available datasets prove the feasibility of the proposed method.
2 Methodology
2.1 Long Short-Term Memory
Life prediction requires multiple sensors to collect temporally correlated degradation data across multiple dimensions of the predicted object, commonly vibration and acoustic information. Recurrent neural networks (RNNs) are widely used for remaining-useful-life prediction because they handle time-series data well, taking into account the correlation between successive moments of the series. However, the "long-term dependency" problem arises after RNN nodes are computed over many steps, leading to vanishing or exploding gradients. To overcome this difficulty, Liu et al. [32] used a Long Short-Term Memory (LSTM) recurrent network to forecast the remaining useful life of a fuel cell. Owing to its unique advantages, the LSTM network has also achieved great success in other fields of temporal data processing, such as video analysis [33] and face recognition [34].
Figure 1 shows the Long Short-Term Memory network. It mainly consists of a forget gate, an input gate, and an output gate. The forget gate \(f_t\) discards redundant information from previous time steps, the input gate \(i_t\) filters information, and the output gate \(O_t\) controls the output of the network. In the figure, \(C_{t - 1}\) holds the cell state of the previous moment; \(f_t\) determines how much of the previous cell state is retained in the current state \(C_t\); the input gate \(i_t\) decides how much of the current input \(x_t\) is stored in the cell state \(C_t\); and \(O_t\) controls how much of the cell state \(C_t\) is passed to the current output value \(h_t\) of the LSTM. \(\sigma\) and \(\tan h\) are activation functions. The LSTM network is computed as follows:
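In the standard LSTM formulation (with \(\odot\) denoting element-wise multiplication), the gates and states are computed as:

\(f_t = \sigma \left( w_f \cdot \left[ h_{t - 1} ,x_t \right] + b_f \right)\)

\(i_t = \sigma \left( w_i \cdot \left[ h_{t - 1} ,x_t \right] + b_i \right)\)

\(\tilde{C}_t = \tanh \left( w_C \cdot \left[ h_{t - 1} ,x_t \right] + b_C \right)\)

\(C_t = f_t \odot C_{t - 1} + i_t \odot \tilde{C}_t\)

\(O_t = \sigma \left( w_O \cdot \left[ h_{t - 1} ,x_t \right] + b_O \right)\)

\(h_t = O_t \odot \tanh \left( C_t \right)\)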
Among them, \(w_f\), \(w_i\), \(w_C\), and \(w_O\) are the weights of the forget gate, the input gate, the cell state, and the output gate, respectively; \(b_f\), \(b_i\), \(b_C\), and \(b_O\) are the corresponding biases.
Given the excellent temporal modeling capability of LSTM networks, this paper uses an LSTM network to learn the temporal characteristics of the data. To address the limitation that the LSTM network focuses only on the features learned in the last time step for the final prediction, a temporal attention mechanism is embedded in the LSTM network to assign weights to the various time steps and improve the learning capacity of the model.
2.2 Attention Mechanism
The attention mechanism [35] is a data-processing technique in deep learning and is widely used in tasks such as natural language processing, face recognition, and speech recognition. Its nature is similar to that of human vision: when people observe external things, they first look for the most representative features and then weaken the secondary ones, forming a macroscopic overall impression of the target. The self-attention mechanism used in this paper assigns computational resources to the more significant tasks, so that finite attention resources can rapidly screen the key information out of massive information. It has been successfully applied in [28, 36, 37] and many other fields. It is computed as follows:
1. Given a data sample \(H = \left\{ {h_1 ,h_2 ,...,h_i ,...,h_m } \right\}\), \(h_i \in R^m\), where m is the number of sequences of the current feature, the importance of each feature is scored by the following activation function (written here in its generic form): \(s_i = \phi \left( h_i \right)\)
\(\phi \left( \cdot \right)\) is the activation function, often called the scoring function of the attention mechanism, which judges the importance of features.
2. The obtained scores are transformed by the softmax function. On the one hand, normalization yields a probability distribution whose weights sum to 1; on the other hand, it highlights the key features: \(\alpha_i = \frac{\exp \left( s_i \right)}{\sum_{j = 1}^m \exp \left( s_j \right)}\)
3. The final weighted output features are: \(O = H \otimes A = \left\{ {\alpha_1 h_1 ,\alpha_2 h_2 ,...,\alpha_m h_m } \right\}\), where \(\otimes\) denotes the multiplication of corresponding elements.
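As a concrete illustration, the three steps above can be sketched in a few lines of Python; the tanh scoring function and the trainable parameters w and b are illustrative stand-ins for \(\phi \left( \cdot \right)\), not the paper's exact parametrization:

```python
import numpy as np

def attention_weighting(H, w, b):
    """Weight the m feature vectors in H by softmax attention scores.
    H: (m, d) array; w: (d,) vector and b: scalar are hypothetical
    trainable parameters standing in for the scoring function phi."""
    scores = np.tanh(H @ w + b)                    # step 1: score each feature
    alpha = np.exp(scores) / np.exp(scores).sum()  # step 2: softmax, weights sum to 1
    return H * alpha[:, None]                      # step 3: O = H (element-wise) A

# Example: 14 sensor sequences of length 30
H = np.random.rand(14, 30)
O = attention_weighting(H, np.random.rand(30), 0.0)
```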
2.3 Inception V1
The Inception network architecture was proposed by Google in 2014. Compared with a traditional convolutional layer, it increases both the depth and the width of the network. Inception V1 employs convolutional kernels of different sizes (and hence different receptive fields) within the same convolutional layer for multi-scale feature extraction and merges the parallel kernels, using 1×1 convolutions as a bottleneck layer. To avoid the overfitting caused by increasing the width and depth of the network, sparse connections are used instead of full connections. Viewed from another angle, this design matches intuition: much as humans perceive a thing from different perspectives and then aggregate the features learned from each perspective.
In this paper, we use the Inception module instead of the traditional convolutional layer to extract degradation features from the sensor data at multiple scales as the input to the capsule network. The structure of the Inception module is shown in Fig. 2 [38].
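The following Keras sketch shows an Inception-V1-style block of the kind depicted in Fig. 2, built from 1D convolutions since the inputs here are multivariate time series; the filter counts and input shape are illustrative, not the paper's exact values:

```python
from tensorflow.keras import layers, Input, Model

def inception_block(x, f1=16, f3=16, f5=16, fp=16):
    """Parallel 1x1, 3x3 and 5x5 convolution branches plus a pooling
    branch, each with a 1x1 bottleneck, concatenated channel-wise."""
    b1 = layers.Conv1D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(f3, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv1D(f5, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv1D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling1D(3, strides=1, padding="same")(x)
    bp = layers.Conv1D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])  # aggregate multi-scale features

inputs = Input(shape=(30, 14))          # window length x number of sensors
model = Model(inputs, inception_block(inputs))
```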
2.4 Capsule
Capsule networks were first proposed by Hinton et al. [39] in 2011, and in late 2017 Sabour, Frosst, and Hinton [40] proposed the CapsNet architecture, a new deep neural network model. Such models are now used in several fields, such as image recognition, lifetime prediction [20, 21], and fault diagnosis. Unlike traditional network neurons, a capsule takes vectors as input and output and benefits from a dynamic routing algorithm that discards the pooling operation of traditional convolutional networks, maximally preserving the integrity of the prediction information. As in Eq. (9), traditional convolutional networks treat high-level and low-level features as a simple weighted sum, whereas capsule networks are good at capturing the positional relationships between high-level and low-level features:

\(x_l^{\left( m \right)} = \alpha \left( \sum_d x_{l - 1}^{\left( d \right)} * w_l^{\left( {d,m} \right)} + b_l^{\left( m \right)} \right)\)  (9)
where * denotes the convolution operation, \(x_l^{\left( m \right)}\) is the m-th feature map output by the l-th convolutional layer, \(\alpha\) is a nonlinear activation function, \(w_l^{\left( {d,m} \right)}\) is the weight matrix of the convolution kernel, and \(b_l^{\left( m \right)}\) is the bias. Figure 3 shows a schematic diagram of a basic capsule network, consisting of a convolutional layer, a primary capsule layer, and a digital capsule layer.
To understand the capsule network in more depth, it is essential to understand its learning strategy, i.e., the dynamic routing between capsules shown in Fig. 4 [40]. Capsules can be regarded as groups of neurons arranged as vectors: the vector dimensions represent the spatial location information of features, and the vector length indicates the probability that a feature exists. Because these vectors are directional, the capsule guarantees translational equivariance during feature extraction, whereas traditional convolutional networks have translational invariance. CapsNet is therefore used as the final high-dimensional feature extraction module in this paper. The iterative computation of dynamic routing can be expressed as:
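Following the standard formulation of dynamic routing [40], Eqs. (10)-(13) read:

\(u_{j|i} = w_{ij} u_i\)  (10)

\(s_j = \sum_i c_{ij} u_{j|i}\)  (11)

\(v_j = \frac{\left\| s_j \right\|^2 }{1 + \left\| s_j \right\|^2 } \frac{s_j }{\left\| s_j \right\|}\)  (12)

\(c_{ij} = \frac{\exp \left( b_{ij} \right)}{\sum_k \exp \left( b_{ik} \right)}, \qquad b_{ij} \leftarrow b_{ij} + u_{j|i} \cdot v_j\)  (13)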
In Eq. (10), \(w_{ij}\) is the prediction matrix, which is multiplied by the input vector \(u_i\) to obtain the high-level feature prediction vector \(u_{j|i}\). Eq. (11) is the weighted sum of all prediction vectors, giving the vector \(s_j\). \(c_{ij}\) is the coupling coefficient, which governs the information transfer between the low-level and high-level capsules and is updated through Eq. (13). The initial value of \(b_{ij}\) is 0; it reflects, to some extent, the agreement between the output vector and the input vector. The vector \(v_j\) is obtained by the squashing function of Eq. (12), which compresses its length into (0,1).
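A minimal NumPy sketch of the squashing function and the routing loop (three iterations, as set later in Sect. 3.3) may make the procedure concrete; shapes and variable names are illustrative:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Eq. (12): compress vector length into (0, 1) while keeping direction."""
    n2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """Route prediction vectors u_hat of shape (num_low, num_high, dim)."""
    b = np.zeros(u_hat.shape[:2])                            # b_ij initialised to 0
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True) # coupling coefficients c_ij
        s = (c[..., None] * u_hat).sum(axis=0)               # Eq. (11): weighted sum
        v = squash(s)                                        # Eq. (12)
        b = b + (u_hat * v[None, ...]).sum(axis=-1)          # Eq. (13): agreement update
    return v

u_hat = np.random.rand(128, 10, 8)  # e.g. 128 primary -> 10 digit capsules of dim 8
v = dynamic_routing(u_hat)          # (10, 8) output capsule vectors
```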
3 The Proposed MAI-Capsule Model
As indicated in Fig. 5 [43], the proposed MAI-Capsule framework comprises three distinct modules: the double-layer attention module, the Inception multi-scale feature-extraction module, and the capsule network. The normalized sensor data first pass through a CAM, which assesses the impact of the various sensors' data on the prediction. The TSAM is then embedded after the LSTM network, which learns the temporal characteristics of the data, to compute the influence of the different moments of the engine life cycle on the prediction. The Inception module extracts multi-scale features from the weighted data output by the two-layer attention structure, which are then fed into the capsule network for regression prediction.
3.1 Attention Mechanism Model
In reality, a number of variables can affect the sensors' data collection. For instance, information gathered at different sites and by different kinds of sensors affects the prediction differently, and even the data gathered by the same sensor differ across the moments of the mechanical life cycle. In traditional neural networks, the data collected by different sensors, and by the same sensor at different moments, are weighted equally in prediction, so important degradation information is ignored and non-essential information is amplified, reducing the precision of the final forecast.
First consider the channel attention mechanism. A data sample can be written as \(x = \left\{ {x_1 ,...,x_i ,...,x_m } \right\}\), and the data from the different sensors at moment t as \(x_t = \left\{ {x_{1,t} ,...,x_{i,t} ,...,x_{m,t} } \right\}\), where m is the number of sensors.
1. The sensor sample data at moment t are scored through Eq. (15), written here in the generic form of Sect. 2.2: \(S_{i,t} = \phi \left( x_{i,t} \right)\)
The score of all sensors at moment t is obtained as \(S_t = \left\{ {S_{1,t} ,S_{2,t} ,...,S_{i,t} ,...,S_{m,t} } \right\}\).
2. After obtaining the score of the sensor at moment t, Eq. (16) transforms it into the weight value of the i-th sensor at moment t via the softmax: \(\alpha_{i,t} = \frac{\exp \left( S_{i,t} \right)}{\sum_{j = 1}^m \exp \left( S_{j,t} \right)}\)
That is, the total sensor weights at time t are denoted as \(\alpha_t = \left( {\alpha_{1,t} ,...,\alpha_{i,t} ,...,\alpha_{m,t} } \right)\).
3. The average weight of the i-th sensor is calculated by Eq. (17): \(\overline{\alpha_i } = \frac{1}{T} \sum_{t = 1}^T \alpha_{i,t}\)

The average weight of every sensor is denoted as \(\alpha = \left\{ {\overline{\alpha_1 },\overline{\alpha_2 },...,\overline{\alpha_m }} \right\}\), where T is the total number of cycles.
4. The output of the first-level attention mechanism, Eq. (18), can be expressed as: \(x_i^{\prime} = \overline{\alpha_i }\, x_i\)
Following CAM processing, unnecessary information is suppressed and the degradation information from the important channels is given greater weight.
Next, the TSAM further weights the output of the LSTM network along the time dimension. The data collected by the same sensor at different moments of engine operation contribute differently to the final prediction, so the TSAM is used to capture the more critical time points and improve the prediction accuracy. The sample data are \(x^{\prime} = \left\{ {x_1^{\prime} ,...,x_i^{\prime} ,...,x_m^{\prime} } \right\}^{\text{T}} = \left\{ {x_1^{\prime} ,...,x_t^{\prime} ,...,x_T^{\prime} } \right\}\), where \(\text{T}\) denotes transposition and T is the period of the time cycle. The data of the i-th sensor at the different moments are \(x_i^{\prime} = \left\{ {x_{i,1}^{\prime} ,...,x_{i,t}^{\prime} ,...,x_{i,T}^{\prime} } \right\}\).
1. As with the CAM, the scores of the sensor's different time steps are first calculated according to Eq. (19), again in the generic form of Sect. 2.2: \(S_{i,t} = \phi \left( x_{i,t}^{\prime} \right)\)
Here \(S_i = \left\{ {S_{i,1} ,S_{i,2} ,...,S_{i,t} ,...,S_{i,T} } \right\}\), the score of the i-th sensor at each moment.
2. The weight of the i-th sensor at moment t is calculated according to Eq. (20): \(\eta_{i,t} = \frac{\exp \left( S_{i,t} \right)}{\sum_{k = 1}^T \exp \left( S_{i,k} \right)}\)
That is, the weight value for all moments of the ith sensor is \(\eta_i = \left\{ {\eta_{i,1} ,\eta_{i,2} ,...,\eta_{i,t} ,...,\eta_{i,T} } \right\}\).
3. Eq. (21) yields the average weight at any given instant t over all sensors: \(\overline{\eta }_t = \frac{1}{m} \sum_{i = 1}^m \eta_{i,t}\)

The weight means over all time steps are \(\eta = \left\{ {\overline{\eta }_1 ,\overline{\eta }_2 ,...,\overline{\eta }_t ,...,\overline{\eta }_T } \right\}\).
4. The output of the second layer of the attention mechanism, Eq. (22), can be expressed as: \(x_t^{\prime\prime} = \overline{\eta }_t\, x_t^{\prime}\)
The weighted data obtained after these two layers of attention highlight the more meaningful degradation information concealed in the data and lay the foundation for the subsequent feature extraction.
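A compact NumPy sketch of the two weighting stages (Eqs. (15)-(22)) on a single \(T \times m\) sample is given below. Note that in the actual model the TSAM operates on the LSTM outputs rather than directly on the CAM output, and tanh again stands in for the unspecified scoring function \(\phi \left( \cdot \right)\):

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_layer_attention(x, score=np.tanh):
    """Sketch of the CAM + TSAM weighting on one sample x of shape (T, m):
    T time steps, m sensors. `score` is an illustrative stand-in for phi."""
    alpha = softmax(score(x), axis=1)       # Eq. (16): weights across sensors at each t
    alpha_bar = alpha.mean(axis=0)          # Eq. (17): average weight per sensor
    x1 = x * alpha_bar[None, :]             # Eq. (18): CAM output
    eta = softmax(score(x1), axis=0)        # Eq. (20): weights across time per sensor
    eta_bar = eta.mean(axis=1)              # Eq. (21): average weight per time step
    return x1 * eta_bar[:, None]            # Eq. (22): TSAM output

out = two_layer_attention(np.random.rand(30, 14))  # 30 steps, 14 sensors
```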
3.2 The Inception Module
The sensor-collected data are quite complex and contain deep degradation information. To fully extract the hidden regression information from the weighted data, the capacity of the network must be increased, i.e., its depth and width (the number of layers and neurons). However, this increases the computational burden and invites overfitting. The Inception module solves this problem well: despite its sparse network structure it generates dense data, improving the effectiveness of the neural network while guaranteeing efficient use of computational resources.
In this paper, Inception modules replace the traditional deep convolutional layers, using convolutional kernels of different sizes connected in parallel. As shown in Fig. 2 (Inception V1), 1×1 convolutions reduce the data dimensionality, the feature borders are handled with a "same" padding strategy, and finally a Concatenate layer aggregates the learned multi-scale features.
3.3 Capsule Network
The main structure of the capsule network consists of a primary capsule layer and a digital capsule layer. The primary capsule layer is made up of a convolutional layer and a reshaping layer. The convolutional layer uses 32 convolution kernels of size 10 with stride 1 to mine low-dimensional information from the output of the Inception module. The mined low-dimensional features are reshaped into 8-dimensional primary capsules by the reshaping layer and serve as input to the second-level digital capsule layer. Finally, to extract high-dimensional features while preserving the overall positional hierarchy of the temporal data, the digital capsule layer employs ten 8-dimensional capsules. The number of capsules affects the prediction accuracy to a certain extent, and a later comparison experiment takes the number of capsules as the variable.
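A hedged Keras sketch of the primary capsule layer described above; the input shape is illustrative, and the digital capsule layer (10 capsules of dimension 8, with dynamic routing) would follow as a custom layer and is omitted:

```python
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(30, 64))   # Inception output: (time, channels), illustrative
# 32 kernels of size 10, stride 1, mining low-dimensional features
x = layers.Conv1D(32, 10, strides=1, activation="relu")(inputs)
# group the scalar outputs into 8-dimensional primary capsules
primary_caps = layers.Reshape((-1, 8))(x)
model = Model(inputs, primary_caps)
```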
The use of capsule networks makes it possible to describe the general properties of time-series data more effectively. The dynamic routing algorithm (DRA) updates the coupling coefficients between the two capsule layers [40], and this paper sets 3 rounds of routing. Through the ongoing iterative revisions, the coupling between related capsules is strengthened while that between unrelated capsules is weakened [21]; the agreement value \(b_{ij}\) in Eq. (13) characterizes this property.
Since lifetime prediction is a typical regression problem, the mean square error (MSE) is selected as the loss function [34]. The optimizer is Adam. To improve the model's predictions, a learning-rate decay strategy is added, with the initial value set to 0.001 and decaying to 0.0001, where it stops.
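In Keras, this training setup might look as follows; the decay trigger (a plateauing validation loss) is an assumption, since the text only gives the start and end learning rates, and `model` stands for the assembled MAI-Capsule network:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
# decay the learning rate from 1e-3, flooring it at 1e-4
lr_decay = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=5, min_lr=1e-4)
# model.fit(X_train, y_train, validation_split=0.1, callbacks=[lr_decay])
```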
4 Experimental study and discussion
4.1 Dataset
In this section, the suggested model is empirically validated on the public NASA C-MAPSS dataset [41], together with a description of the assessment metrics, the experimental results, and an analytical discussion. Keras was used to construct the network.
The commercial aero-engine dataset provided by NASA [41] uses 21 sensors located at different positions to record the degradation progression of aircraft engines. The dataset simulates the engines' degradation behavior under various working situations and flight modes, each sub-dataset with its own collection of fault modes. Figure 6 displays a simplified diagram of the aero-engine corresponding to the dataset, whose principal elements are the fan, low-pressure compressor, nozzle, high-pressure rotor, etc. Each engine starts in an initially unknown but healthy state; when a failure occurs, it affects engine performance and the sensors record abnormal data. The life cycle of an engine runs from the healthy condition until it stops, and the training data cover the whole life cycle of each engine. The C-MAPSS dataset comprises four sub-datasets, detailed in Table 1.
Each subset of the C-MAPSS dataset contains 26 columns in both the training and test data: the first five columns give the engine number, the operating cycle, and three operational settings, and the following 21 columns give the degradation data gathered by the 21 sensors spread across various locations. Certain sensors (for example in dataset FD001) record constant values from the beginning of engine operation to the end of its life and therefore have no impact on the prediction. The 14 useful sensors selected in the experimental part of this paper are 2, 3, 4, 7, 8, 9, 11, 12, 13, 14, 15, 17, 20, and 21.
In practical applications, once a failure occurs, the engine's operating performance decreases significantly until the end of its life, when it fails completely. Based on research [42], Heimes argues that the RUL of an engine in a healthy state should be set between 120 and 130; accordingly, the maximum RUL in this article is capped at 125.
4.2 Data Preprocessing
In this paper, we apply the sliding-time-window approach of the literature [36] to produce data samples; a sample instance [43] is shown in Fig. 7. Given a complete life cycle of T cycles, a window starting at time step S, and a window length of p, the sample size is p × m, where m is the number of sensors, and each sample's RUL label is \(T - S - p\).
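A minimal sketch of this sample-generation scheme for one engine, including the RUL cap of 125 from Sect. 4.1:

```python
import numpy as np

def make_samples(series, p=30, rul_cap=125):
    """Slide a window of length p over one engine's (T, m) sensor series.
    The label of the window starting at step S is T - S - p, capped at 125
    (piecewise-linear RUL, per Sect. 4.1)."""
    T = len(series)
    X, y = [], []
    for S in range(T - p + 1):
        X.append(series[S:S + p])           # one p x m sample
        y.append(min(T - S - p, rul_cap))   # its RUL label
    return np.array(X), np.array(y)
```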
Normalization is required before producing the data samples. This paper uses maximum-minimum normalization: formula (23) normalizes the raw data of each sensor to [0, 1]:

\(\tilde{x}_{m,t} = \frac{x_{m,t} - \min \left( {x_m } \right)}{\max \left( {x_m } \right) - \min \left( {x_m } \right)}\)

where \(\max \left( {x_m } \right)\) and \(\min \left( {x_m } \right)\) are the maximum and minimum values of sensor m, and \(x_{m,t}\) is the value of sensor m at instant t.
4.3 Evaluation Metrics
In this paper, the root mean square error (RMSE) [35] and the scoring function [41] are selected to assess the predictive performance of the model.
Here N is the total number of test samples, and \(d_i = y_{pred}^i - y_{true}^i\) measures the disparity between the predicted and actual values.
RMSE is computed by formula (24):

\(RMSE = \sqrt{ \frac{1}{N} \sum_{i = 1}^N d_i^2 }\)
The Score, formula (25), is expressed as:

\(Score = \sum_{i = 1}^N s_i , \qquad s_i = \begin{cases} e^{ - d_i /13} - 1, & d_i < 0 \\ e^{\, d_i /10} - 1, & d_i \ge 0 \end{cases}\)
The scoring function penalizes under-prediction (\(d_i < 0\)) and over-prediction (\(d_i > 0\)) with different severity: the penalty for over-prediction is higher, since its real-world repercussions are more dire. The smaller the RMSE and Score, which jointly assess the predictive accuracy of the model, the better.
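A short sketch of both metrics, following the standard C-MAPSS definitions of Eqs. (24) and (25):

```python
import numpy as np

def rmse(y_pred, y_true):
    d = y_pred - y_true
    return np.sqrt(np.mean(d ** 2))

def score(y_pred, y_true):
    """Asymmetric scoring function: late (over-)predictions d > 0
    are penalised more heavily than early ones."""
    d = y_pred - y_true
    return np.sum(np.where(d < 0, np.exp(-d / 13.0), np.exp(d / 10.0)) - 1.0)
```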
4.4 Experimental Implementation and Results
To verify the superiority of the model, five experiments are set up. The first investigates the effect of producing data samples with different time windows on the final prediction performance. The second studies the loss differences between different activation functions. The third examines the influence of the number of digital capsules on the prediction. The fourth is an ablation experiment that aims to prove the contribution of each of the model's constituent parts. The last contrasts the proposed model with other forecasting models.
4.4.1 Experiments with different time windows
Different time-window sizes yield data samples containing different amounts of information. Too small a window fails to capture the key information; too large a window, although rich in information, increases the computational load of the model, which in turn affects the final prediction performance. Choosing a suitable time window is therefore crucial. This experiment collects data samples with different time windows on datasets FD001 and FD002; the final results are shown in Fig. 8, where the left Y-axis gives the RMSE and the right Y-axis the Score. The FD001 results show that both RMSE and Score are minimized when the window size is set to 30. For FD002 (right), the Score is also minimal at a window of 30, while the RMSE is smallest at a window of 20, though the difference from 30 is small. Combining the two experiments, the final window size chosen for this study is 30.
4.4.2 Loss differences and prediction accuracy based on different activation functions
In the experiments, it was found that the choice of activation function affects both the final prediction accuracy and the convergence speed of the model. This experiment compares the convergence speed and prediction accuracy under four activation functions: PReLU, ELU, ReLU, and SeLU. The results are given in Table 2 and Fig. 9. To minimize the effect of random errors, all results are averaged over 10 runs. According to Table 2, the four activation functions have little effect on the final results, but ReLU works best, with a 0.5% reduction in RMSE and a 4.02% reduction in Score compared with SeLU. As also shown in Fig. 9, the model converges fastest with ReLU. In summary, ReLU is chosen as the activation function in this study.
4.4.3 The influence of the number of capsules on the prediction effects as well as the training time
During the experiments, the number of second-layer capsules was found to have a considerable impact on the model's final prediction performance and training time. More capsules mean more parameters to train, and the increased network complexity takes more time to train, so a suitable number of capsules must be chosen. Four capsule counts (6, 8, 10, and 12) were tested on the FD001 dataset, each repeated 10 times and averaged; the final results are shown in Figs. 10 and 11. Figure 10 shows the effect of the number of capsules on RMSE and training time: the RMSE is smallest, with the most concentrated distribution, when the number of capsules is 10. Although training then takes longer than with 6 or 8 capsules, with the continuing growth in computing power this cost can be neglected. Figure 11 is consistent with Fig. 10: with 10 capsules the metrics are lowest. Summarizing these results, the number of digital capsules in this paper is set to 10. (Note: the green line in Figs. 10 and 11 is the median, and the black box is the mean.)
4.4.4 Ablation experiments of the proposed model
This set of experiments demonstrates how each module of the proposed model affects the accuracy of the final prediction. Five groups of experiments were set up: the model without any attention mechanism (I-capsule), with channel attention only (CAI-capsule), with time-step attention only (TAI-capsule), with stacked convolutional layers instead of the Inception module (MAN-capsule), and the proposed model (MAI-capsule). To enhance reliability, validation was performed on each of the four datasets, and each model was averaged over 10 runs. The results are shown in Figs. 12 and 13, corresponding to the data in Table 3, where Mean is the average value and STD the standard deviation; the STD reflects the stability of the model to a certain extent, the smaller the more stable. The results show that the models with an attention mechanism outperform the model without one, and accuracy rises further as attention layers are added, demonstrating that assigning weights via the attention mechanism conspicuously improves the forecast performance. In addition, the Inception module, with its parallel convolutional branches, has fewer parameters and is easier to train than a stacked, serial convolutional network, and it significantly improves the prediction results.
4.4.5 Comparison between different forecasting models
In this round of experiments, five models were selected for comparison to prove the advantage of the proposed prediction model: the shallow learning model random forest (RF) [14] and four deep learning models, namely the multilayer self-attention and temporal convolutional network (MLSA) [35], the deep capsule network (NDCN) [20], the gated-attention capsule network (GAM-CapsNet) [21], and the deep separable convolutional network (DSCN) [17]. To ensure accuracy, experiments were conducted on the four datasets, each averaged over ten runs. The final data are shown in Tables 4 and 5, corresponding to Figs. 14 and 15, respectively. As Table 4 and Fig. 14 show, the RMSE improves significantly on FD001, FD002, and FD004 compared with the other models; on FD004 in particular, the RMSE improves by 6.33% over the best of the other models. According to Table 5 and Fig. 15, the Score improvement is significant on FD001, FD003, and FD004; on FD004 it is 27.05% over the MLSA model. The STD is also markedly better than that of the other models. In summary, the proposed model achieves the expected results and is more stable.
5 Analysis
The fitted trajectories between the real and predicted RUL for the four sub-datasets are displayed in Fig. 16. The predictions are good on all four sub-datasets, and the fit is best on FD001 and FD003. The poorer fit on FD002 and FD004 is due to the fact that FD001 and FD003 have only one operating condition, while FD002 and FD004 have six, so their operating conditions are more complicated, which increases the prediction uncertainty.
Figure 17 shows the normal-distribution plots of the error between the real and predicted RUL for the four sub-datasets, to present the final prediction more clearly. The results are obtained as the arithmetic difference between the predicted and real values. The prediction errors of the four datasets are distributed around 0, and the maximum error is below 60, further proving the superiority of the proposed model. The errors concentrate slightly below 0, i.e., under-prediction (predicted RUL < real RUL); since over-prediction (predicted RUL > real RUL) is much more harmful than under-prediction in real production, this demonstrates the reliability of the model.
To illustrate the two-layer attention mechanism in the model more intuitively, a time-series sample from FD001 is chosen for a simple visualization. In Fig. 18, panel (a) shows the data of a sample after normalization, and panel (b) shows the output of the normalized data after the channel attention layer. Comparing the two heat maps clearly shows a change of color, especially for sensors 4, 8, and 14, because the channel attention mechanism assigns different weights to the different sensors, so the raw data of each sensor contribute differently to the final prediction. Panel (c) shows the data after the time-step attention layer: the weighted data change markedly, focusing more on the moments that contribute most to the final prediction. The two attention layers work together to capture the more important degradation information and greatly improve the forecast accuracy of the model.
6 Conclusion
In this paper, an MAI-Capsule model is proposed to improve the accuracy of remaining useful life prediction. The two-layer attention network separately evaluates the effects on the final prediction of the different sensors' data and of the different moments of the same sensor's data. The Inception module extracts multi-scale features from the weighted data, which are finally fed into the capsule network to preserve the general characteristics of the time-series data more effectively. The feasibility of the model was verified on a publicly available turbofan-engine dataset, and the advantage of the proposed model was validated by comparison with other methods. In addition, experiments on the number of capsules examined the model training time, which is crucial for future real-time prediction.
The main advantages of the proposed model are as follows. First, the channel attention mechanism processes the data and highlights the contribution of the important channels. Second, adding a temporal attention mechanism to the LSTM effectively reduces the influence of environmental noise while screening out the features of important moments. Third, the multi-scale feature-extraction module retrieves degradation information more comprehensively, from multiple perspectives. Finally, the capsule network better identifies the spatial location relationships of the overall features. Combining all the test results, the proposed method significantly improves prediction performance and can be applied in future remaining-useful-life prediction. However, the model still has room for improvement: the training time grows under complex working conditions, and the random initialization of the network parameters increases the uncertainty of the final prediction. Future research will concentrate on further reducing model complexity, shortening training time, and optimizing parameter initialization.
Data Availability
The authors certify that the publication contains data that support the present study's conclusions. The original data supporting these conclusions are available from the corresponding author upon reasonable request.
References
Sheppard JW, Kaufman MA, Wilmer TJ (2009) Standards for prognostics and health management. IEEE Aerosp Electron Syst Mag 24(9):34–41. https://doi.org/10.1109/maes.2009.5282287
Brown ER et al (2007) Prognostics and health management a data-driven approach to supporting the F-35 lightning II. IEEE Aerosp Conf. https://doi.org/10.1109/aero.2007.352833
Benkedjouh T et al (2013) Remaining useful life estimation based on nonlinear feature reduction and support vector regression. Eng Appl Artif Intell 26(7):1751–1760. https://doi.org/10.1016/j.engappai.2013.02.006
Qian Y, Yan R, Gao RX (2017) A multi-time scale approach to remaining useful life prediction in rolling bearing. Mech Syst Signal Process 83:549–567. https://doi.org/10.1016/j.ymssp.2016.06.031
Zhai Q, Ye ZS (2017) Prediction of deteriorating products using an adaptive wiener process model. IEEE Trans Indus Inform. 13(6):2911–2921. https://doi.org/10.1109/tii.2017.2684821
Jouin M et al (2016) Particle filter-based prognostics: Review, discussion and perspectives. Mech Syst Signal Process 72:2–31. https://doi.org/10.1016/j.ymssp.2015.11.008
Ali JB et al (2015) Accurate bearing remaining useful life prediction based on Weibull distribution and artificial neural network. Mech Syst Signal Process 56–57:150–172. https://doi.org/10.1016/j.ymssp.2014.10.014
Chen Z et al (2021) Machine remaining useful life prediction via an attention based deep learning approach. IEEE Trans Indust Electron. 68(3):2521–2531. https://doi.org/10.1109/tie.2020.2972443
Liao H, Zhao W, Guo H (2006) Predicting remaining useful life of an individual unit using proportional hazards model and logistic regression model. In: Annual Reliability and Maintainability Symposium (RAMS '06), 2006. https://doi.org/10.1109/rams.2006.1677362
Tran H et al (2020) A novel machine-learning based on the global search techniques using vectorized data for damage detection in structures. Int J Eng Sci 157:103376. https://doi.org/10.1016/j.ijengsci.2020.103376
Khatir S, Boutchicha D, Le Thanh C, Tran-Ngoc H, Nguyen TN, Abdel-Wahab M (2020) Improved ANN technique combined with Jaya algorithm for crack identification in plates using XIGA and experimental analysis. Theor Appl Fracture Mech. 107:102554. https://doi.org/10.1016/j.tafmec.2020.102554
Zenzen R et al (2020) A modified transmissibility indicator and Artificial Neural Network for damage identification and quantification in laminated composite structures. Comp Struct. https://doi.org/10.1016/j.compstruct.2020.112497
Tran VT et al (2012) Machine performance degradation assessment and remaining useful life prediction using proportional hazard model and support vector machine. Mech Syst Signal Process 32:320–330. https://doi.org/10.1016/j.ymssp.2012.02.015
Zhang C et al (2017) Multiobjective deep belief networks ensemble for remaining useful life estimation in prognostics. IEEE Trans Neural Netw Learn Syst 28(10):2306–2318. https://doi.org/10.1109/tnnls.2016.2582798
Dong M, He D (2007) A segmental hidden semi-Markov model (HSMM)-based diagnostics and prognostics framework and methodology. Mech Syst Signal Process 21(5):2248–2266. https://doi.org/10.1016/j.ymssp.2006.10.001
Ding H et al (2020) A remaining useful life prediction method for bearing based on deep neural networks. Measurement 172:108878. https://doi.org/10.1016/j.measurement.2020.108878
Wang B et al (2019) Deep separable convolutional network for remaining useful life prediction of machinery. Mech Syst Signal Process 134:106330. https://doi.org/10.1016/j.ymssp.2019.106330
Wang S, Wang H, Zhou Y, Liu J, Dai P, Du X, Wahab MA (2021) Automatic laser profile recognition and fast tracking for structured light measurement using deep learning and template matching. Measurement 169:108362. https://doi.org/10.1016/j.measurement.2020.108362
Babu GS, Zhao P, Li XL (2016) Deep convolutional neural network based regression approach for estimation of remaining useful life. In: Database Systems for Advanced Applications. LNCS, vol 9642. Springer, Cham, pp 214–228. https://doi.org/10.1007/978-3-319-32025-0_14
Palazuelos RT, Droguett EL, Pascual R (2020) A novel deep capsule neural network for remaining useful life estimation. J Risk Reliab 234(1):151–167. https://doi.org/10.1177/1748006x19866546
Zhao C et al (2022) A novel remaining useful life prediction method based on gated attention mechanism capsule neural network. Measurement. 189:110637. https://doi.org/10.1016/j.measurement.2021.110637
Malhi A, Yan R, Gao RX (2011) Prognosis of defect propagation based on recurrent neural networks. IEEE Trans Instrum Meas 60(3):703–711. https://doi.org/10.1109/tim.2010.2078296
Khan AT et al (2021) Enhanced beetle antennae search with zeroing neural network for online solution of constrained optimization. Neurocomputing 447:294–306
Mei Y, Wu Y, Li L (2016) Fault diagnosis and remaining useful life estimation of aero engine using LSTM neural network. In: IEEE International Conference on Aircraft Utility Systems (AUS). https://doi.org/10.1109/aus.2016.7748035
Cao Y et al (2021) A novel temporal convolutional network with residual self-attention mechanism for remaining useful life prediction of rolling bearings. Reliab Eng Syst Saf 215:107813. https://doi.org/10.1016/j.ress.2021.107813
Yu W et al (2021) Multiscale attentional residual neural network framework for remaining useful life prediction of bearings. Measurement 177:109310. https://doi.org/10.1016/j.measurement.2021.109310
Zhang J et al (2022) Prediction of remaining useful life based on bidirectional gated recurrent unit with temporal self-attention mechanism. Reliab Eng Syst Safety 221:108297. https://doi.org/10.1016/j.ress.2021.108297
Ren L, Liu Y, Huang D et al (2022) A novel multichannel temporal attention-based network for industrial health indicator prediction. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/tnnls.2021.3136768
Zhang A, Wang H, Li S et al (2018) Transfer learning with deep recurrent neural networks for remaining useful life estimation. Appl Sci 8(12):2416. https://doi.org/10.3390/app8122416
Pan T, Chen J, Ye Z et al (2022) A multi-head attention network with adaptive meta-transfer learning for RUL prediction of rocket engines. Reliab Eng Syst Saf 225:108610. https://doi.org/10.1016/j.ress.2022.108610
Mo Y et al (2022) Few-shot RUL estimation based on model-agnostic meta-learning. J Intell Manuf. https://doi.org/10.1007/s10845-022-01929-w
Liu J et al (2019) Remaining useful life prediction of PEMFC based on long short-term memory recurrent neural networks. Int J Hydrogen Energy 44(11):5470–5480. https://doi.org/10.1016/j.ijhydene.2018.10.042
Archana N, Malmurugan N (2020) Multi-edge optimized LSTM RNN for video summarization. J Ambient Intell Humaniz Comput 12(5):5381–5395. https://doi.org/10.1007/s12652-020-02025-8
Kim ST, Kim DH, Yong MR (2016) Facial dynamic modelling using long short-term memory network: Analysis and application to face authentication. In: 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS) IEEE, 22 December, 2016. https://doi.org/10.1109/btas.2016.7791172
Zhiwu S et al (2022) Machine remaining life prediction based on multi-layer self-attention and temporal convolution network. Complex Intell Syst 8(2):1409–1424. https://doi.org/10.1007/s40747-021-00606-4
Du W, Wang Y, Yu Q (2017) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans Image Process 27(3):1347–1360. https://doi.org/10.1109/tip.2017.2778563
Yuan H et al (2022) Dynamic pyramid attention networks for multi-orientation object detection. Journal of Internet Technology 23(1):79–90
Szegedy C, et al. (2015) Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) IEEE, 15 October, 2015. https://doi.org/10.1109/cvpr.2015.7298594
Hinton GE, Krizhevsky A, Wang SD (2011). Transforming Auto-Encoders. In: Artificial Neural Networks and Machine Learning–ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland, pp. 44–51, June 14–17, https://doi.org/10.1007/978-3-642-21735-7_6
Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. In: Advances in Neural Information Processing Systems 30 (NIPS 2017)
Saxena A, Goebel K, Simon D, et al. (2008) Damage propagation modeling for aircraft engine run-to-failure simulation. In: IEEE 2008 international conference on prognostics and health management, pp.1–9. https://doi.org/10.1109/phm.2008.4711414
Heimes FO (2008) Recurrent neural networks for remaining useful life estimation. In: International Conference on Prognostics and Health Management, Denver, CO, USA. https://doi.org/10.1109/phm.2008.4711422
Shang Z, Feng Z (2024) Multiscale capsule networks with attention mechanisms based on domain-invariant properties for cross-domain lifetime prediction. Digital Signal Process 146:104368
Acknowledgements
This work was financially supported by the Key Program of the Natural Science Foundation of Tianjin (21JCZDJC00770) and the joint funded project of the National Natural Science Foundation of China and the Civil Aviation Administration of China (U1733108).
Funding
The Key Program of the Natural Science Foundation of Tianjin (21JCZDJC00770); joint funded project of the National Natural Science Foundation of China and the Civil Aviation Administration of China (U1733108).
Author information
Contributions
S.Z.W. was responsible for guiding the overall direction, F.Z.H. for writing the paper and creating the charts, L.W.X. for reviewing the paper, and W.Z.H. and C.H.C. for other related work.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Human and Animal participants
This article contains no research involving human participants and/or animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shang, Z., Feng, Z., Li, W. et al. Capsule Network Based on Double-layer Attention Mechanism and Multi-scale Feature Extraction for Remaining Life Prediction. Neural Process Lett 56, 195 (2024). https://doi.org/10.1007/s11063-024-11651-8