1. Introduction
Aero-engines are of utmost importance to the proper functioning of aircraft, given their intricate design and the possibility of catastrophic malfunctions. These engines endure extended periods of operation under severe environmental conditions, including elevated temperatures, pressures, velocities, vibrations, and loads [1,2]. Therefore, it is crucial to guarantee the dependability of aero-engines in accordance with rigorous safety protocols by continuously monitoring and forecasting their fundamental parameters [3,4,5].
Exhaust gas temperature (EGT) refers to the temperature of the gas exiting the turbine unit and is regarded as one of the most crucial structural and operational metrics of the performance and efficiency of gas turbine engines [1,6]. Elevated EGT can result in significant faults and diminish the engine’s longevity. It is imperative to monitor the EGT during take-off and strive to keep it at a minimum, since the EGT reaches its highest point during take-off and exceeding the usual EGT limits can lead to engine component failure. From a structural standpoint, accurately predicting the EGT has several benefits, including improved reliability, availability, and engine life extension, as well as reduced operation and maintenance costs [7]. In aeronautic applications, turbine engines provide the necessary thrust or power throughout different flight phases by increasing or decreasing the velocity of the air passing through the engine. In various take-off situations, a balance must be struck between generating sufficient thrust and keeping the EGT relatively low. Evaluating the EGT level is therefore critical for assessing both the structural and operational aspects mentioned above.
For predicting the EGT of an aero-engine based on real flight data, current prediction methods are broadly classified into two categories: the model-based method and the data-driven method. The model-based approach relies on precise physical models of the system, combined with mathematical and physical models that describe the dynamic performance of the aero-engine. Its applicability is constrained by the need for accurate modeling of the dynamics of mechanical systems or components [8]. In practice, accurate modeling of intricate systems is infeasible, even for individuals with specialized knowledge of the domain. Data-driven approaches, in contrast, offer the benefit of being easier to implement because they do not rely on prior professional expertise. Hence, data-driven approaches are more common in modern industrial practice [9].
Data-driven approaches can be further categorized into three types: statistical, machine learning, and deep learning [10]. Statistical methods commonly used for industrial prediction problems include the autoregressive (AR) model, the autoregressive integrated moving average (ARIMA) model, random forest (RF), and Kalman filters (KF). Given the rapidly changing nature of EGT temporal data, traditional statistical methods, which are designed for linear stationary data, are not suitable for accurately predicting EGT [11]. With the rapid development of machine learning and deep learning techniques, together with advances in sensor technology and real-time databases, the data-driven prediction of engine state parameters has attracted wide attention from academia and industry [12]. In their study on EGT prediction, Wang et al. [13] established basic frameworks and employed several common machine learning methods, including the Generalized Regression Neural Network (GRNN) [14], the Radial Basis Function (RBF) network [15], Support Vector Regression (SVR) [16], and Random Forest (RF) [17]. An EGT prediction approach utilizing a long short-term memory (LSTM) network was proposed by Ullah et al. [18], with the input features treated as a real-time series. Other efforts are built around the NARX model, a kind of recurrent neural network that can capture the intricate dynamics of complex systems such as gas turbines and can be integrated with different types of neural networks. Asgari et al. [19] developed NARX models for a single-shaft gas turbine; their findings demonstrated the utility of NARX models in predicting the dynamic response of gas turbines. Pham et al. [20] proposed an enhancement of the hybrid NARX and autoregressive moving average (ARMA) model for the long-term prediction of machine state using vibration data. Ma et al. [21] used a feature attention mechanism-enhanced LSTM (FAE-LSTM) to build a NARX model, which uses EGT-correlated condition features and gas path measurement parameters to identify the aircraft engine; a moving average (MA) model based on a basic LSTM is then used to model the difference between the observed EGT and the EGT predicted by the NARX model. The common goal of these studies is to create a real-time dynamic prediction model that can accurately capture the changing response of the EGT under different operational states and challenging working conditions.
Recently, the self-attention-based Transformer model has been widely used in areas including natural language processing (NLP), computer vision, and time series prediction. Self-attention, also known as intra-attention, may be seen as an attentional process [22]. The self-attention-based Transformer, as proposed in [23], is mostly utilized in natural language processing. In aero-engine applications, Transformer-based models have primarily been used for the important task of predicting the remaining useful life (RUL). Zhang et al. integrated a BiGRU encoder and a Transformer decoder to develop a network structure for predicting the RUL of turbofan engines [10]. Liu et al. [24] developed a double attention network that incorporated a multi-head attention module and a 2-D CNN-based channel attention enhancement module, with the primary objective of enhancing the accuracy of RUL predictions in four different working scenarios.
Despite the significant progress made in relevant research, some outstanding obstacles still need further exploration. Unlike the forecast of RUL, which mainly considers the evaluation of the whole lifespan, the prediction of EGT requires a more detailed technique that specifically exploits the time-based features present in actual flight data. Currently, prevalent methods for predicting EGT encompass both physical models and machine learning predictions. Physical models typically involve cylinder combustion models, heat transfer models, and exhaust models. However, the intricate phenomena occurring in the later stages of combustion, such as boundary layer effects, uneven fuel distribution, and heat conduction, pose challenges for traditional physical models in accurately forecasting EGT. Alternatively, by adopting data-driven prediction models and employing machine learning algorithms to simulate combustion, heat transfer, and exhaust cooling processes, the complexity associated with EGT prediction can be mitigated; such an approach has the potential to assist or even replace traditional physical prediction models. To efficiently track the decline in system performance and accurately capture the important time-related characteristics, we propose a Transformer-based model. To address these issues, we present a novel approach, the Enhanced Scale-Aware Efficient Transformer (ESAE-Transformer): a Transformer model with Multi-Head ProbSparse Self-Attention (MHPSA) and a Multi-Scale Feature Aggregation Module (MSFAM). The model is a comprehensive encoder–decoder framework. Both the encoder and the decoder are upgraded with MHPSA to effectively capture important operational variations and environmental changes while reducing computational complexity. Simultaneously, the MSFAM is used to enhance the high-dimensional encoded feature space, thereby expanding the range of information that can be captured. This design is expected to greatly improve the precision and effectiveness of EGT forecasts. The main contributions of this work can be summarized as follows:
We propose a specialized model for predicting EGT, built upon the Transformer architecture. To our knowledge, this is the first effort to tailor the Transformer design specifically to EGT prediction in the context of aero-engines.
The encoder and decoder leverage an MHPSA mechanism, strategically designed to reduce temporal complexity and optimize memory utilization. This approach introduces selective attention, empowering the model to concentrate on the most informative segments, which both diminishes noise and prioritizes critical temporal dynamics (a generic sketch of the underlying ProbSparse mechanism is given below, after the contributions).
The implementation of an MSFAM is purposefully crafted to delve into temporal features within a profoundly nonlinear dimensional space. Its primary function is to broaden the receptive field of the prediction model, thereby enhancing the model’s proficiency in effectively processing and amalgamating implicit information across extended sequences or time periods. This strategic design significantly improves the model’s capacity to capture and leverage nuanced temporal dynamics for more robust predictions.
To assess the suggested approach, we conducted evaluations on two fronts. Firstly, we compared the root mean square error and mean absolute error of the predicted results against the actual results, varying the dimension of the hidden layer while adjusting the length of the time series input to the model. Optimal performance was observed when the input length was 2 s and the model dimension was set to 128. Secondly, across different input lengths, we compared our proposed model with contemporary time series prediction models such as the ANN, LSTM, GRU, and Transformer. The experimental findings revealed that our proposed model outperforms currently popular time series prediction models under the same evaluation criteria.
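Before outlining the paper’s structure, we briefly illustrate the attention mechanism underlying the first two contributions. The MHPSA builds on the ProbSparse self-attention idea popularized by the Informer model; the following is a minimal NumPy sketch of that generic mechanism under stated assumptions (random key sampling and a non-causal mean fallback for inactive queries), not the authors’ exact MHPSA implementation.

```python
import numpy as np

def probsparse_self_attention(Q, K, V, factor=5):
    """Minimal sketch of ProbSparse self-attention (Informer-style).
    Only the top-u "active" queries, ranked by a max-mean sparsity score
    estimated on a random subset of keys, receive full attention; the
    remaining queries fall back to the mean of V. Illustrative only."""
    L_Q, d = Q.shape
    L_K = K.shape[0]
    u = min(L_Q, int(np.ceil(factor * np.log(L_Q))))          # active queries
    n_sample = min(L_K, int(np.ceil(factor * np.log(L_K))))   # sampled keys

    # 1. Score each query on a random key subset: M(q) = max(q.k) - mean(q.k)
    idx = np.random.choice(L_K, n_sample, replace=False)
    scores_sample = Q @ K[idx].T / np.sqrt(d)                 # (L_Q, n_sample)
    sparsity = scores_sample.max(axis=1) - scores_sample.mean(axis=1)

    # 2. Full scaled dot-product attention only for the top-u queries
    top = np.argsort(-sparsity)[:u]
    scores = Q[top] @ K.T / np.sqrt(d)                        # (u, L_K)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    # 3. Lazy queries take the mean of V; active queries take attended values
    out = np.tile(V.mean(axis=0), (L_Q, 1))
    out[top] = weights @ V
    return out
```

Compared with full self-attention, only on the order of L log L query–key dot products are needed, which is the source of the complexity reduction attributed to MHPSA above.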
The remainder of the paper is organized as follows: the core concept of the vanilla Transformer is presented in Section 2; the proposed method is described in depth in Section 3; the dataset description and experimental parameters are provided in Section 4; Section 5 presents the outcomes of the experiments and provides a thorough analysis; and Section 6 summarizes the entire paper.
2. The Fundamental Principle of the Transformer
The Transformer model, initially introduced in the influential paper “Attention Is All You Need” by Vaswani et al. in 2017, has significantly influenced the field of natural language processing (NLP) [23]. The overall structure of the Transformer is shown in Figure 1. The model utilizes a mechanism known as self-attention or scaled dot-product attention, thereby circumventing the recurrent layers commonly found in prior sequence-to-sequence models. This design decision offers advantages in terms of parallelization and reduced training times.
The Transformer model is characterized by an encoder–decoder structure. The decoder shares important components with the encoder, including positional data embedding, Multi-Head Self-Attention, and the Point-Wise Feed-Forward Network. In both the encoder and decoder, each sub-layer is equipped with a residual connection followed by layer normalization. This design facilitates the flow of gradients during training, making it easier to train deep networks.
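In equation form, every sub-layer in both stacks is wrapped as follows (this is the standard formulation from the original Transformer, restated here for clarity):

$$\mathrm{output} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)$$

where $\mathrm{Sublayer}(\cdot)$ denotes either the multi-head self-attention or the point-wise feed-forward network described below.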
2.1. Position Data Embedding
Positional encoding is a technique that imparts the model with knowledge regarding the relative or absolute position of the tokens inside the sequence. The positional encodings and embeddings possess equivalent dimensions, denoted as $d_{model}$, enabling their summation. This implies that the embedding of each token is modified by the addition of a vector that signifies the token’s positional information inside the sequence. The positional encodings use sine and cosine functions of different frequencies.

For each position $pos$ and each dimension $i$ of the token embedding, the positional embedding $PE$ is defined as (1):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{\,2i/d_{model}}}\right) \tag{1}$$

where $PE_{(pos,\,2i)}$ and $PE_{(pos,\,2i+1)}$ denote the positional embedding for a given position $pos$ and dimension $i$ in the embedding of the model. The sine function is used for the even indices $2i$, while the odd indices $2i+1$ use the cosine function. The $10000^{\,2i/d_{model}}$ term provides a scaling that allows the model to learn to attend by relative positions more easily.
By adding positional encoding to the input embeddings, the Transformer becomes capable of considering the order of the sequence, which is critical for understanding language and other sequence-based data.
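To make Equation (1) concrete, the following is a minimal NumPy sketch of sinusoidal positional encoding; the function name and array shapes are illustrative, not taken from the paper’s implementation.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                    # (1, d_model)
    # 10000^(2i/d_model), where i indexes sine/cosine pairs
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                      # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                 # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                 # odd indices: cosine
    return pe

# Example: add positional information to token embeddings of shape
# (batch, seq_len, d_model) by broadcasting the same PE over the batch:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```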
2.2. Multi-Head Self-Attention
The incorporation of Multi-Head Self-Attention (MHSA) in Transformer models enables the model to collectively attend to input originating from distinct representation subspaces at varying places. This methodology improves the ability to concentrate on various segments of the input sequence and obtain a more thorough comprehension of the connections within the data.
Operating the attention mechanism in parallel multiple times, each time employing a distinct learned linear projection of the queries, keys, and values, is the underlying concept of multi-head attention. This functionality enables the model to discern various forms of relationships within the data since every “head” can concentrate on distinct characteristics and facets of the input sequence. The MHSA is calculated in the following steps.
Linear Projections. The queries ($Q$), keys ($K$), and values ($V$) are linearly projected multiple times with different, learnable weight matrices, which can be presented as (2):

$$Q_h = Q W_h^{Q}, \qquad K_h = K W_h^{K}, \qquad V_h = V W_h^{V} \tag{2}$$

where $W_h^{Q}$, $W_h^{K}$, and $W_h^{V}$ are the weight matrices for the $h$-th head’s linear transformations of $Q$, $K$, and $V$.

Scaled Dot-Product Attention. Each head computes attention on its respective projections using a scaled dot-product attention mechanism. This involves calculating the dot products of the queries with all keys, dividing each by $\sqrt{d_k}$, and applying a softmax function to obtain the weights on the values. The scaled dot-product attention can be expressed as (3):

$$\mathrm{head}_h = \mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) V_h \tag{3}$$

Concatenation and Final Linear Projection. The outputs from each head are concatenated and then linearly transformed into the expected dimensions to acquire the final output of the MHSA:

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^{O}$$

where Concat is the concatenation operation and $W^{O}$ denotes the weight matrix for the final linear transformation.
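As an illustration of the steps above (Equations (2) and (3) followed by the output projection), here is a compact NumPy sketch of multi-head self-attention; the interfaces, weight shapes, and head count are illustrative assumptions rather than the paper’s configuration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    # Linear projections, then split into heads: (num_heads, seq_len, d_k)
    def split(M):
        return (X @ M).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    # Scaled dot-product attention per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ V                 # (heads, seq, d_k)
    # Concatenate heads and apply the output projection W^O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo
```

Here a single (d_model, d_model) projection is split into heads, which is the standard, equivalent way of applying the per-head matrices $W_h$ in Equation (2).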
2.3. Point-Wise Feed-Forward Network
In the context of a Transformer, the feed-forward network (FFN) is applied uniformly and independently to each position. This implies that every position in the output of the encoder or decoder, i.e., the representation of each word or token, passes through the same FFN; however, the FFN operates on each position independently. A typical configuration of an FFN consists of two linear transformations separated by a Rectified Linear Unit (ReLU) activation function. The first linear transformation maps the input onto a space with a higher number of dimensions (denoted as $d_{ff}$), whereas the subsequent linear transformation maps it back to the original lower-dimensional space of the model (denoted as $d_{model}$):

$$\mathrm{FFN}(\mathrm{MHSA}) = \max(0,\, \mathrm{MHSA}\, W_1 + b_1)\, W_2 + b_2$$

where $\mathrm{MHSA}$ is the output of the preceding MHSA block; $\max(0,\cdot)$ denotes the rectified linear activation function; and $W_1$, $b_1$, $W_2$, and $b_2$ are trainable parameters of the FFN.
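A minimal NumPy sketch of the position-wise FFN described above follows; the sizes in the commented example (d_model = 128, d_ff = 512) are illustrative assumptions.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    The same weights are applied independently at every position."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU in the expanded d_ff space
    return hidden @ W2 + b2                 # project back to d_model

# Example with illustrative sizes (d_model = 128, d_ff = 512):
# rng = np.random.default_rng(0)
# x = rng.standard_normal((32, 128))
# out = position_wise_ffn(x,
#                         rng.standard_normal((128, 512)) * 0.02, np.zeros(512),
#                         rng.standard_normal((512, 128)) * 0.02, np.zeros(128))
```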
5. Results and Discussion
In our rigorous evaluation of the proposed method for practical application, we conducted several experiments under diverse conditions. These experiments were specifically designed to analyze the impact of varying prediction lengths and input sequence settings on the method’s performance. The prediction length was fixed at 2 s, while the input sequence length was varied, including settings of 16 s, 8 s, 4 s, and 2 s. This comprehensive approach allowed for a detailed assessment of how different input sequence lengths influence the model’s predictive accuracy and efficiency. Such an analysis is critical in understanding the model’s adaptability and effectiveness across a range of temporal scales, providing valuable insights for its application in real-world scenarios.
5.1. The Impact of the Feature Dimension of the Model
As shown in Table 1 and Table 2, the encoder, the decoder, and the MSFAM share a common feature dimension, denoted as D. The significant role played by the feature dimension D in influencing the predictive outcomes is evident from Table 1: it sets the dimension of the feature space in the model, i.e., the model dimension parameter. A schematic of the model dimension is illustrated in Figure 7. This crucial aspect prompted an in-depth exploration, in which the proposed method was rigorously trained across a spectrum of feature dimensions. Such an investigation is pivotal in discerning the optimal feature dimension that maximizes the efficacy of the predictive model, thereby ensuring enhanced performance in practical applications. This study underscores the intricate relationship between the feature dimension and the model’s predictive accuracy, providing valuable insights for further optimization.
First, we introduce the experimental process reported in Table 4. The output of the model is a fixed time series of 2 s in length, while the input of the model is a time series whose length is set manually. As shown in Figure 8, the area marked by the two red dashed lines is the result that the model needs to predict; it displays the predicted value of the EGT over 16–18 s. To maximize the performance of the model, we conducted experiments from two aspects. The first is the input length of the model: we use the time series covering the 2 s, 4 s, 8 s, and 16 s preceding the predicted sequence as model inputs. The second is the feature dimension of the model, as discussed above.
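As a concrete illustration of this windowing scheme, the sketch below builds (input, target) pairs from a sampled flight-parameter series; the 1 Hz sampling rate, the assumption that EGT occupies the first column, and the function name are illustrative choices, not taken from the paper.

```python
import numpy as np

def make_windows(series: np.ndarray, input_len: int, pred_len: int, hz: int = 1):
    """Split a multivariate flight-parameter series (time, features) into
    (input, target) pairs: `input_len` seconds of history -> next `pred_len`
    seconds of EGT. Sampling rate and column layout are assumptions."""
    n_in, n_out = input_len * hz, pred_len * hz
    inputs, targets = [], []
    for start in range(len(series) - n_in - n_out + 1):
        inputs.append(series[start:start + n_in])                      # history window
        targets.append(series[start + n_in:start + n_in + n_out, 0])   # future EGT only
    return np.stack(inputs), np.stack(targets)

# e.g. a 16 s history predicting the following 2 s of EGT:
# X, y = make_windows(flight_data, input_len=16, pred_len=2)
```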
The observations drawn from Table 4 reveal a nuanced trend in the relationship between the model dimension and overall prediction accuracy: as the model dimension increases, prediction accuracy initially rises and subsequently declines. Notably, the model achieves its lowest error metrics when the dimension is set to 128. This phenomenon is consistently observed across the various time points.
Additionally, a horizontal comparison of the prediction performance across the different input time lengths shows that the highest overall prediction accuracy is attained when the input length is set to 2 s. Based on these empirical findings, the feature dimension is fixed at 128 for the subsequent method comparisons. This choice optimizes the model’s performance, ensuring that it delivers the highest prediction accuracy under these specific parameters.
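For reference, the two error metrics used throughout this evaluation (MAE and RMSE) follow their standard definitions; the sketch below is illustrative rather than the paper’s exact evaluation code.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error over all predicted EGT samples."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root Mean Square Error over all predicted EGT samples."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```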
Figure 9 presents a randomly selected set of prediction results, illustrating performance across varying input time lengths and feature dimensions. It is discernible that an input time length of 2 s consistently yields the most accurate predictions. Consequently, for a more granular analysis, we have delineated the 2 s input length prediction results in Figure 10, categorizing them into distinct periods of rise, decline, and fluctuation. This segmentation facilitates a deeper understanding of the model’s predictive behavior under dynamic conditions, highlighting its performance nuances in response to temporal variations.
5.2. Comparison with Other State-of-the-Art Methods
Our model exhibits a significant advantage in forecasting accuracy when compared to other contemporary models. To provide a comprehensive and rigorous comparison, we evaluated our proposed model against four state-of-the-art EGT prediction models, each representing a unique approach within the domain. These include the Artificial Neural Network (ANN), long short-term memory (LSTM), Gated Recurrent Unit (GRU), and Transformer models. This comparative analysis was conducted under varying dimensions to ensure a thorough assessment of each model’s capabilities. The decision to include these specific models stems from their widespread recognition and established efficacy in EGT forecasting. The ANN serves as a foundational model in neural network research, offering a baseline for comparison. The LSTM and GRU, both of which are variants of recurrent neural networks, are renowned for their ability to capture long-term dependencies in sequential data, a crucial aspect in accurate time series forecasting. The Transformer, known for its self-attention mechanism, represents the cutting edge in handling sequential data and offers a contrast to the recurrent architectures of LSTM and GRU. By comparing our proposed model with these diverse and well-regarded models, we aim to demonstrate its superior forecasting accuracy across various dimensions. This comparison not only underscores the strengths of our model but also contributes to a deeper understanding of its performance in the broader context of EGT prediction methodologies.
Table 5 effectively highlights the superior predictive performance of our proposed method, as evaluated by two key metrics. This enhancement is particularly striking when contrasted with the various baseline models. In such comparisons, our method demonstrates exceptional testing performance, especially in scenarios characterized by shorter prediction times.
Figure 11 presents a selection of predictions with varying input lengths, illustrating a consistent trend where shorter input sequences result in improved performance. Notably, with an input length of 2 s, the prediction’s Mean Absolute Error (MAE) shows a significant improvement of
, and the Root Mean Square Error (RMSE) registers a
enhancement when compared to the standard Transformer model. This superior performance extends beyond the Transformer model, also surpassing the results of ANN, LSTM, and GRU models. This clearly indicates the proficiency of the self-attention-based structure in effectively capturing temporal features.
Practically, a shorter input sequence length implies reduced computational load. For instance, with a 2 s input, the Mean Absolute Error prediction of
aligns suitably with real-world environmental conditions. Additionally,
Figure 12 delineates the rising, fluctuating, and decreasing phases of the EGT prediction. This performance can be attributed to the Transformer architecture’s inherent capability to capture short-term local semantic interactions, making it well suited for time series prediction tasks.
5.3. Ablation Study of the Proposed Method
To thoroughly evaluate the individual contributions of the various components within our proposed model, we embarked on an ablation study. This study was meticulously designed to dissect the impact of each component on the overall predictive efficacy of the model. To ensure a comprehensive analysis, the ablation study was conducted under a range of input sequence lengths, providing insights into how different components perform under varying temporal scales.
In this study, the Transformer model was adopted as the baseline. This choice is strategic, as the Transformer’s architecture, renowned for its self-attention mechanism, offers a robust foundation for comparison. By systematically removing or altering specific components of our proposed model and comparing the resultant performance against the baseline Transformer, we can isolate and understand the contribution of each individual component. The results are shown in Table 6.
Experiment No. 1: The Baseline Transformer Model. This initial experiment establishes a baseline by employing a standard Transformer model. It serves as a reference point for evaluating the enhancements achieved in subsequent experimental configurations.
Experiment No. 2: The Integration of MSFAM. This trial involves the incorporation of the Multi-Scale Feature Aggregation Module (MSFAM) into the Transformer framework. The primary aim is to investigate how MSFAM’s inclusion affects the model’s predictive capabilities. Notably, this integration leads to improved prediction accuracy, particularly with longer input sequences, when compared to the pure Transformer model.
Experiment No. 3: The Implementation of MHPSA. In this configuration, the Multi-Head ProbSparse Self-Attention (MHPSA) mechanism is integrated into both the encoder and decoder components of the model. The objective is to examine the overall impact of MHPSA on the model’s performance. The introduction of MHPSA is observed to enhance prediction accuracy consistently across all input sequence lengths, with a marked improvement for the 16 s input, yielding the best Mean Absolute Error (MAE) of .
Experiment No. 4: MHPSA with the Transformer Encoder. This experiment evaluates the effectiveness of a model configuration that combines a standard Transformer encoder with the MSFAM and an MHPSA decoder. The results indicate that a pure MHPSA decoder is particularly beneficial for longer input sequences.
Experiment No. 5: MHPSA with the Transformer Decoder. This setup is a reversal of the previous experiment, featuring an MHPSA encoder, the MSFAM, and a standard Transformer decoder. The focus is to assess the impact of incorporating MHPSA in the encoder while maintaining the traditional Transformer decoder. The findings suggest that this configuration is advantageous for relatively long input sequences.
Experiment No. 6: The MHPSA Encoder and Decoder with MSFAM. The final experiment combines MHPSA in both the encoder and decoder segments, along with the MSFAM. This setup aims to explore the synergistic effects of these components within a unified model. The results demonstrate exceptional performance in capturing temporal features, particularly achieving the best results with 4 s and 2 s input sequence lengths.
Each experiment in this series incrementally builds upon the previous one, allowing for a detailed analysis of how each modification contributes to the overall performance of the model. This systematic approach enables a nuanced understanding of the strengths and limitations of each component within the model’s architecture, guiding further refinements and optimizations for enhanced predictive accuracy in EGT prediction.
5.4. Discussion
In the context of this study, the ESAE-Transformer model was meticulously developed and evaluated for its efficacy in predicting EGT in commercial aircraft, utilizing Quick Access Recorder (QAR) data. The assessment of the model’s performance, quantified through the MAE and the Root Mean Square Error (RMSE), was central to our analysis. Our experimental approach was methodically designed to explore the optimal configuration of the model, with a particular focus on the dimensionality of the feature space and the variation in input sequence lengths, all while maintaining a fixed prediction length of 2 s. When juxtaposed with traditional predictive models such as the ANN, GRU, and LSTM, as well as the contemporary Transformer model, the ESAE-Transformer demonstrated markedly superior performance, achieving an MAE of . This outcome not only validates the robustness of our model in the realm of EGT prediction but also underscores the potential of advanced analytical methods in enhancing aeronautical applications. However, it is imperative to acknowledge the limitations encountered in this study. While the model shows promising results, there is discernible scope for improving its prediction accuracy. Moreover, the computational efficiency of the model, particularly in the context of real-time onboard application, requires further optimization. These aspects present avenues for future research, in which the focus will be on refining the model to achieve higher accuracy and computational efficiency, thereby making it more viable for real-time deployment in aircraft systems.