Article

CTDNets: A High-Precision Hybrid Deep Learning Model for Modulation Recognition with Early-Stage Layer Fusion

Zhiyuan Zhao, Yi Qu, Xin Zhou, Yiyong Zhu, Li Zhang, Jirui Lin and Haohui Jiang

1 College of Information and Communication, National University of Defense Technology, Wuhan 430010, China
2 Fundamentals Department, Air Force Engineering University, Xi’an 710051, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(23), 4641; https://doi.org/10.3390/electronics13234641
Submission received: 17 October 2024 / Revised: 19 November 2024 / Accepted: 20 November 2024 / Published: 25 November 2024

Abstract

To further enhance the recognition accuracy of automatic modulation recognition, improve communication efficiency, strengthen security, and optimize resource management, this paper designs a high-precision hybrid deep learning model featuring early-stage layer fusion. The model combines Convolutional Neural Networks (CNNs), Transformers, and Deep Neural Networks (DNNs) to strengthen feature extraction and thereby improve modulation recognition accuracy. Experiments on RadioML2016.10a and RadioML2018.01a show that this architecture effectively combines the advantages of different types of models, yielding more robust overall performance suited to complex automatic modulation recognition problems.

1. Introduction

With the rapid development of communication technology, the diversification of modulation schemes has made Automatic Modulation Recognition (AMR) a critical component in the field of signal processing [1,2]. AMR is a technique used to automatically detect and identify the modulation type of communication signals. Its significance lies in its ability to autonomously determine modulation schemes by analyzing the time-frequency characteristics of the signal’s I/Q (In-phase and Quadrature) components when encountering unknown signals.
The primary goal of AMR is to achieve accurate recognition of various modulation schemes, which has wide-ranging applications in civilian domains, such as spectrum management and interference identification, as well as in military fields including intelligence gathering, electronic warfare, and radio monitoring [3,4]. As an effective tool for understanding and adapting to complex electromagnetic environments, the development of AMR technology is key to achieving intelligent spectrum management and utilization.
With the continuous evolution of wireless communication technologies, AMR has become a focal point of research in wireless communication. In recent years, as signal complexity and diversity have grown, traditional signal processing methods, such as likelihood-based methods (LB methods) [5] and feature-based methods (FB methods) [6,7], face significant challenges in terms of accuracy and adaptability. These traditional approaches often rely on manually designed feature extraction and model assumptions, making it difficult to fully capture the complex characteristics of signals, and consequently, their performance in practical applications is suboptimal. Therefore, achieving efficient and high-precision automatic modulation recognition has become a critical research topic in the current landscape.

1.1. Related Work

In recent years, the development of deep learning technology has provided new ideas and methods for automatic modulation recognition (AMR) [8,9,10,11]. Through the powerful fitting ability of neural networks, deep learning-based methods (DLB methods) can more efficiently learn the complex mapping relationships between signal features and modulation types, thereby achieving automatic modulation recognition. DLB methods do not require manually designed feature extractors and can automatically learn more robust features from large-scale data. Compared to FB methods, DLB methods show higher recognition accuracy and stronger robustness when dealing with complex signal environments. In addition, DLB methods can gradually improve their ability to distinguish different modulation types during network training. When data are sufficient and computational resources permit, DLB methods are significantly superior to LB and FB methods. Researchers have therefore proposed various deep learning-based recognition methods on datasets such as RadioML2016.10a and RadioML2018.01a, continuously optimizing network structures and improving recognition accuracy by analyzing the impact of network parameters on recognition performance.
Reference [12] utilizes a two-dimensional convolutional neural network as a feature extractor for the raw IQ (in-phase and quadrature) signals, followed by a softmax classifier to categorize different modulation types. In [13], a residual neural network (ResNet)-based automatic modulation recognition method was proposed, which effectively improved the recognition accuracy of 24 modulation signals in complex open environments. This method significantly reduced network parameters, shortened training time, lowered hardware requirements, and demonstrated excellent performance in recognizing high-order modulation signals such as 128APSK, 128QAM, and 256QAM. However, the improvement in classification performance for modulated signals is limited if no modifications are made to the network structure. Reference [14] proposed an improved ResNet that automatically recognizes and classifies radio signals propagating in the air by taking the real and imaginary parts (I/Q) of the received signal as inputs to extract features. Reference [15] introduced a waveform-spectrum multimodal fusion (WSMF) method based on deep residual networks for automatic modulation recognition, which significantly enhanced classification performance by fusing multimodal information from the time and frequency domains of I/Q data. Particularly, the recognition performance for high-order digital modulation signals such as 256QAM and 1024QAM was markedly superior to traditional single-modality convolutional neural network (CNN)-based automatic modulation recognition (AMR) methods. Reference [16] proposed a CNN-LSTM network modulation recognition method that combines the strengths of convolutional neural networks and long short-term memory networks, leveraging the periodic characteristics of modulation signals to significantly enhance the recognition accuracy of signals in complex electromagnetic environments. Reference [17] proposed a deep learning method that integrates convolutional neural networks (CNN) and long short-term memory (LSTM) modules for automatic modulation recognition, which significantly improved classification performance by learning the time-correlated features of wireless signals. Under low signal-to-noise ratio (SNR) conditions, the accuracy rate increased by 0.1% to 5.7% compared to models without LSTM. However, due to its recursive structure, it requires a longer training time. Huang et al. utilized another type of recurrent neural network with gated recurrent units (GRU) [18] to classify received signals by exploring time-related characteristics.
The Transformer is a neural network model based on the attention mechanism, widely used in natural language processing (NLP) tasks, and has become a fundamental component of modern neural network architectures [19]. It consists primarily of two parts: an encoder and a decoder. The encoder maps the input sequence into hidden representations, while the decoder translates the hidden representations into output sequences. Each encoder and decoder are composed of multiple identical layers, with each layer containing two sub-modules: a multi-head self-attention mechanism and a feedforward neural network.
In the Transformer architecture, the self-attention mechanism is key to achieving efficient and high-performance sequence processing. Through the self-attention mechanism, the model can establish connections between different positions, capturing dependencies between any positions in the input sequence, and supporting parallel processing. As a result, the Transformer network can be applied to automatic modulation recognition (AMR) tasks to improve recognition accuracy [20,21,22].
In recent years, researchers have increasingly applied the Transformer to AMR, especially by introducing complex-valued neural networks (CVNNs) [23,24], which better handle the phase and amplitude information of modulated signals. This approach not only enhances recognition accuracy but also improves the model’s robustness to complex signals.
This study combines CNNs with Transformers, leveraging CNNs to effectively extract local features of signals, such as frequency and phase, which are critical for distinguishing different modulation schemes. Meanwhile, Transformers can learn the global dependencies of signals, capturing long-range correlations in the time dimension. The combination of the two allows for a more comprehensive characterization of the modulated signals.
Compared to traditional LB and FB methods, the aforementioned methods, although having a broader scope of application and higher precision, still have some drawbacks, including the following:
(1) The aforementioned methods have long-range dependency issues when dealing with long sequence data. CNNs primarily use local receptive fields to extract features, with each layer’s neurons connected only to a subset of neurons from the previous layer. This means that at each layer, neurons can only perceive features within their local range and cannot directly access long-distance information. LSTM and GRU, as variants of recurrent neural networks (RNNs), may encounter vanishing or exploding gradients when capturing long-range information in signals. As a result, these algorithms have an insufficient ability to integrate global information and perform poorly when processing long-range information.
(2) There is a lack of a global information integration mechanism. Because the receptive field of a CNN is constrained by the size of its convolutional kernels, it can only extract local features of the signal. Meanwhile, LSTM and GRU models update and transmit information step by step when processing sequences, which prevents them from integrating and extracting global features.
(3) Although LSTM and GRU perform well on sequential data, their computations are serial: each time step must be computed in order, which leads to relatively slow training and higher training costs. This limits their widespread applicability in automatic modulation recognition (AMR).
(4) In AMR tasks, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) typically use a softmax classifier combined with a cross-entropy loss function to classify and recognize modulation signals. However, they tend to learn boundary features that clearly separate different categories in the feature space and are weaker at learning features with high similarity.

1.2. Motivations and Contributions

To address the above issues, this paper proposes a high-accuracy hybrid deep learning model with early layer fusion, named CTDNets. The approach introduces a Convolutional Block Attention Module (CBAM) into the traditional DLB model. By combining a CNN and a Transformer, our model not only efficiently extracts local signal features (e.g., frequency and phase) but also learns the global dependencies of the signal. Additionally, the integration of a Deep Neural Network (DNN) further enhances the model’s nonlinear fitting capability. The main contributions of this paper are as follows:
(1) Enhanced performance in automatic modulation recognition (AMR): by combining CNN, Transformer, and DNN, this study leverages the strengths of each model to improve AMR performance. Specifically, the CNN extracts local features, enhancing the model’s understanding of complex signals; the Transformer then uses its global feature modeling capability to effectively capture long-range dependencies in the signals; finally, the DNN processes these features further, performing deep nonlinear mappings to improve recognition accuracy. This combination not only enhances the model’s expressive power but also improves its ability to handle long-sequence signals, resulting in outstanding performance in AMR tasks.
(2) Reduction of feature redundancy through CBAM: traditional CNNs often overlook relationships between features, which can lead to feature redundancy, where the model learns repetitive or irrelevant features while neglecting more meaningful ones. By introducing the CBAM module, this paper dynamically adjusts feature weights, enabling the model to better focus on key features.
(3) Learning rate adjustment via cosine annealing: the study adopts a cosine annealing strategy to adjust the learning rate dynamically during training. A higher initial learning rate facilitates rapid convergence in the early stages, while a gradually decreasing learning rate allows fine-tuning of parameters as the model approaches the optimal solution, thereby avoiding overfitting. Additionally, the smooth variation of cosine annealing aids exploration of a broader solution space in the later training stages, enhancing the model’s generalization ability. This dynamic adjustment mechanism effectively improves both training efficiency and final performance.
The remainder of this paper is organized as follows: Section 2 details the structure and mathematical principles of the proposed algorithm; Section 3 introduces the datasets used and analyzes the simulation results; Section 4 concludes and outlines future work.

2. The Proposed CTDNets Algorithm Model

This paper proposes an innovative automatic modulation recognition method based on a hybrid deep learning model, CTDNets. The method is composed of three primary modules: first, a convolutional neural network (CNN) with attention mechanisms is employed as a feature extractor to capture local features from the raw IQ signals; second, a Transformer encoder processes the extracted feature sequences to capture long-range dependencies in the temporal sequence and learn a global representation of the signal; finally, a deep neural network (DNN) classifier maps the output of the Transformer encoder to the modulation category space, enabling accurate recognition of signal modulation types. This method demonstrates exceptional robustness and recognition performance in diverse signal environments.

2.1. Feature Extractor Based on Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a widely used deep learning model in the fields of image processing and signal analysis. Its primary advantage lies in efficiently extracting local features from input data through convolution and pooling operations, while also offering translational invariance and local receptive field capabilities. The CNN-based feature extractor proposed in this paper is specifically designed to capture local features of dual-channel IQ signals. Its architecture comprises two convolutional layers, two Batch Normalization (BN) layers, two Max Pooling layers, and a lightweight attention module known as the Convolutional Block Attention Module (CBAM), as depicted in Figure 1. This module is designed to enhance the model’s focus on significant regions of the signal, thereby improving the overall effectiveness and robustness of feature extraction.
First, the input IQ dual-channel signals are fed into a convolutional layer with input dimensions of 2 × 1024. To increase the receptive field of the convolutional neural network, we use dilated convolution instead of traditional convolution kernels: by inserting “holes” (i.e., zeros) between the elements of a standard kernel, the receptive field can be greatly enlarged without adding parameters. The input dimension of the first convolutional layer is 2, corresponding to the real and imaginary parts of the signal; the kernel size is 3, the padding is 2, and the number of output channels is 256. To enhance the model’s nonlinear expression capability, a Batch Normalization (BN) layer and a PReLU activation function are applied after the convolutional layer. The BN layer normalizes the feature maps output by the first CNN layer, accelerating network training and enhancing the model’s generalization ability, while the activation function strengthens the nonlinear expression capability. The formulas are as follows:
$$X_{\mathrm{conv}} = X * W + b,$$

$$X_{\mathrm{bn}} = \gamma \frac{X_{\mathrm{conv}} - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} + \beta,$$

$$\mathrm{PReLU}(x) = \begin{cases} x, & \text{if } x \ge 0 \\ \alpha x, & \text{otherwise.} \end{cases}$$

Here, $X$ represents the input IQ data, $W$ the convolution kernel, and $b$ the bias; $\mu_B$ and $\sigma_B^2$ denote the mean and variance of the mini-batch, respectively; $\gamma$ and $\beta$ are the learnable scaling and shifting parameters; and $\alpha$ is the learnable slope.
Subsequently, we use max pooling layers to downsample the feature maps: a fixed-size window slides across each feature map and outputs the maximum value within the window as the new feature value, reducing dimensionality and computational load. The max pooling layer used in this paper has a kernel size of 2 and a stride of 2, halving the feature length.
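As a concrete illustration, below is a minimal PyTorch sketch of this first block (dilated convolution, BN, PReLU, max pooling) under the layer sizes stated above; the module and variable names are our own, not from the paper.

```python
import torch
import torch.nn as nn

# First convolutional block: 2 input channels (I and Q), 256 output
# channels, kernel size 3. dilation=2 with padding=2 widens the receptive
# field without adding parameters and preserves the sequence length.
conv_block1 = nn.Sequential(
    nn.Conv1d(in_channels=2, out_channels=256, kernel_size=3,
              padding=2, dilation=2),
    nn.BatchNorm1d(256),  # normalizes X_conv as in the BN equation above
    nn.PReLU(),           # learnable-slope activation from the PReLU equation
    nn.MaxPool1d(kernel_size=2, stride=2),  # halves the temporal length
)

x = torch.randn(256, 2, 1024)  # (batch, I/Q channels, samples)
print(conv_block1(x).shape)    # torch.Size([256, 256, 512])
```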
To enhance the model’s feature extraction capabilities and optimize feature representation, thereby improving performance on the automatic modulation recognition task, this paper integrates a Convolutional Block Attention Module (CBAM) between the two convolutional layers. CBAM adaptively adjusts channel and spatial information within the feature maps: by introducing both channel attention and spatial attention, it enables the network to focus more effectively on critical features within the IQ data, improving feature representation quality and recognition performance. The formulation is as follows:
$$M_c = \sigma\big(W_1(\mathrm{AvgPool}(X)) + W_2(\mathrm{MaxPool}(X))\big),$$

$$M_s = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(X); \mathrm{MaxPool}(X)])\big).$$

Here, $M_c$ refers to the channel attention map and $M_s$ to the spatial attention map; $\sigma$ represents the Sigmoid activation function, and $W_1$ and $W_2$ denote the learnable weight matrices.
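A compact sketch of a CBAM-style module for 1-D feature maps, following the $M_c$ and $M_s$ equations above, might look as follows; the reduction ratio and spatial kernel size are illustrative assumptions rather than values given in the paper.

```python
import torch
import torch.nn as nn

class CBAM1d(nn.Module):
    """Channel + spatial attention for (batch, C, L) feature maps."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Shared MLP (the W1/W2 of the M_c equation) over pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(),
            nn.Conv1d(channels // reduction, channels, 1, bias=False),
        )
        # Convolution over concatenated avg/max maps (the M_s equation).
        self.spatial = nn.Conv1d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                      # x: (batch, C, L)
        avg = x.mean(dim=2, keepdim=True)      # AvgPool over length
        mx, _ = x.max(dim=2, keepdim=True)     # MaxPool over length
        m_c = self.sigmoid(self.mlp(avg) + self.mlp(mx))  # (batch, C, 1)
        x = x * m_c                            # apply channel attention
        avg_s = x.mean(dim=1, keepdim=True)    # channel-wise average
        max_s, _ = x.max(dim=1, keepdim=True)  # channel-wise max
        m_s = self.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return x * m_s                         # apply spatial attention
```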
Following the attention modules, an additional convolutional layer is connected. The second convolutional layer has a similar structure to the first one, designed to extract higher-level features. It has 256 input channels, a kernel size of 3, padding of 2, and produces 128 output channels. Like the first convolutional layer, the second layer is also followed by a batch normalization layer and a PReLU activation function. By stacking two CNN layers in this manner, the approach not only prevents issues like gradient vanishing and exploding, which can occur in very deep neural networks, but also allows for the progressive extraction of more abstract and robust feature representations.
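Assuming the conv_block1 and CBAM1d sketches above, the full extractor could be assembled as follows; the printed shape matches the (batch, channels = 128, length = 256) features expected by the Transformer stage in Section 2.2.

```python
# Second convolutional block: 256 -> 128 channels, same dilated-kernel
# settings, again followed by BN, PReLU, and 2x max pooling.
conv_block2 = nn.Sequential(
    nn.Conv1d(256, 128, kernel_size=3, padding=2, dilation=2),
    nn.BatchNorm1d(128),
    nn.PReLU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
)

extractor = nn.Sequential(conv_block1, CBAM1d(256), conv_block2)
features = extractor(x)
print(features.shape)  # torch.Size([256, 128, 256])
```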

2.2. Transformer Encoder

The Transformer is a sequence modeling approach based on the self-attention mechanism; it was initially applied in natural language processing and has since been widely used in computer vision and signal processing. Unlike recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), the Transformer can perform parallel computations, offering faster training, and is capable of capturing long-range dependencies within sequences. In the automatic modulation recognition method proposed in this paper, we utilize the Transformer to learn the global dependency relationships of the signals. The Transformer is composed of multiple encoder layers, each of which includes a multi-head self-attention mechanism, a feedforward neural network, and two residual connections with Layer Normalization. The multi-head self-attention mechanism computes correlations between sequence elements across different subspaces, capturing global dependency relationships within the sequence; the feedforward neural network enhances the model’s nonlinear expression capability; and the residual connections and Layer Normalization help prevent gradient vanishing and exploding, improving the stability of the Transformer encoder. The structure of the Transformer encoder is illustrated in Figure 2.
Before inputting the local features extracted by the convolutional neural network into the Transformer, we need to perform a dimension transformation on the feature maps. The dimension of the feature maps is changed from (batch size, channels, length) to (length, batch size, channels), where length = 256 is the number of time steps of the signal, batch size = 256 is the batch size, and channels = 128 is the number of feature-map channels. This allows the time steps to serve as the Transformer’s input sequence, so the Transformer can learn the signal’s global dependencies along the time dimension. This paper uses four stacked encoder layers, each with 8 attention heads. For the input features $X$, the self-attention formulas are as follows:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V,$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V.$$

In these formulas, $Q$, $K$, and $V$ represent the queries, keys, and values of the Transformer, respectively, which are derived through linear transformations from the output of the CNN-based feature extractor; $W^Q$, $W^K$, $W^V$ are the corresponding parameter matrices, and $d_k$ is the key dimension. The multi-head self-attention mechanism performs this process eight times in parallel, and the resulting eight attention matrices are concatenated and linearly transformed:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_8)\,W^O,$$

where $\mathrm{head}_h = \mathrm{Attention}(QW_h^Q, KW_h^K, VW_h^V)$ and $W^O$ is the output projection matrix.
After that, it goes through a residual connection and layer normalization, and is then input into a feedforward neural network. In the feedforward neural network, we use two fully connected layers with ReLU as the activation function in between. To enhance the model’s generalization capability and prevent overfitting, we add dropout regularization between each sub-layer, with a dropout rate set to 0.1.
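The encoder stage can be sketched with PyTorch’s built-in modules as below, given the (batch, 128, 256) feature maps from the extractor sketch above; the feed-forward width is our assumption, as the paper does not state it.

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=128,          # channel count of the CNN feature maps
    nhead=8,              # 8 attention heads per layer
    dim_feedforward=512,  # assumed FFN width (not specified in the paper)
    dropout=0.1,          # dropout between sub-layers, as stated above
    activation="relu",
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)  # 4 stacked layers

seq = features.permute(2, 0, 1)  # (length=256, batch, channels=128)
encoded = encoder(seq)           # same shape, global dependencies mixed in
```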

2.3. Classifier Based on Deep Neural Networks

Finally, a DNN receives the output features from the Transformer encoder and maps them to different modulation categories. To classify the modulated signals, we need to convert the output of the Transformer encoder into class probabilities. First, we perform average pooling on the output of the Transformer encoder in the time dimension, obtaining a feature vector of size (batch size, channels). This feature vector contains the global features of the input signal and can effectively represent the overall information of the signal. It is then input into a fully connected neural network and the final classification results are output through linear transformation. The input dimension of the fully connected layer is channels, and the output dimension is 24, corresponding to the number of modulation signal categories. Through the fully connected layer, we can obtain the probability distribution of each category. In the task of automatic modulation recognition, the classification result indicates the modulation method of the input signal. By training the model, the fully connected layer is able to accurately predict the modulation method of the signal based on the input feature vector.
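A minimal reading of this head, continuing the sketches above, is average pooling over time followed by a single fully connected layer; a deeper multi-layer head would follow the same pattern.

```python
classifier = nn.Linear(128, 24)   # channels -> number of modulation classes

pooled = encoded.mean(dim=0)      # average pooling over time: (batch, 128)
logits = classifier(pooled)       # (batch, 24) class scores
probs = logits.softmax(dim=1)     # per-class probability distribution
```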
During the training process, we use the cross-entropy loss function to measure the difference between the predicted probabilities and the true labels. Denoting the model’s output by $\hat{y}$ and the true label by $y$, the loss function is as follows:

$$\mathrm{Loss} = -\sum_i y_i \log(\hat{y}_i).$$
In order to enhance the model’s convergence performance, this paper adopts a learning rate adjustment strategy based on cosine annealing, with the update formula

$$\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right),$$

in which $\alpha_t$ is the learning rate at step $t$; $\alpha_{\max}$ and $\alpha_{\min}$ are the maximum and minimum learning rates, respectively, defining the range of variation; $t$ is the number of training cycles completed so far; and $T$ is the length of the restart period, meaning that within one period the learning rate traverses a complete cosine cycle.
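The loss and schedule map directly onto PyTorch’s built-ins, as in the training-step sketch below; the optimizer choice and the T_max/eta_min values are illustrative assumptions, not taken from the paper.

```python
import torch

params = (list(extractor.parameters()) + list(encoder.parameters())
          + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)          # alpha_max = 0.001
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-5)                 # T and alpha_min
criterion = torch.nn.CrossEntropyLoss()                # the Loss equation

def train_epoch(loader):
    for iq, labels in loader:        # iq: (B, 2, 1024), labels: (B,)
        optimizer.zero_grad()
        seq = extractor(iq).permute(2, 0, 1)
        logits = classifier(encoder(seq).mean(dim=0))
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                 # cosine update of alpha_t per epoch
```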

3. Experimental Results and Analysis

3.1. Datasets

The experiment uses the RadioML2016.10a and RadioML2018.01a datasets to simulate and verify the algorithm. The RadioML2016.10a dataset includes 11 common modulation types, with 1000 samples collected for each type at every 2 dB interval from −20 dB to 18 dB in the SNR range. Each sample contains 128 complex sampling points, representing the I and Q components of the signal. The RadioML2018.01a dataset includes 24 common modulation types, with 98,304 samples collected for each type at every 2 dB interval from −20 dB to 30 dB, and each sample contains 1024 sampling points. The detailed parameters of the datasets are shown in Table 1.
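As a hedged loading sketch: RadioML2016.10a is commonly distributed as a pickled dictionary keyed by (modulation, SNR) pairs, each holding an array of shape (1000, 2, 128); the file name below is an assumption about that common distribution, not something specified in the paper.

```python
import pickle
import numpy as np

with open("RML2016.10a_dict.pkl", "rb") as f:
    data = pickle.load(f, encoding="latin1")  # latin1 handles the Py2 pickle

mods = sorted({mod for mod, snr in data})     # 11 modulation names
snrs = sorted({snr for mod, snr in data})     # -20 ... 18 dB in 2 dB steps

X = np.vstack([data[(m, s)] for m in mods for s in snrs])
print(X.shape)  # (220000, 2, 128): samples x I/Q channels x time
```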

3.2. Simulation Results Analysis

This paper conducts simulation experiments on the RadioML2016.10a and RadioML2018.01a datasets. The simulations were performed on an Intel Core i7-14700KF CPU; to accelerate model training, an NVIDIA RTX 4060 GPU with 8 GB of memory was also used.
Figure 3 and Figure 4 illustrate the impact of different learning rates on the algorithm’s performance. Observing these figures reveals that the algorithm achieves optimal performance when the learning rate is set to 0.001. This is because an excessively large learning rate can lead to instability in model training, causing the loss function to oscillate near the optimal solution or even diverge, potentially resulting in gradient explosion. Such issues prevent the model from effectively learning, leading to a failure to converge and ultimately degrading recognition performance. Conversely, a learning rate that is too small slows down the model’s convergence, significantly increasing training time and potentially causing the model to stagnate near a local optimum, which results in suboptimal performance. In such cases, the model undergoes minimal updates, and the training loss changes only slightly, making it difficult to achieve the desired recognition accuracy.
To further evaluate the performance of the proposed algorithm, comparisons were conducted with the following three algorithms:
ClST [25]: this algorithm integrates the advantages of convolution and Transformer by incorporating convolutional down-sampling blocks and convolutional Transformer blocks. It employs knowledge distillation to transfer the generalization capabilities of a complex model to a smaller model. Using soft targets and novel loss functions, it trains the smaller model to be deployable on resource-constrained devices.
CLDNN [26]: by combining CNN, LSTM, and DNN, this algorithm is capable of handling sequential data while leveraging deep learning techniques to enhance the recognition of complex radar signals.
TSTR [27]: the structure of this algorithm comprises three main layers. The Input Preprocessing Layer (IPPL) preprocesses received signals through time-frequency and I/Q preprocessing to produce time-frequency diagrams and I/Q values as inputs. The Feature Capture Layer (FCL) reduces noise and rejects redundant features using multi-head self-attention and adaptive soft thresholding while extracting contextual information; multi-scale ghost convolution is also applied to capture the spatial features of the signals. The Classification Layer (CL) classifies the features extracted from the FCL to estimate the modulation scheme of the signals.
CTDNets without CBAM: this is an ablation version of the proposed CTDNets algorithm, where the CBAM module is removed to evaluate its impact on algorithm performance. This experiment aims to validate the contribution of the CBAM attention module to the overall performance.
Figure 5 illustrates the recognition accuracy comparison between the proposed CTDNets and three other modulation recognition algorithms on the RML2016.10a dataset. The horizontal axis represents the signal-to-noise ratio (SNR) in dB, while the vertical axis denotes recognition accuracy. From Figure 5, it can be observed that TSTR achieves the highest accuracy at an SNR of −10 dB compared to the other algorithms. This is because TSTR processes the received signals using time-frequency and I/Q preprocessing in the input preprocessing layer, generating time-frequency diagrams and I/Q values as inputs. With multi-head self-attention and adaptive soft-thresholding techniques, TSTR effectively reduces noise and minimizes redundant features, contributing to its superior performance at low SNRs. However, as the SNR increases, CLDNN outperforms other algorithms in the range of −8 dB to −4 dB. In contrast, CTDNets consistently exhibit the best performance in the SNR range from −4 dB to 10 dB. This superiority can be attributed to its innovative structure combining convolutional layers with a Transformer encoder, which significantly enhances its feature extraction capabilities. Overall, CTDNets demonstrate the most outstanding performance across the entire SNR range for the modulation recognition task. At an SNR of 0 dB, CTDNets improve recognition accuracy by 14.03% compared to ClST, 11.01% compared to CLDNN, and 2.33% compared to TSTR. Similarly, at an SNR of 10 dB, CTDNets achieve recognition accuracy improvements of 10.99% over ClST, 12.41% over CLDNN, and 1.45% over TSTR.
Figure 6 presents the recognition accuracy comparison between the proposed CTDNets and three other modulation recognition algorithms on the RML2018.01a dataset. From Figure 6, it can be observed that TSTR outperforms the other algorithms in the SNR range of −10 dB to −5 dB. This is due to TSTR’s input preprocessing layer, which applies time-frequency and I/Q preprocessing to the received signals, generating time-frequency diagrams and I/Q values as inputs. By leveraging multi-head self-attention and adaptive soft-thresholding techniques, TSTR effectively reduces noise and eliminates redundant features, resulting in superior recognition accuracy in low-SNR environments compared to CTDNets. However, in the SNR range of −5 dB to 15 dB, CTDNets achieve significantly higher recognition accuracy than the other algorithms. This improvement can be attributed to CTDNets’ innovative design, which combines the strengths of convolutional layers and a Transformer encoder, effectively extracting spatiotemporal features from signals. Additionally, the CBAM module enhances feature representation, further boosting recognition accuracy. Overall, CTDNets demonstrate outstanding performance under various SNR conditions, with particularly notable advantages when the SNR exceeds −5 dB. This makes CTDNets highly valuable for practical applications in complex channel environments. At an SNR of 0 dB, CTDNets improve recognition accuracy by 80.86% compared to ClST, 33.83% compared to CLDNN, and 2.96% compared to TSTR. At an SNR of 10 dB, CTDNets achieve recognition accuracy improvements of 22.59% over ClST, 9.12% over CLDNN, and 5.20% over TSTR.
By comparing the blue and black curves in Figure 5 and Figure 6, we can clearly see that the recognition accuracy of CTDNets without the CBAM module is significantly lower than that of the proposed algorithm. This is because the CBAM module enhances feature representation through an attention mechanism, allowing the model to focus more effectively on the most important parts of the input data. This feature enhancement helps the subsequent CNN and Transformer layers better capture key patterns in the data. Additionally, the CBAM module may have noise suppression capabilities, helping the model filter out irrelevant or harmful features and thereby increasing sensitivity to useful signals. Through multi-scale feature fusion, the CBAM module can integrate features at different scales, providing a more comprehensive understanding of the input data and enhancing the model’s nonlinear expression capability and robustness. It may also improve gradient flow, ensuring stability and efficiency during training and ultimately boosting the model’s generalization ability to maintain high performance on unseen data. These combined effects enable networks with the CBAM module to achieve higher recognition accuracy in automatic modulation recognition tasks.

4. Conclusions

To enhance the accuracy of automatic modulation recognition in wireless communication systems, optimize demodulation and signal processing workflows, minimize misclassification and decoding errors, and improve overall communication efficiency under varying channel conditions, this study proposes an automatic modulation recognition algorithm based on CTDNets. The algorithm is designed with a convolutional neural network (CNN)-based feature extractor, a Transformer-based encoder, and a deep neural network (DNN)-based classifier. By combining the CNN’s capability to capture local features with the Transformer’s ability to handle global information, the model captures signal characteristics comprehensively. Additionally, the CBAM module is introduced to enhance salient features and reduce the impact of irrelevant ones, thus improving model efficiency and accuracy. Experimental results demonstrate that this approach significantly enhances modulation recognition accuracy compared to methods such as ClST, CLDNN, and TSTR.

Author Contributions

Conceptualization, Z.Z. and X.Z.; Methodology, Z.Z. and Y.Q.; Software, Z.Z.; Validation, Z.Z., Y.Q. and J.L.; Formal analysis, Z.Z. and L.Z.; Investigation, Z.Z., X.Z., J.L. and H.J.; Resources, Y.Q., X.Z., Y.Z. and H.J.; Data curation, Y.Q. and L.Z.; Writing—original draft, Y.Q. and H.J.; Writing—review & editing, Y.Z., L.Z., J.L. and H.J.; Supervision, Z.Z.; Project administration, Z.Z.; Funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China (No. 52301392), and High-level Scientific and Technological Innovation Personnel Project (No. KJKT-RC-2201).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, F.; Luo, C.; Xu, J.; Luo, Y.; Zheng, F.-C. Deep learning based automatic modulation recognition: Models, datasets, and challenges. Digit. Signal Process. 2022, 129, 103650.
2. Xiao, W.; Luo, Z.; Hu, Q. A review of research on signal modulation recognition based on deep learning. Electronics 2022, 11, 2764.
3. Motwani, Y.; Saraswat, P.; Aggarwal, S.; Awari, R.M.; Bagubali, A. Analysis of various neural network architectures for automatic modulation techniques. In Proceedings of the 2021 Innovations in Power and Advanced Computing Technologies (i-PACT), Kuala Lumpur, Malaysia, 27–29 November 2021; pp. 1–7.
4. Rao, N.V.; Krishna, B.T. Automatic modulation recognition using machine learning techniques: A review. In Advances in VLSI, Signal Processing, Power Electronics, IoT, Communication and Embedded Systems: Select Proceedings of VSPICE 2020; Springer: Singapore, 2021; pp. 145–154.
5. Huang, S.; Yao, Y.; Wei, Z.; Feng, Z.; Zhang, P. Automatic modulation classification of overlapped sources using multiple cumulants. IEEE Trans. Veh. Technol. 2016, 66, 6089–6101.
6. Ghasemzadeh, P.; Banerjee, S.; Hempel, M.; Sharif, H. Performance evaluation of feature-based automatic modulation classification. In Proceedings of the 2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS), Cairns, QLD, Australia, 17–19 December 2018; pp. 1–5.
7. Al-Nuaimi, D.H.; Hashim, I.A.; Zainal Abidin, I.S.; Salman, L.B.; Mat Isa, N.A. Performance of feature-based techniques for automatic digital modulation recognition and classification—A review. Electronics 2019, 8, 1407.
8. Li, P.; Wang, L. Combined neural network based on deep learning for AMR. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; pp. 1244–1248.
9. Zhang, X.; Luo, Z.; Xiao, W.; Feng, L. Deep Learning-Based Modulation Recognition for MIMO Systems: Fundamental, Methods, Challenges. IEEE Access 2024, 12, 112558–112575.
10. Xu, T.; Ma, Y. Signal Automatic Modulation Classification and Recognition in View of Deep Learning. IEEE Access 2023, 11, 114623–114637.
11. Liu, F.; Zhang, Z.; Zhou, R. Automatic modulation recognition based on CNN and GRU. Tsinghua Sci. Technol. 2021, 27, 422–431.
12. O’Shea, T.J.; Corgan, J.; Clancy, T.C. Convolutional radio modulation recognition networks. In Proceedings of the Engineering Applications of Neural Networks: 17th International Conference, EANN 2016, Aberdeen, UK, 2–5 September 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 213–226.
13. O’Shea, T.J.; Roy, T.; Clancy, T.C. Over-the-air deep learning based radio signal classification. IEEE J. Sel. Top. Signal Process. 2018, 12, 168–179.
14. Tan, X.; Xie, Z.; Yuan, X.; Yang, G.; Han, Y. A Residual Neural Network for Modulation Recognition of 24 kinds of Signals. In Proceedings of the 2022 3rd International Conference on Computing, Networks and Internet of Things (CNIOT), Qingdao, China, 20–22 May 2022; pp. 140–145.
15. Yang, J.; Peng, Y.; Zhou, Y.; Liu, L.; Qi, Y. SNR-Aware automatic modulation recognition based on modified deep residual networks. In Proceedings of the 2022 IEEE 95th Vehicular Technology Conference (VTC2022-Spring), Helsinki, Finland, 19–22 June 2022; pp. 1–5.
16. Zhou, F.; Li, J.; Wang, Y. An improved CNN-LSTM network for modulation identification relying on periodic features of signal. IET Commun. 2023, 17, 2097–2106.
17. Zhou, Q.; Jing, X.; He, Y.; Cui, Y.; Kadoch, M.; Cheriet, M. LSTM-based automatic modulation classification. In Proceedings of the 2020 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Paris, France, 27–29 October 2020; pp. 1–4.
18. Huang, S.; Dai, R.; Huang, J.; Yao, Y.; Gao, Y.; Ning, F.; Feng, Z. Automatic modulation classification using gated recurrent residual network. IEEE Internet Things J. 2020, 7, 7795–7807.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
20. Hamidi-Rad, S.; Jain, S. MCformer: A transformer based deep neural network for automatic modulation classification. In Proceedings of the IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021.
21. Liang, Z.; Tao, M.; Xie, J.; Yang, X.; Wang, L. A Radio Signal Recognition Approach Based on Complex-Valued CNN and Self-Attention Mechanism. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 1358–1373.
22. Kong, W.; Yang, Q.; Jiao, X.; Niu, Y.; Ji, G. A Transformer-based CTDNN Structure for Automatic Modulation Recognition. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; pp. 159–163.
23. Lei, J.; Li, Y.; Yung, L.-Y.; Leng, Y.; Lin, Q.; Wu, Y.-C. Understanding Complex-Valued Transformer for Modulation Recognition. IEEE Wirel. Commun. Lett. 2024.
24. Li, W.; Deng, W.; Wang, K.; You, L.; Huang, Z. A Complex-Valued Transformer for Automatic Modulation Recognition. IEEE Internet Things J. 2024, 11, 22197–22207.
25. Hou, D.; Li, L.; Lin, W.; Liang, J.; Han, Z. ClST: A Convolutional Transformer Framework for Automatic Modulation Recognition by Knowledge Distillation. IEEE Trans. Wirel. Commun. 2024, 23, 8013–8028.
26. Xu, J.; Luo, C.; Parr, G.; Luo, Y. A Spatiotemporal Multi-Channel Learning Framework for Automatic Modulation Recognition. IEEE Wirel. Commun. Lett. 2020, 9, 1629–1632.
27. Li, J.; Jia, Q.; Cui, X.; Gulliver, T.A.; Jiang, B.; Li, S.; Yang, J. Automatic Modulation Recognition of Underwater Acoustic Signals Using a Two-Stream Transformer. IEEE Internet Things J. 2024, 11, 18839–18851.
Figure 1. Schematic diagram of the feature extractor network structure based on convolutional neural networks.
Figure 2. Schematic diagram of the Transformer-based encoder network structure.
Figure 3. The accuracy comparison chart of the proposed algorithm under different learning rates on the RadioML2016.10a dataset.
Figure 4. The accuracy comparison chart of the proposed algorithm under different learning rates on the RML2018.01a dataset.
Figure 5. Accuracy comparison between the proposed algorithm and other baseline algorithms on the RadioML2016.10a dataset.
Figure 6. Accuracy comparison between the proposed algorithm and other baseline algorithms on the RML2018.01a dataset.
Table 1. Dataset Statistics.
Feature | RadioML2016.10a | RadioML2018.01a
Number of modulation types | 11 | 24
Total number of samples | 220,000 | 2,359,296
Number of samples per modulation type | 20,000 | 98,304
SNR range | −20 dB to 18 dB | −20 dB to 30 dB
Number of signal sampling points | 128 | 1024
Modulation types | BPSK, QPSK, 8PSK, QAM16, QAM64, CPFSK, GFSK, PAM4, AM-DSB, AM-SSB, WBFM | OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK, 16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-SSB-WC, AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, OQPSK
