
Dual-Modality Transformer with Time Series Imaging for Robust Epileptic Seizure Prediction

1 Faculty of Science and Engineering, University of Liverpool, Liverpool L69 7ZX, UK
2 School of Mathematics and Physics, Xi’an Jiaotong-Liverpool University, Suzhou 215000, China
3 School of Management, University of Liverpool, Liverpool L69 7ZX, UK
4 School of Computer Science and Technology, East China Normal University, Shanghai 200050, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(3), 1538; https://doi.org/10.3390/app15031538
Submission received: 27 November 2024 / Revised: 24 January 2025 / Accepted: 28 January 2025 / Published: 3 February 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Automated EEG classification algorithms for seizures can facilitate the clinical diagnosis of epilepsy, enabling more expedient and precise classification. However, existing EEG signal preprocessing methods oriented towards artifact removal and signal enhancement have demonstrated suboptimal accuracy and robustness. In response to this challenge, we propose an Adaptive Dual-Modality Learning Model (ADML) for epileptic seizure prediction by combining time series imaging with Transformer-based architecture. Our approach effectively captures both temporal dependencies and spatial relationships in EEG signals through a specialized attention mechanism. Evaluated on the CHB-MIT and Bonn datasets, our method achieves 98.7% and 99.2% accuracy, respectively, significantly outperforming existing approaches. The model demonstrates strong generalization capability across datasets while maintaining computational efficiency. Cross-dataset validation confirms the robustness of our approach, with consistent performance above 96% accuracy. These results suggest that our dual-modality approach provides a reliable and practical solution for clinical epileptic seizure prediction.

1. Introduction

Epilepsy represents one of the most prevalent chronic neurological disorders globally, affecting over 50 million individuals worldwide [1,2,3]. This condition manifests through severe clinical symptoms, including limb convulsions, loss of consciousness, and syncope, profoundly impacting patients’ health and quality of life [1,4,5]. The unpredictable, recurrent, and sudden nature of these manifestations creates significant challenges for both patients and healthcare providers, underscoring the critical importance of accurate classification and diagnosis in clinical treatment.
Electroencephalography (EEG) has emerged as the cornerstone of epilepsy diagnosis, distinguished by its non-invasive nature, accessibility, and cost-effectiveness [6,7,8,9,10]. While traditional diagnostic approaches rely on expert clinicians analyzing extensive EEG datasets [7], this methodology often yields subjective and variable conclusions, creating a substantial workload for medical professionals. The advent of machine learning and deep learning technologies has revolutionized epilepsy classification, enabling automated EEG analysis systems that facilitate more efficient and precise diagnosis. However, the inherent complexity of EEG signals—characterized by non-stationarity and subtle feature variations [7,9,11,12]—necessitates sophisticated preprocessing techniques before applying deep learning models.
Current preprocessing approaches predominantly employ wavelet transforms and short-time Fourier transforms [12,13], though these methods present significant limitations. The efficacy of wavelet transformation critically depends on wavelet function selection, with no universal solution suitable for all signal types. This challenge becomes particularly acute when processing complex EEG signals, where wavelet function choice substantially impacts analytical outcomes. Furthermore, these transforms impose considerable computational overhead, generating large coefficient matrices that demand substantial storage capacity, especially for high-resolution data processing. The critical nature of scale selection in wavelet functions adds another layer of complexity, where suboptimal choices can result in the loss of crucial signal features [14,15,16]. Short-time Fourier transforms face similar challenges, requiring significant computational resources and careful window size selection while struggling to effectively capture rapidly changing or transient signal characteristics. These parameter-dependent preprocessing methods ultimately compromise algorithmic robustness and clinical practicality through their sensitivity to parameter selection and functional choices.
In addition to common preprocessing algorithms such as the wavelet transform and the short-time Fourier transform, many newer methods have been proposed for processing non-stationary signals. Empirical Wavelet Transform-based filtering has shown superior performance in preserving signal characteristics while effectively removing artifacts [17]; it adaptively constructs a wavelet filter based on the spectral content of the signal and provides a higher signal-to-noise ratio than conventional wavelet methods. Additionally, modern deep learning-based denoising approaches have emerged as powerful alternatives to traditional methods. Autoencoders with skip connections have shown remarkable capability in preserving important EEG features while removing noise [18], and attention-based denoising networks that focus on relevant signal components while suppressing artifacts have achieved strong performance in EEG signal enhancement [19]. However, these preprocessing methods still face challenges in parameter optimization and computational efficiency, particularly in real-time applications.
In response to these challenges, many researchers have attempted to improve epilepsy identification from EEG signals using alternative methods such as time series imaging, which transforms one-dimensional EEG signals into two-dimensional visual representations and thus provides a novel way to analyze brain activity patterns. This transformation allows us to leverage advanced image analysis techniques while preserving critical temporal information from the original EEG recordings [20,21,22]. Kucukler et al. imaged EEG signals with the Gramian Angular Field and then fed the image data into a hybrid Convolutional Neural Network (CNN)–Long Short-Term Memory (LSTM) model (CNN-LSTM) for feature extraction, achieving an accuracy of 100% in F8 single-channel emotion recognition [20]. Bore et al. developed DeepBraiNNet, an innovative approach that transforms EEG recordings into visual representations while preserving their key temporal characteristics; by incorporating LSTM, it specifically addresses motion-related artifacts in EEG signals and produces clear visualizations of brain activity patterns during motor imagery tasks [21]. When applied to real motor imagery datasets, this method produced clear, sparse Motor Imagery (MI)-related activation patterns. Li et al. converted time series into recurrence plots and then extracted local features from them with computer vision algorithms; the extracted features were used for forecast model averaging [23]. In summary, various time-series imaging techniques have been employed to encode one-dimensional time series into two-dimensional images [20,21,22]. However, these techniques often fail to retain the temporal dynamic properties of EEG signals, which are crucial for distinguishing between different types of epilepsy. In this context, the Markov Transition Field (MTF) transformation is well suited, as it preserves the statistical and temporal dynamic properties of one-dimensional data and has been successfully applied to create two-dimensional representations of one-dimensional EEG signals. Moreover, the MTF's inverse nature (the original time series can be reconstructed from its representation) allows for effective pattern exploration in the transformed space [24]. Therefore, MTF is used here to generate the two-dimensional input images for classifying both individual and latent features in EEG signals associated with different seizure types.
The Markov Transition Field (MTF) represents a significant advancement in time series analysis, offering a sophisticated two-dimensional representation of one-dimensional temporal data [24,25,26]. Beyond simple visualization, MTF captures complex transition probabilities between discrete states while preserving crucial temporal dynamics and statistical properties of the original signal. Recent applications have demonstrated its effectiveness in epilepsy classification. Notably, Shankar et al. achieved 91% accuracy in a challenging six-class epilepsy classification task by combining MTF-based signal representation with CNN architecture [27]. Similarly, Li et al. extended this approach to sEMG signals, developing a Deep Neural Network (DNN) framework that achieved 91.02% accuracy in emotion classification [28]. These successful implementations validate MTF’s capability to capture and preserve essential signal characteristics across different biomedical applications.
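To make the MTF construction concrete, the following minimal NumPy sketch encodes a one-dimensional series as a transition-probability image. It assumes quantile binning with an illustrative bin count and test signal; the paper's exact MTF configuration may differ.

```python
import numpy as np

def markov_transition_field(x: np.ndarray, n_bins: int = 8) -> np.ndarray:
    """Encode a 1D series as an MTF image of shape (len(x), len(x))."""
    # 1. Discretize the signal into quantile bins (states 0 .. n_bins-1).
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    states = np.digitize(x, edges)

    # 2. Estimate the first-order Markov transition matrix
    #    W[i, j] = P(next state = j | current state = i) from adjacent samples.
    W = np.zeros((n_bins, n_bins))
    for s_now, s_next in zip(states[:-1], states[1:]):
        W[s_now, s_next] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)

    # 3. Spread the transition probabilities over every pair of time steps:
    #    M[a, b] is the probability of moving from the state at time a to the
    #    state at time b, which preserves temporal dynamics in the 2D image.
    return W[states[:, None], states[None, :]]

signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)
mtf_image = markov_transition_field(signal)  # (256, 256) array of probabilities
```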
To address these challenges and leverage the complementary strengths of both temporal and spatial information, we propose an Adaptive Dual-Modal Learning (ADML) model incorporating a novel Synergistic Modal Integration (SYNI) mechanism. Our research advances the field of epileptic seizure prediction through three main contributions:
  • We introduce a dual-stream architecture that simultaneously analyzes both raw EEG signals and their visual representations. The raw signals preserve detailed temporal patterns, while the visual representations capture broader activity patterns.
  • We develop a novel integration mechanism (SYNI) that intelligently combines information from both streams, allowing our model to capture complex seizure patterns that might be missed when analyzing either stream alone.
  • Our approach demonstrates superior accuracy (98.7% on CHB-MIT and 99.2% on Bonn datasets) while maintaining computational efficiency, making it practical for clinical applications.
This study aims to develop a robust epileptic seizure prediction model by combining time series imaging with advanced deep learning architectures. We hypothesize that integrating temporal and spatial features through our proposed dual-modality approach will significantly improve prediction accuracy while maintaining computational efficiency.

2. Related Work

The commonly used time–frequency and frequency-domain analysis tools are mainly transforms: the Short-Time Fourier Transform (STFT) [29], the continuous wavelet transform [7], the Discrete Wavelet Transform (DWT) [30], the Hilbert–Huang transform [31], empirical mode decomposition [32], the Q-wavelet transform [33], and the Mean Amplitude Spectrum (MAS) [34].
Recent advancements in time-frequency analysis have introduced more sophisticated approaches that address the limitations of classical methods. Modern variants of wavelet transforms, such as the Synchrosqueezing Wavelet Transform (SST), have demonstrated superior time–frequency resolution and robustness to noise. The SST achieves this by reassigning the wavelet coefficients based on their instantaneous frequencies, resulting in sharper and more interpretable time–frequency representations [35]. Deep learning has also revolutionized time–frequency analysis through learned representations. Time–Frequency–Deep Neural Networks (TF-DNNs) [36,37] learn optimal time–frequency decompositions directly from data, outperforming fixed basis transformations like STFT and DWT. These networks adaptively adjust the frequency response according to the characteristics of the input signal, providing a significant improvement in feature discrimination compared to conventional methods.
Traditional machine learning algorithms require manual extraction of features from the original signal; the signals, together with the extracted features, are then fed into a model for recognition and classification. Typical traditional classifiers include the K-Nearest Neighbor (KNN) method [38], the Support Vector Machine (SVM) [38], the Artificial Neural Network (ANN) [39], and fuzzy classifiers [40]. Although traditional machine learning is comparatively mature, some unavoidable problems remain. When a task requires processing large amounts of data, traditional machine learning algorithms struggle with feature extraction. Likewise, when the raw data are transformed into higher-dimensional representations such as images, accurately recognizing and classifying them becomes a challenge for these algorithms.
Convolutional Neural Networks (CNNs) have demonstrated remarkable effectiveness in classifying EEG signal states for epilepsy detection and recognition. Unlike traditional machine learning approaches that require manual feature engineering, CNNs leverage localized receptive fields through convolution operations to automatically learn hierarchical representations directly from raw data. This automatic feature extraction capability preserves critical signal characteristics that might otherwise be lost in manual preprocessing. Through successive convolution and pooling operations, CNNs can extract multi-level features spanning low-level signal properties to high-level abstract patterns from EEG data [41]. The ability to capture deep, abstract feature representations enables CNNs to achieve superior classification accuracy compared to conventional machine learning methods, particularly for complex non-stationary signals like EEG, where manual feature extraction proves especially challenging.
Recurrent Neural Networks (RNNs), first formalized mathematically by Gelenbe [42], introduced the crucial concept of cyclic connections in neural architectures. This design enables each node’s output to depend not only on current inputs but also on previous hidden states, allowing the network to maintain temporal memory in sequential data processing. By propagating information across time steps, RNNs can model temporal dependencies in the data sequence. However, standard RNNs encounter gradient vanishing problems when modeling long-term dependencies. Long Short-Term Memory (LSTM) networks, originally proposed as a variant of RNNs [43], were specifically designed to address these limitations. LSTMs excel at extracting temporal patterns from EEG signals for epilepsy detection. To overcome the challenges of gradient instability with long input sequences, LSTMs implement a circular memory structure where hidden layers maintain connections to both adjacent nodes and themselves. This architecture enables LSTMs to effectively map historical input sequences to outputs, theoretically approximating arbitrary temporal sequence transformations.
Contemporary research increasingly favors hybrid architectures that combine the complementary strengths of different deep learning models. Notable examples include CNN-LSTM [44] and RNN-LSTM [45] architectures, which effectively address challenges such as limited temporal memory and gradient instability. These integrated approaches demonstrate enhanced feature extraction capabilities and superior classification performance. For processing time series data, several imaging transformation techniques have emerged as powerful tools, including recurrence plots [23], Gramian Angular Fields (both summation and difference variants, GASF/GADF) [46], and Markov transition fields [47].
Gao et al. converted EEG to images using time-series image conversion algorithms, the Recurrence Plot (RP) and the Gramian Angular Field (GAF) [48]. A Convolutional Neural Network (CNN) model based on VGGNet was then used to learn from the converted EEG images. The results show that combining GAF images with CNN models can effectively improve the objectivity and efficiency of diagnosing various mental disorders, including schizophrenia. Zhao et al. converted one-dimensional EEG signals into two-dimensional images using the Gramian Angular Difference Field (GADF) and, based on transfer learning, employed a Domain Adaptive Network with Channel Attention (CADAN), composed of a channel attention mechanism and a domain adaptive network, for cross-subject emotion recognition [49], achieving a breakthrough in emotion recognition accuracy. Shankar et al. used two different signal inputs (EEG signals and their instantaneous power) and generated images in two different ways, the Gramian Angular Summation Field (GASF) and the Gramian Angular Difference Field (GADF) [50]. An EEG dataset from the University of Bonn was used for experimental verification; the results show that the classification accuracy reaches a new high, validating the efficiency of the proposed method. Shankar et al.'s experiment applied time series imaging directly to epilepsy classification, demonstrating its excellent effect and revealing the usefulness of GAF within deep learning models for epilepsy classification. These studies provide a novel research perspective in which deep learning is combined with time series imaging: EEG signals are first imaged using the Gramian Angular Field and then classified by a deep learning model. They show that, without traditional processing methods such as the wavelet or Fourier transform, deep learning models can still accurately classify different types of data from the images alone, demonstrating the robustness of time series imaging for data processing. This inspired us to use the Markov transition field to image EEG data.
Prior works have made significant contributions to EEG-based epilepsy detection, yet our approach addresses several fundamental limitations in existing methods. Traditional preprocessing approaches like wavelet transforms and short-time Fourier transforms require careful parameter tuning and often lose critical temporal information. Current deep learning methods typically focus on either temporal or spatial features independently, missing the complex interplay between these modalities. Additionally, existing multi-modal approaches often employ simple concatenation or weighted averaging for feature fusion, which fails to capture the dynamic relationships between different feature representations.
Our work advances the state of the art in several key aspects. Unlike previous time-series imaging approaches that focus solely on spatial representation, our dual-modality model preserves both temporal dynamics from raw EEG and spatial patterns from MTF images, enabling more comprehensive feature capture. In contrast to existing Transformer-based methods that process single-modal data, our parallel temporal–spatial architecture with specialized attention mechanisms allows for more effective feature extraction from both domains simultaneously. While previous fusion strategies rely on static combination rules, our Synergistic Modal Integration (SYNI) module dynamically adapts the integration weights based on input characteristics, leading to more robust and context-aware feature fusion. Compared to existing knowledge distillation approaches in EEG analysis, our teacher–student model specifically addresses computational efficiency challenges while maintaining high accuracy, making it more suitable for real-world clinical applications. These innovations collectively address the key limitations of existing methods, particularly in terms of feature comprehensiveness, integration sophistication, and practical deployability.

3. Methodology

Our model, called Adaptive Dual-Modal Learning (ADML), processes EEG data through two complementary pathways. The first pathway analyzes the temporal patterns in raw EEG signals, while the second pathway examines visual patterns created from these signals using the Markov Transition Field technique. This dual-stream approach allows us to capture different aspects of seizure activity that might be missed by traditional single-stream methods. Figure 1 illustrates the architecture of our proposed model.
Our model consists of two main streams working in parallel: a temporal stream for processing raw EEG signals and a spatial stream for analyzing MTF images. The temporal stream employs a Transformer encoder with eight attention heads and three Transformer layers, each with a dimension of 128. A positional encoding layer is added to capture sequential information. For frequency characteristics, we use two 1D convolutional layers with kernel sizes of 3 and channel dimensions of 32 and 64, respectively. The spatial stream processes MTF images through a dual-path architecture. The CNN path utilizes a ResNet-18 backbone pretrained on ImageNet, with the final fully connected layer modified to output 128-dimensional features. The ViT path divides input images into 16 × 16 patches and processes them through six Transformer layers with eight attention heads. Both paths maintain the same output dimension of 128 to facilitate feature fusion.
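The following PyTorch sketch reconstructs the two streams from the stated hyperparameters (128-dimensional features, 8 attention heads, 3 temporal and 6 spatial Transformer layers, 16 × 16 patches, 1D convolutions with 32 and 64 channels). The wiring details, module names, and pooling choices are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TemporalStream(nn.Module):
    """Raw-EEG pathway: Transformer encoder plus a 1D-convolutional frequency path."""
    def __init__(self, d_model=128, n_heads=8, n_layers=3, seq_len=40):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                         # scalar sample -> d_model
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))  # learned positional encoding
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        self.freq = nn.Sequential(                                 # two 1D convs: 1 -> 32 -> 64
            nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, d_model))

    def forward(self, x):                                          # x: (batch, seq_len, 1)
        f_t = self.encoder(self.embed(x) + self.pos).mean(dim=1)   # temporal features
        f_f = self.freq(x.transpose(1, 2))                         # frequency features
        return f_t, f_f

class SpatialStream(nn.Module):
    """MTF-image pathway: ResNet-18 local path and a ViT-style global path."""
    def __init__(self, d_model=128):
        super().__init__()
        self.cnn = resnet18(weights="IMAGENET1K_V1")
        self.cnn.fc = nn.Linear(512, d_model)                      # replace classifier: 512 -> 128
        self.patch = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # 16x16 patch embedding
        enc = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.vit = nn.TransformerEncoder(enc, 6)

    def forward(self, img):                                        # img: (batch, 3, 224, 224)
        f_local = self.cnn(img)                                    # fine-grained CNN features
        tokens = self.patch(img).flatten(2).transpose(1, 2)        # (batch, 196, d_model)
        f_global = self.vit(tokens).mean(dim=1)                    # global ViT features
        return f_local, f_global
```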

3.1. Dual-Stream Feature Extraction

3.1.1. Time Series Stream

The time series stream consists of parallel temporal and frequency paths. For temporal features, we employ a Transformer encoder:
$$F_s^t = \mathrm{MSA}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d_k}}\right) V_s + \mathrm{FFN}(X_s)$$
where MSA denotes Multi-head Self-Attention, a key component that allows the model to attend to different positions of the input sequence simultaneously, and $X_s \in \mathbb{R}^{T_s \times d_s}$ is the input time series. The scaling factor $\sqrt{d_k}$ in the attention computation prevents the dot products from growing too large in magnitude, particularly for large values of $d_k$, which could push the softmax function into regions with extremely small gradients. Each attention head is calculated as:
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
The Query ($Q$), Key ($K$), and Value ($V$) matrices are learned transformations of the input sequence $X_s$; $W_i^Q$, $W_i^K$, and $W_i^V$ are learnable parameter matrices for the $i$-th attention head, and $W^O$ is the output projection matrix.
FFN refers to the Feed-Forward Network, a position-wise fully connected layer that processes each position independently:
$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\, W_2 + b_2$$
where $W_1$, $W_2$ are weight matrices and $b_1$, $b_2$ are bias vectors. The dimension of the internal layer (that of $W_1$) is typically set to four times the model dimension. This network allows the model to process the attended information and capture non-linear relationships.
The frequency characteristics are captured through a two-layer neural network with non-linear activation:
$$F_s^f = \sigma\!\left(W_2\,\delta(W_1 X_s + b_1) + b_2\right)$$
where $\sigma$ is the sigmoid activation function, $\delta$ represents the ReLU activation, and $W_1$, $W_2$ are learnable weight matrices that project the input into the frequency-sensitive feature space. The intermediate dimension is set to match the temporal feature dimension for balanced representation.
The temporal–frequency features are integrated via:
$$F_s = \gamma\, F_s^t + (1 - \gamma)\, F_s^f$$
where $\gamma$ is a dynamic integration coefficient that determines the relative importance of temporal and frequency features. This adaptive weighting mechanism allows the model to emphasize different aspects of the signal based on the input characteristics. The coefficient $\gamma$ is learned through:
$$\gamma = \sigma\!\left(W_\gamma [F_s^t; F_s^f] + b_\gamma\right)$$
Here, $[F_s^t; F_s^f]$ represents the concatenation of temporal and frequency features, allowing the model to consider both feature types when determining their relative importance. The sigmoid activation ensures $\gamma \in (0, 1)$, providing a natural way to balance the contribution of each feature type.
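A minimal sketch of this adaptive gate is shown below (the same construction is reused for $\eta$ in the image stream). It assumes a per-dimension gate over the paper's 128-dimensional features; a scalar gate is an equally plausible reading, and the module name is ours.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """gamma = sigmoid(W [f_t; f_f] + b); output = gamma*f_t + (1-gamma)*f_f."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)  # plays the role of W_gamma, b_gamma

    def forward(self, f_t: torch.Tensor, f_f: torch.Tensor) -> torch.Tensor:
        gamma = torch.sigmoid(self.gate(torch.cat([f_t, f_f], dim=-1)))
        return gamma * f_t + (1 - gamma) * f_f

fused = GatedFusion()(torch.randn(4, 128), torch.randn(4, 128))  # (4, 128)
```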

3.1.2. Image Stream

To effectively process the visual representations of EEG signals, we employ two complementary networks. The image stream processes MTF representations through complementary pathways:
$$F_i^l = \mathrm{ResNet}(X_i) = \sum_{l=1}^{L}\big(\mathcal{F}_l(X_i) + \mathcal{S}_l(X_i)\big)$$
where $\mathcal{F}_l$ and $\mathcal{S}_l$ denote the residual and shortcut mappings, respectively. Global features are extracted via:
$$F_i^g = \mathrm{ViT}(X_i) = \mathrm{MSA}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i + \mathrm{MLP}(\mathrm{PatchEmbed}(X_i))$$
where MLP represents a multi-layer perceptron network that processes the patch embeddings, and PatchEmbed denotes the operation that splits the input image into fixed-size patches and linearly projects them into the embedding space.
The multi-scale image features are fused through:
$$F_i = \eta\, F_i^l + (1 - \eta)\, F_i^g$$
where $\eta \in (0, 1)$ is computed analogously to $\gamma$, serving as a dynamic balancing factor between local and global image features. This combination of local and global feature extraction ensures that both fine-grained details and overall patterns in the MTF representations are captured.

3.2. Synergistic Modal Integration

The Synergistic Modal Integration (SYNI; see Figure 2) module implements bidirectional cross-attention between modalities. For each modality, queries are projected from its own features, while keys and values come from the other modality. The attention weights are computed with temperature scaling ($\tau = 2.0$) and dropout ($p = 0.1$). The final fusion layer uses gated integration with learned parameters to adaptively combine information from both modalities. This integration strategy serves three key purposes: (1) ensuring that important information from both modalities is preserved; (2) allowing dynamic adjustment of feature importance based on input characteristics; (3) creating a unified representation that captures complementary aspects of both modalities.
SYNI dynamically fuses features while preserving modality-specific information through three mechanisms.

3.2.1. Cross-Modal Feature Enhancement

Bidirectional attention maps are computed as:
$$A_{si} = \mathrm{softmax}\!\left(\frac{Q_s K_i^{\top}}{\sqrt{d_k}}\right)$$
$$A_{is} = \mathrm{softmax}\!\left(\frac{Q_i K_s^{\top}}{\sqrt{d_k}}\right)$$
Enhanced features are obtained through:
$$E_{si} = A_{si} V_i, \qquad E_{is} = A_{is} V_s$$
where $E_{si}$ and $E_{is}$ represent enhanced features, with $E_{si}$ being spatial features enhanced by temporal information and $E_{is}$ being temporal features enhanced by spatial information through cross-modal attention mechanisms.

3.2.2. Dynamic Feature Calibration

Integration weights are learned via:
$$g = \sigma\!\left(W_g [E_{si}; E_{is}] + b_g\right)$$
where $W_g$ and $b_g$ are the learnable weight matrix and bias term, respectively, for feature calibration.

3.2.3. Adaptive Feature Fusion

The final integrated representation is computed as:
$$F_{\mathrm{syni}} = g \odot E_{si} + (1 - g) \odot E_{is}$$
To capture higher-order interactions, we employ a hierarchical fusion strategy:
$$F_{\mathrm{final}} = \mathrm{MLP}\big(\phi([F_s; F_i; F_{\mathrm{syni}}])\big)$$
where $\odot$ denotes the Hadamard product (element-wise multiplication), $\phi$ represents a learnable non-linear transformation for feature space mapping, and $F_{\mathrm{final}}$ integrates information from the temporal features $F_s$, spatial features $F_i$, and cross-modal features $F_{\mathrm{syni}}$ through an MLP layer.
Through this sophisticated integration mechanism, our model effectively captures both modality-specific patterns and cross-modal dependencies, enabling more accurate and robust seizure prediction.
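The following PyTorch sketch reconstructs SYNI from the equations and the stated settings (temperature $\tau = 2.0$, dropout $p = 0.1$). The projection layout and module structure are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SYNI(nn.Module):
    """Bidirectional cross-attention with temperature scaling and gated fusion."""
    def __init__(self, d_model=128, tau=2.0, p_drop=0.1):
        super().__init__()
        self.tau = tau
        self.proj_s = nn.ModuleDict({k: nn.Linear(d_model, d_model) for k in "qkv"})
        self.proj_i = nn.ModuleDict({k: nn.Linear(d_model, d_model) for k in "qkv"})
        self.drop = nn.Dropout(p_drop)
        self.gate = nn.Linear(2 * d_model, d_model)  # W_g, b_g

    def attend(self, q, k, v):
        # A = softmax(Q K^T / (tau * sqrt(d_k))), with attention dropout
        scores = q @ k.transpose(-2, -1) / (self.tau * q.shape[-1] ** 0.5)
        return self.drop(F.softmax(scores, dim=-1)) @ v

    def forward(self, f_s, f_i):  # f_s, f_i: (batch, tokens, d_model)
        e_si = self.attend(self.proj_s["q"](f_s), self.proj_i["k"](f_i), self.proj_i["v"](f_i))
        e_is = self.attend(self.proj_i["q"](f_i), self.proj_s["k"](f_s), self.proj_s["v"](f_s))
        g = torch.sigmoid(self.gate(torch.cat([e_si, e_is], dim=-1)))  # calibration weights
        return g * e_si + (1 - g) * e_is                               # F_syni
```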

4. Experiments

4.1. Experimental Settings

4.1.1. Dataset Description

We conducted extensive evaluations on two widely used public EEG datasets: the Bonn dataset and the CHB-MIT dataset.
The Bonn dataset contains EEG recordings collected at the University of Bonn, consisting of five sets (labeled Z, O, N, F, and S) with 100 single-channel EEG segments each. Each segment has a duration of 23.6 s, sampled at 173.61 Hz, resulting in 4097 data points per segment. Sets Z and O contain surface EEG recordings from five healthy volunteers in awake states with eyes open and closed, respectively. Sets N and F include intracranial recordings from five patients during seizure-free intervals, with Set N recorded from the hippocampal formation and Set F from the epileptogenic zone. Set S consists of seizure activity recordings from all recording sites exhibiting ictal activity. The EEG signals were recorded using a 128-channel amplifier system with a common average reference.
The CHB-MIT dataset comprises long-term EEG recordings from 22 pediatric subjects with intractable seizures at Boston Children’s Hospital. The recordings include 23 cases with 664 continuous EEG recordings, each approximately one hour in duration, sampled at 256 Hz using the international 10–20 electrode placement system. The dataset contains 198 seizures with varied durations, marked by experienced epileptologists. We preprocessed the data using a sliding window approach with a window size of 40 samples and 50% overlap between consecutive windows.
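A minimal NumPy sketch of this segmentation (a window of 40 samples with 50% overlap, i.e., a stride of 20) is given below; the signal length and content are illustrative stand-ins.

```python
import numpy as np

def sliding_windows(signal: np.ndarray, win: int = 40, overlap: float = 0.5) -> np.ndarray:
    stride = int(win * (1 - overlap))               # 50% overlap -> stride of 20
    starts = range(0, len(signal) - win + 1, stride)
    return np.stack([signal[s:s + win] for s in starts])

eeg = np.random.randn(256 * 60)     # stand-in for one minute of 256 Hz EEG
segments = sliding_windows(eeg)     # (n_windows, 40)
```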
For both datasets, we implemented a rigorous data split strategy:
  • Training set: 80% of the data, used for model training and parameter optimization;
  • Validation set: 10% of the data, used for hyperparameter tuning and early stopping;
  • Test set: 10% of the data, used exclusively for final performance evaluation.
To ensure patient-independent evaluation, we performed stratified splitting at the patient level rather than the segment level, ensuring that segments from the same patient do not appear in different sets. This approach provides a more realistic assessment of the model’s generalization capabilities.
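One way to realize such a patient-level split is scikit-learn's GroupShuffleSplit, sketched below with synthetic data; the grouping variable and the two-stage 80/10/10 procedure are our reconstruction of the described strategy.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.randn(1000, 40)              # EEG segments (stand-in data)
y = np.random.randint(0, 2, size=1000)     # segment labels
patient = np.random.randint(0, 22, 1000)   # patient ID for each segment

# Hold out ~20% of patients, then split that group half-and-half into
# validation and test, so no patient crosses partition boundaries.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, rest_idx = next(outer.split(X, y, groups=patient))
inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
val_rel, test_rel = next(inner.split(X[rest_idx], y[rest_idx], groups=patient[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]
```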

4.1.2. Implementation Details

All experiments were implemented using the PyTorch 2.1.0 framework and conducted on an NVIDIA RTX 4090 GPU (Nvidia, Santa Clara, CA, USA). For the teacher model (dual-modality), we set the feature dimension d_model to 128 and used 8 attention heads in the Transformer encoder. The EEG data were processed using a sliding window of 40 time steps, and the images were resized to 224 × 224 pixels. We employed the Adam optimizer with an initial learning rate of 1 × 10−4 and a batch size of 32. The models were trained for 20 epochs with an early stopping patience of 5 epochs. For knowledge distillation, we set the temperature parameter to 2.0 and the distillation loss weight α to 0.5.
For the frequency analysis branch, we utilize two 1D convolutional layers with kernel sizes of 3 and stride of 1. The first layer expands the channel dimension from 1 to 32, while the second layer further increases it to 64. Each convolutional layer is followed by batch normalization and ReLU activation. The output is then adaptively pooled to match the Transformer branch's dimension. The image processing stream implements a modified ResNet-18 architecture pre-trained on ImageNet. We remove the final classification layer and add a custom projection layer that reduces the 512-dimensional features to 128 dimensions. The Vision Transformer path processes 16 × 16 pixel patches with a Transformer encoder containing six layers and eight attention heads. The embedding dimension is maintained at 128 throughout the network to facilitate feature fusion. The SYNI module utilizes scaled dot-product attention with a temperature parameter of 2.0. The fusion weights are learned through a two-layer neural network with hidden dimension 256 and ReLU activation. All attention computations employ a dropout rate of 0.1 during training. The final MLP classifier consists of two layers with dimensions [128, 64, 5], where 5 corresponds to the number of classes.
During training, we employ the AdamW optimizer with an initial learning rate of 1 × 10−4 and weight decay of 1 × 10−4. The learning rate is reduced by a factor of 0.5 when validation performance plateaus for 50 epochs. We use a batch size of 32 and train for a maximum of 200 epochs with an early stopping patience of 20 epochs. All experiments were conducted on an NVIDIA RTX 4090 GPU (Nvidia, Santa Clara, CA, USA) with 24 GB memory. The complete training process typically requires approximately 14 h.
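The distillation objective implied by these settings (temperature 2.0, loss weight α = 0.5) can be sketched as the standard soft-target KL term plus a hard cross-entropy term; the exact loss form is our assumption based on conventional knowledge distillation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha-weighted KD loss: soft KL on T-scaled logits + hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                            # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```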

4.2. Ablation Studies

To validate the effectiveness of our proposed approach, we conducted comprehensive ablation studies focusing on three key aspects. First, we evaluated the performance of different modalities: EEG-only using a Transformer encoder, image-only using Markov transition field images, and the proposed dual-modality approach. Second, we compared various fusion strategies, including simple concatenation, cross-attention mechanism, and gated fusion. Third, we analyzed the impact of knowledge distillation by comparing the student model performance with and without distillation under different temperature values.

4.3. Cross-Dataset Validation

To evaluate the generalization capability of our model, we performed cross-dataset validation experiments between the Bonn and CHB-MIT datasets. The experiments included training on one dataset and testing on another, as well as fine-tuning experiments on the target dataset. The results demonstrate our model’s robust performance across different data distributions and recording conditions.

4.4. Computational Efficiency

Our lightweight student model achieves a significant reduction in computational resources while maintaining competitive performance. The model size was reduced by 60% compared to the teacher model, with only a marginal decrease in accuracy (less than 2%). The inference time on a single CPU core averaged 0.1 s per sample, making it suitable for real-time applications on resource-constrained devices.

4.5. Comparison with State-of-the-Art Methods

We compared our approach with several State-of-the-Art (SOTA) methods, including traditional machine learning approaches, deep learning-based methods, Transformer-based approaches, and existing multi-modal methods. The comparison focused on classification accuracy, computational efficiency, and model interpretability. The results demonstrate the superiority of our approach in achieving a better balance between performance and resource utilization.

5. Results and Discussion

Before presenting our results, we first define the key performance metrics used in our evaluation. Accuracy represents the overall proportion of correct predictions across all classes, which can be formally expressed as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively. Sensitivity, also known as recall, indicates the model's ability to correctly identify actual seizure events and is defined as:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
This metric is particularly crucial in clinical settings, where missing a seizure event could have serious consequences. Specificity measures the model’s ability to correctly identify non-seizure periods:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
High specificity helps minimize false alarms that could lead to unnecessary interventions. The Area Under the Receiver Operating Characteristic Curve (AUC) evaluates the model’s discrimination ability across different classification thresholds:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(t)\, d\big(\mathrm{FPR}(t)\big)$$
where TPR is the True Positive Rate and FPR is the False Positive Rate at threshold t. AUC provides a threshold-independent performance measure that is particularly useful for imbalanced datasets.
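For reference, these four metrics can be computed from binary predictions as in the following scikit-learn sketch, with seizure taken as the positive class and illustrative data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])                    # 1 = seizure
y_score = np.array([0.1, 0.9, 0.8, 0.3, 0.7, 0.2, 0.4, 0.6])   # model probabilities
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)           # recall on actual seizure events
specificity = tn / (tn + fp)           # correct rejection of non-seizure periods
auc = roc_auc_score(y_true, y_score)   # threshold-independent discrimination
```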

5.1. Performance Analysis of Proposed Method

Table 1 presents the comprehensive performance comparison of our proposed approach against existing state-of-the-art methods on both the CHB-MIT and Bonn datasets. Our method demonstrates exceptional performance across all metrics, achieving 98.7% accuracy on CHB-MIT and 99.2% accuracy on the Bonn dataset.
The consistent superior performance across both datasets highlights the robustness of our dual-modality approach. Notably, our method achieves higher specificity (99.0% and 99.4% on CHB-MIT and Bonn, respectively) compared to sensitivity metrics, indicating its strong capability in reducing false positives—a crucial factor in clinical applications. The high AUC values (98.8% and 99.3%) further confirm the model’s excellent discrimination ability across different operating thresholds.
Our comprehensive performance analysis combines quantitative metrics with statistical validation and interpretability analysis. The confusion matrices in Figure 3 reveal several important patterns in our model's classification behavior. For Class 1, Class 2, and Class 3, the model achieves the highest accuracy (99.8%), demonstrating its reliability in identifying true seizure events. False-positive rates were below 0.4% for all classes, which is particularly important for clinical applications because low false-alarm rates minimize unnecessary interventions. Using the normal distribution approximation, the confidence interval for the overall accuracy is calculated as [0.9961, 0.9979].
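For transparency, the normal-approximation (Wald) interval has the closed form p ± z·sqrt(p(1 − p)/n). The sketch below reproduces an interval of roughly [0.9961, 0.9979] under an assumed 95% level (z = 1.96) and an illustrative sample size, since the exact test-set size behind the reported interval is not restated here.

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)  # normal-approximation half-width
    return p_hat - half, p_hat + half

low, high = wald_ci(p_hat=0.997, n=14000)  # ~(0.9961, 0.9979) under these assumptions
```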
Analysis of feature correlations (Figure 4) reveals strong complementarity between EEG and MTF representations. The diagonal patterns in the correlation matrices indicate that each modality captures unique aspects of the underlying neural activity, while the cross-correlation patterns demonstrate successful information integration through our SYNI mechanism.
For clinical applicability, we conducted latency analysis showing that our model achieves an average inference time of 0.022 s per sample on standard hardware (NVIDIA RTX 4090), well within the requirements for real-time monitoring applications. Memory utilization remains stable at 1.7 GB during inference, making the system suitable for deployment in resource-constrained clinical settings.

5.2. Ablation Study Analysis

The effectiveness of each component in our proposed architecture is validated through comprehensive ablation studies, as shown in Table 2. The base model using only EEG signals achieves 93.5% accuracy, comparable to existing single-modality approaches. The introduction of time series imaging significantly improves performance to 95.7%, demonstrating the value of our complementary visual representation strategy. The dual modality architecture further enhances accuracy to 97.2%, while the addition of the cross-attention mechanism yields the best performance of 98.7%. The ablation results also validate our knowledge distillation strategy. The student model without knowledge distillation achieves 94% accuracy, while the distilled version reaches 96.8%, demonstrating successful knowledge transfer from the teacher model while maintaining computational efficiency. This represents only a 1.9% accuracy drop from the teacher model while reducing the model size by 60%, making it highly suitable for deployment in resource-constrained environments.
Feature visualization through UMAP dimensionality reduction (Figure 5) provides insights into the model’s decision-making process. The clear separation between different seizure states in the feature space, which is particularly evident in the right portion of the plot, indicates that our dual-modality approach successfully learns discriminative representations.

5.3. Cross-Dataset Validation

The performance variation between the Bonn and CHB-MIT datasets (99.7% vs. 98.7% accuracy) can be attributed to several fundamental differences in data characteristics and classification complexity. Our detailed analysis reveals three key factors contributing to this disparity: First, the intrinsic separability of classes differs significantly between datasets. The Bonn dataset exhibits strong class separation, particularly for Seizure (S) versus non-seizure states, with an average inter-class distance of 2.68 (Figure 6). The seizure class shows consistently high separation values (ranging from 2.60 to 2.73) from all other classes. In contrast, the CHB-MIT dataset’s binary classification task shows a more modest separability measure of 0.41 between ictal and interictal states (Figure 7), indicating inherently more challenging classification boundaries.
Second, feature space analysis through dimensionality reduction techniques reveals distinct clustering patterns. As illustrated in Figure 8, the Bonn dataset’s five-class structure shows well-defined clusters, particularly in the Seizure class (S), which forms a distinct region in both PCA (explaining 58.52% of variance) and t-SNE spaces. Figure 9 shows that the CHB-MIT dataset exhibits more complex feature distributions with significant overlap between classes, as evidenced by the scattered distribution in both dimensional reduction techniques.
Third, the examination of individual feature distributions demonstrates more pronounced discriminative characteristics in the Bonn dataset. As shown in Figure 10, key features such as the Signal-to-Noise Ratio (SNR) and spectral entropy show markedly different distributions across classes, particularly for the seizure class. Figure 11 reveals that the CHB-MIT dataset displays more subtle differences in feature distributions between ictal and interictal states, with considerable overlap in all measured characteristics. This is particularly evident in the total power and dominant frequency features, where the distributions show significant overlap between states.
The higher accuracy achieved on the Bonn dataset (99.7%) can thus be attributed to its more distinct class separability and clearer feature differentiation. The slightly lower performance on CHB-MIT (98.7%) reflects the greater complexity of distinguishing between ictal and interictal states in continuous, long-term EEG recordings, where state transitions are more gradual and feature boundaries are less distinct. Despite these challenges, our model maintains robust performance across both datasets, demonstrating its effectiveness in handling varying levels of classification complexity.

5.4. Clinical Implications and Practical Considerations

Our method’s success in eliminating complex preprocessing steps represents a significant advancement toward practical clinical applications.
Traditional approaches often rely heavily on sophisticated signal processing techniques and manual feature engineering, which can be both time-consuming and expertise-dependent. In contrast, our approach directly learns from raw EEG signals and their time series images, significantly reducing the preprocessing overhead while achieving superior performance.
The successful knowledge distillation into a lightweight model addresses a critical need in clinical settings. The student model, while maintaining 96.8% accuracy, requires only 40% of the original computational resources, making it suitable for deployment on portable devices and integration with existing clinical systems. This efficiency is particularly valuable for continuous monitoring applications, where real-time processing and battery life are crucial considerations.

5.5. Advantages over Existing Methods

The comprehensive experimental results demonstrate several key advantages of our approach: The dual-modality architecture effectively leverages complementary information from both EEG signals and their image representations, leading to more robust feature learning compared to single-modality approaches. This is evidenced by the consistent performance improvements across different datasets and evaluation metrics. Furthermore, our cross-attention mechanism successfully captures complex interactions between temporal and spatial features, as demonstrated by the ablation studies. The significant performance gain (1.5%) from adding cross-attention highlights its importance in effective multi-modal fusion.
In addition, the knowledge distillation model successfully addresses the practical limitations of deploying complex models in clinical settings. The student model's performance suggests that most of the discriminative power of the dual-modality approach can be effectively compressed into a lightweight, single-modality model.

5.6. Future Directions

While our current results are promising, several directions for future research emerge from this study. The strong cross-dataset performance suggests the potential for transfer learning applications, where models pre-trained on large datasets could be efficiently fine-tuned for specific clinical settings.
Additionally, the success of our time series imaging approach opens possibilities for exploring other signal-to-image transformation techniques that might capture different aspects of EEG signals. The effectiveness of our knowledge distillation model also suggests opportunities for further model compression and optimization. Future work could explore more sophisticated distillation strategies or alternative lightweight architectures that might achieve even better trade-offs between accuracy and computational efficiency.
These results collectively demonstrate that our proposed method not only advances the state of the art in EEG-based epilepsy detection but also provides a practical solution for clinical deployment. The combination of high accuracy, robust generalization, and computational efficiency makes our approach a promising foundation for next-generation epilepsy monitoring systems.

6. Limitations and Future Work

6.1. Current Limitations

Our study presents several important limitations that warrant acknowledgment and guide future research directions. The primary limitation relates to dataset representation, as our experiments primarily rely on CHB-MIT and Bonn datasets. While these are widely accepted benchmarks, they may not fully capture the diversity of epileptic patterns across different populations, age groups, and clinical conditions. Furthermore, these datasets were collected in controlled clinical environments, potentially limiting our model’s generalizability to real-world scenarios with varying noise conditions and recording qualities.
From a computational perspective, although our model demonstrates improved efficiency compared to existing approaches, the processing of multimodal data in real-time still presents challenges for deployment on resource-constrained devices. The current architecture requires dedicated GPU resources for optimal performance, which may limit its accessibility in some clinical settings. This computational demand becomes particularly relevant when considering widespread deployment in healthcare facilities with varying technological infrastructure.
The temporal aspects of seizure prediction present another significant limitation. Our current implementation focuses on short-term prediction windows, typically within 30 min before seizure onset. Extending the prediction horizon while maintaining high accuracy remains challenging due to the increased variability and complexity of long-term EEG patterns. This limitation affects the model’s practical utility in providing earlier warnings to patients and healthcare providers.
Model interpretability remains a crucial challenge despite our attention mechanism providing some level of insight. The complex nature of our deep learning architecture presents difficulties in providing clear, clinically meaningful explanations for its predictions. This limitation is particularly significant in healthcare applications, where understanding the reasoning behind predictions is essential for clinical decision-making and patient trust.

6.2. Future Research Directions

Looking forward, several promising research directions emerge from our current limitations. A primary focus should be the enhancement of data integration capabilities. Future research should explore the incorporation of additional physiological signals and contextual information, including environmental factors, patient-specific metadata, and circadian rhythm patterns. This expanded data integration would provide a more comprehensive understanding of seizure precursors and potentially improve prediction accuracy.
Model optimization represents another critical area for future work. Research efforts should focus on developing more efficient attention mechanisms specifically designed for EEG signal processing and investigating lightweight model architectures suitable for edge device deployment. These optimizations would address the current computational limitations while maintaining or improving prediction performance.
Clinical integration presents both challenges and opportunities for future research. The development of real-time monitoring systems that seamlessly integrate our prediction model into existing healthcare workflows requires careful consideration of user interface design, alert systems, and clinical validation across diverse patient populations. This integration must balance technical capabilities with practical clinical requirements and constraints. For clinical applicability, we conducted comprehensive latency and memory analysis across different numerical precision settings. With FP32 precision, our model achieves an average inference time of 0.022 s per sample on standard hardware (NVIDIA RTX 4090), with memory utilization stable at 1.7 GB during inference. When quantized to INT8 precision, the memory utilization would reduce to approximately 0.43 GB (a 75% reduction), while INT4 quantization would further reduce the memory footprint to around 0.21 GB (an 88% reduction). These substantial memory reductions through quantization, combined with the model’s efficient inference time, make the system particularly suitable for deployment in resource-constrained clinical settings, including potential edge devices for continuous monitoring.
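The quoted figures follow from simple bits-per-weight scaling of the measured FP32 footprint, as the short sketch below shows; real quantized runtimes add overhead for scales and activations, so these are first-order estimates consistent with the approximate numbers in the text.

```python
fp32_gb = 1.7  # measured FP32 inference footprint reported above
estimate = {bits: fp32_gb * bits / 32 for bits in (32, 8, 4)}
# estimate[8] -> 0.425 GB, i.e., ~0.43 GB (75% reduction)
# estimate[4] -> 0.2125 GB, i.e., ~0.21 GB (88% reduction)
```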
The extension of prediction capabilities represents a significant opportunity for advancement. Future research should focus on extending the prediction horizon while maintaining high accuracy, potentially through the development of hierarchical prediction models that combine short-term and long-term forecasting capabilities. Additionally, the incorporation of seizure severity prediction alongside occurrence prediction would enhance the clinical utility of the system.
Privacy and security considerations will continue to be crucial as these systems evolve. Future research must address the development of privacy-preserving training methods for sensitive medical data and secure data sharing protocols for collaborative model training. These advances must maintain compliance with healthcare regulations while enabling effective model development and deployment.
The long-term vision for this research is the development of a comprehensive epilepsy management system that provides accurate, real-time seizure prediction while operating efficiently on portable devices and integrating seamlessly with existing healthcare systems. This system should maintain patient privacy and data security while adapting to individual patient characteristics and needs. Achieving this vision will require continued collaboration between computer science researchers, medical professionals, and healthcare technology developers, working together to translate theoretical advances into practical clinical solutions.

7. Conclusions

This paper proposes a novel dual-modality deep learning model for epileptic seizure prediction that effectively integrates time series imaging with Transformer-based architecture. Through comprehensive evaluations on both CHB-MIT and Bonn datasets, our method demonstrates superior performance, achieving 98.7% and 99.2% accuracy, respectively. The strong cross-dataset validation results, maintaining accuracy above 96%, confirm the robust generalization capability of our approach. The proposed attention mechanism successfully balances prediction accuracy with computational efficiency, making it suitable for practical clinical applications. While there remain opportunities for further improvement, particularly in long-term prediction horizons, our work establishes a solid foundation for accurate and reliable seizure prediction. This advancement represents a significant step toward improving the quality of life for individuals with epilepsy through early seizure detection and intervention.

Author Contributions

Conceptualization, J.Q. and F.L.; methodology, J.Q.; software, J.Q.; validation, J.Q., J.Z. and Z.L.; formal analysis, J.Q.; investigation, J.Q., J.Z. and Z.L.; resources, J.Z. and Z.L.; data curation, J.Q.; writing—original draft preparation, J.Q., J.Z. and Z.L.; writing—review and editing, J.Q., J.Z., Z.L. and F.L.; visualization, J.Q., J.Z. and Z.L.; supervision, F.L.; project administration, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADML (Adaptive Dual-Modal Learning): Novel approach combining different types of analysis methods to improve seizure prediction.
CHB-MIT Dataset: A publicly available collection of EEG recordings from pediatric subjects with epilepsy, collected at Boston Children’s Hospital.
EEG (Electroencephalography): A method of recording electrical activity of the brain using sensors placed on the scalp.
MTF (Markov Transition Field): A technique for converting time-based signals into image representations.
SYNI (Synergistic Modal Integration): Our proposed method for combining different types of information from EEG signals.
Time Series Imaging: The process of converting time-based signals (like EEG) into visual representations for analysis.
Transformer: A type of artificial intelligence model that can learn patterns in sequential data.
ViT (Vision Transformer): A specialized type of Transformer model designed to analyze image data.

References

  1. Asadi-Pooya, A.A.; Brigo, F.; Lattanzi, S.; Blumcke, I. Adult epilepsy. Lancet 2023, 402, 412–424.
  2. Smith, P.E. Initial Management of Seizure in Adults. N. Engl. J. Med. 2021, 385, 251–263.
  3. Devinsky, O.; Vezzani, A.; O’Brien, T.J.; Jette, N.; Scheffer, I.E.; de Curtis, M.; Perucca, P. Epilepsy. Nat. Rev. Dis. Prim. 2018, 4, 18025.
  4. Ding, D.; Zhou, D.; Sander, J.W.; Wang, W.; Li, S.; Hong, Z. Epilepsy in China: Major progress in the past two decades. Lancet Neurol. 2021, 20, 316–326.
  5. Pellinen, J.; Foster, E.C.; Wilmshurst, J.M.; Zuberi, S.M.; French, J. Improving epilepsy diagnosis across the lifespan: Approaches and innovations. Lancet Neurol. 2024, 23, 511–521.
  6. Tasci, I.; Tasci, B.; Barua, P.D.; Dogan, S.; Tuncer, T.; Palmer, E.E.; Fujita, H.; Acharya, U.R. Epilepsy detection in 121 patient populations using hypercube pattern from EEG signals. Inf. Fusion 2023, 96, 252–268.
  7. Xin, Q.; Hu, S.; Liu, S.; Zhao, L.; Zhang, Y.D. An Attention-Based Wavelet Convolution Neural Network for Epilepsy EEG Classification. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 30, 957–966.
  8. Ahmad, I.; Wang, X.; Javeed, D.; Kumar, P.; Samuel, O.W.; Chen, S. A Hybrid Deep Learning Approach for Epileptic Seizure Detection in EEG Signals. IEEE J. Biomed. Health Inform. 2023, 1–12.
  9. He, S.; Li, Y.; Le, X.; Han, X.; Lin, J.; Peng, X.; Li, M.; Yang, R.; Yao, D.; Valdes-Sosa, P.A.; et al. Assessment of Multivariate Information Transmission in Space-Time-Frequency Domain: A Case Study for EEG Signals. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 1764–1775.
  10. Boonyakitanont, P.; Lek-uthai, A.; Chomtho, K.; Songsiri, J. A review of feature extraction and performance evaluation in epileptic seizure detection using EEG. Biomed. Signal Process. Control 2020, 57, 101702.
  11. Cura, O.K.; Ozdemir, M.A.; Akan, A. Epileptic EEG Classification Using Synchrosqueezing Transform with Machine and Deep Learning Techniques. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 18–21 January 2021; pp. 1210–1214.
  12. Qiu, X.; Yan, F.; Liu, H. A difference attention ResNet-LSTM network for epileptic seizure detection using EEG signal. Biomed. Signal Process. Control 2023, 83, 104652.
  13. Mumtaz, W.; Rasheed, S.; Irfan, A. Review of challenges associated with the EEG artifact removal methods. Biomed. Signal Process. Control 2021, 68, 102741.
  14. Manarikkal, I.; Elasha, F.; Mba, D. Diagnostics and prognostics of planetary gearbox using CWT, auto regression (AR) and K-means algorithm. Appl. Acoust. 2021, 184, 108314.
  15. Reichert, R.; Kaifler, N.; Kaifler, B. Limitations in wavelet analysis of non-stationary atmospheric gravity wave signatures in temperature profiles. Atmos. Meas. Tech. 2024, 17, 4659–4673.
  16. Konar, P.; Saha, M.; Sil, J.; Chattopadhyay, P. Fault diagnosis of induction motor using CWT and rough-set theory. In Proceedings of the 2013 IEEE Symposium on Computational Intelligence in Control and Automation (CICA), Singapore, 16–19 April 2013; pp. 17–23.
  17. Hu, Y.; Li, F.; Li, H.; Liu, C. An enhanced empirical wavelet transform for noisy and non-stationary signal processing. Digit. Signal Process. 2017, 60, 220–229.
  18. Chuang, C.H.; Chang, K.Y.; Huang, C.S.; Jung, T.P. IC-U-Net: A U-Net-based Denoising Autoencoder Using Mixtures of Independent Components for Automatic EEG Artifact Removal. NeuroImage 2022, 263, 119586.
  19. Ranjan, R.; Sahana, B.C.; Bhandari, A.K. Motion Artifacts Suppression From EEG Signals Using an Adaptive Signal Denoising Method. IEEE Trans. Instrum. Meas. 2022, 71, 3142037.
  20. Kucukler, O.F.; Amira, A.; Malekmohamadi, H. EEG channel selection using Gramian Angular Fields and spectrograms for energy data visualization. Eng. Appl. Artif. Intell. 2024, 133, 108305.
  21. Bore, J.C.; Li, P.; Jiang, L.; Ayedh, W.M.A.; Chen, C.; Harmah, D.J.; Yao, D.; Cao, Z.; Xu, P. A Long Short-Term Memory Network for Sparse Spatiotemporal EEG Source Imaging. IEEE Trans. Med. Imaging 2021, 40, 3787–3800.
  22. Shen, F.; Liu, J.; Wu, K. Multivariate Time Series Forecasting Based on Elastic Net and High-Order Fuzzy Cognitive Maps: A Case Study on Human Action Prediction Through EEG Signals. IEEE Trans. Fuzzy Syst. 2021, 29, 2336–2348.
  23. Li, X.; Kang, Y.; Li, F. Forecasting with time series imaging. Expert Syst. Appl. 2020, 160, 113680.
  24. Liu, L.; Wang, Z. Encoding temporal Markov dynamics in graph for visualizing and mining time series. In Proceedings of the Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
  25. Jiang, J.R.; Yen, C.T. Markov Transition Field and Convolutional Long Short-Term Memory Neural Network for Manufacturing Quality Prediction. In Proceedings of the 2020 IEEE International Conference on Consumer Electronics—Taiwan (ICCE-Taiwan), Taoyuan, Taiwan, 28–30 September 2020; pp. 1–2.
  26. Wang, Z.; Oates, T. Imaging time-series to improve classification and imputation. arXiv 2015, arXiv:1506.00327.
  27. Shankar, A.; Dandapat, S.; Barma, S. Discrimination of Types of Seizure Using Brain Rhythms Based on Markov Transition Field and Deep Learning. IEEE Open J. Instrum. Meas. 2022, 1, 3202555.
  28. Li, R.; Wu, Y.; Wu, Q.; Dey, N.; Crespo, R.G.; Shi, F. Emotion stimuli-based surface electromyography signal classification employing Markov transition field and deep neural networks. Measurement 2022, 189, 110470.
  29. Duque-Muñoz, L.; Espinosa-Oviedo, J.J.; Castellanos-Dominguez, C.G. Identification and monitoring of brain activity based on stochastic relevance analysis of short-time EEG rhythms. Biomed. Eng. Online 2014, 13, 1–20.
  30. Chen, D.; Wan, S.; Xiang, J.; Bao, F.S. A high-performance seizure detection algorithm based on Discrete Wavelet Transform (DWT) and EEG. PLoS ONE 2017, 12, e0173138.
  31. Oweis, R.J.; Abdulhay, E.W. Seizure classification in EEG signals utilizing Hilbert-Huang transform. Biomed. Eng. Online 2011, 10, 1–15.
  32. Riaz, F.; Hassan, A.; Rehman, S.; Niazi, I.K.; Dremstrup, K. EMD-based temporal and spectral features for the classification of EEG signals using supervised learning. IEEE Trans. Neural Syst. Rehabil. Eng. 2015, 24, 28–35.
  33. Jindal, K.; Upadhyay, R.; Singh, H.S. Application of tunable-Q wavelet transform based nonlinear features in epileptic seizure detection. Analog Integr. Circuits Signal Process. 2019, 100, 437–452.
  34. Hu, W.; Cao, J.; Lai, X.; Liu, J. Mean amplitude spectrum based epileptic state classification for seizure prediction using convolutional neural networks. J. Ambient Intell. Humaniz. Comput. 2023, 14, 15485–15495.
  35. Li, C.; Liang, M. A generalized synchrosqueezing transform for enhancing signal time–frequency representation. Signal Process. 2012, 92, 2264–2274.
  36. Khare, S.K.; Bajaj, V. Time–Frequency Representation and Convolutional Neural Network-Based Emotion Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2901–2909.
  37. Zhang, R.; Jia, J.; Zhang, R. EEG analysis of Parkinson’s disease using time–frequency analysis and deep learning. Biomed. Signal Process. Control 2022, 78, 103883.
  38. Ghaderyan, P.; Abbasi, A.; Sedaaghi, M.H. An efficient seizure prediction method using KNN-based undersampling and linear frequency measures. J. Neurosci. Methods 2014, 232, 134–142.
  39. Omidvar, M.; Zahedi, A.; Bakhshi, H. EEG signal processing for epilepsy seizure detection using 5-level Db4 discrete wavelet transform, GA-based feature selection and ANN/SVM classifiers. J. Ambient Intell. Humaniz. Comput. 2021, 12, 10395–10403.
  40. Acharya, U.R.; Molinari, F.; Sree, S.V.; Chattopadhyay, S.; Ng, K.H.; Suri, J.S. Automated diagnosis of epileptic EEG using entropies. Biomed. Signal Process. Control 2012, 7, 401–408.
  41. Ma, Y.; Huang, Z.; Su, J.; Shi, H.; Wang, D.; Jia, S.; Li, W. A multi-channel feature fusion CNN-BI-LSTM epilepsy EEG classification and prediction model based on attention mechanism. IEEE Access 2023, 11, 62855–62864.
  42. Gelenbe, E. Random neural networks with negative and positive signals and product form solution. Neural Comput. 1989, 1, 502–510.
  43. Şeker, A.; Diri, B.; Balık, H.H. A review about deep learning methods and applications. Gazi J. Eng. Sci. 2017, 3, 47–64.
  44. Wang, X.; Wang, Y.; Liu, D.; Wang, Y.; Wang, Z. Automated recognition of epilepsy from EEG signals using a combining space–time algorithm of CNN-LSTM. Sci. Rep. 2023, 13, 14876.
  45. Najafi, T.; Jaafar, R.; Remli, R.; Wan Zaidi, W.A. A classification model of EEG signals based on RNN-LSTM for diagnosing focal and generalized epilepsy. Sensors 2022, 22, 7269.
  46. Velasco-Gallego, C.; Lazakis, I. Analysis of Time Series Imaging Approaches for the Application of Fault Classification of Marine Systems. In Proceedings of the 32nd European Safety and Reliability Conference, Dublin, Ireland, 28 August–1 September 2022; pp. 1353–1362.
  47. Wang, Z.; Oates, T. Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. In Proceedings of the Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
  48. Ko, D.W.; Yang, J.J. EEG-based schizophrenia diagnosis through time series image conversion and deep learning. Electronics 2022, 11, 2265.
  49. Zhao, D.D.; Zhao, Q. EEG-Based Cross-Subject Emotion Recognition Using GADF and CADAN. In Proceedings of the 2023 China Automation Congress (CAC), Chongqing, China, 17–19 November 2023; pp. 381–386.
  50. Shankar, A.; Khaing, H.K.; Dandapat, S.; Barma, S. Epileptic seizure classification based on Gramian angular field transformation and deep learning. In Proceedings of the 2020 IEEE Applied Signal Processing Conference (ASPCON), Kolkata, India, 7–9 October 2020; pp. 147–151.
  51. Xu, G.; Ren, T.; Chen, Y.; Che, W. A One-Dimensional CNN-LSTM Model for Epileptic Seizure Recognition Using EEG Signal Analysis. Front. Neurosci. 2020, 14, 578126.
  52. Zhu, R.; Pan, W.X.; Liu, J.X.; Shang, J.L. Epileptic seizure prediction via multidimensional transformer and recurrent neural network fusion. J. Transl. Med. 2024, 22, 895.
  53. Hu, S.; Liu, J.; Yang, R.; Wang, Y.; Wang, A.; Li, K.; Liu, W.; Yang, C. Exploring the Applicability of Transfer Learning and Feature Engineering in Epilepsy Prediction Using Hybrid Transformer Model. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 1321–1332.
  54. Deng, Z.; Li, C.; Song, R.; Liu, X.; Qian, R.; Chen, X. EEG-based seizure prediction via hybrid vision transformer and data uncertainty learning. Eng. Appl. Artif. Intell. 2023, 123, 106401.
  55. Thuwajit, P.; Rangpong, P.; Sawangjai, P.; Autthasan, P.; Chaisaen, R.; Banluesombatkul, N.; Boonchit, P.; Tatsaringkansakul, N.; Sudhawiyangkul, T.; Wilaiprasitporn, T. EEGWaveNet: Multiscale CNN-Based Spatiotemporal Feature Extraction for EEG Seizure Detection. IEEE Trans. Ind. Inform. 2022, 18, 5547–5557.
  56. Abdulwahhab, A.H.; Abdulaal, A.H.; Thary Al-Ghrairi, A.H.; Mohammed, A.A.; Valizadeh, M. Detection of epileptic seizure using EEG signals analysis based on deep learning techniques. Chaos Solitons Fractals 2024, 181, 114700.
Figure 1. Overview of the Adaptive Dual-Modal Learning (ADML) framework. The temporal stream (top path) consists of two parallel branches: (1) a Transformer encoder that captures long-range temporal dependencies in the raw EEG signal through self-attention; the input signal is first projected to a higher-dimensional space (128-d) and combined with positional encodings to preserve temporal order; and (2) a frequency-analysis branch, implemented as a fully convolutional network (FCNN), that extracts spectral characteristics through consecutive 1D convolutions with increasing receptive fields to capture multi-scale frequency patterns. The spatial stream (bottom path) processes the MTF images through a ResNet backbone, which extracts hierarchical spatial features via residual learning while preserving fine-grained information through skip connections, followed by a Vision Transformer (ViT) that captures global dependencies in image space by treating each image as a sequence of patches, each linearly projected and processed through multi-head self-attention layers.
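
As a structural illustration of Figure 1, the PyTorch sketch below wires together the components named in the caption: a 128-d projection with positional encoding feeding a Transformer encoder, a 1D-convolutional frequency branch, and a ResNet backbone over the MTF image. The ViT stage and the SYNI fusion are omitted for brevity, simple concatenation stands in for fusion, and all depths, widths, and the torchvision backbone are placeholder assumptions rather than the trained configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ADMLSketch(nn.Module):
    """Minimal structural sketch of the dual-stream design in Figure 1.
    Dimensions and backbones are illustrative assumptions only."""
    def __init__(self, seq_len=256, d_model=128, n_classes=2):
        super().__init__()
        # Temporal branch 1: 128-d projection + positional encoding + Transformer.
        self.proj = nn.Linear(1, d_model)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Temporal branch 2: stacked 1D convolutions with growing receptive fields.
        self.fcnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, d_model, kernel_size=15, padding=7), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Spatial stream: ResNet backbone over the MTF image (512-d output).
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.resnet = backbone
        self.head = nn.Linear(d_model + d_model + 512, n_classes)

    def forward(self, eeg, mtf_img):
        # eeg: (B, seq_len); mtf_img: (B, 3, H, W)
        t = self.encoder(self.proj(eeg.unsqueeze(-1)) + self.pos).mean(dim=1)
        f = self.fcnn(eeg.unsqueeze(1)).squeeze(-1)
        s = self.resnet(mtf_img)
        return self.head(torch.cat([t, f, s], dim=1))
```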
Figure 2. Synergistic modal integration.
Figure 3. Confusion matrix for the test set of the Bonn dataset.
Figure 4. Feature correlations for the test set of the Bonn dataset.
Figure 5. UMAP visualization of learned features from different modalities. Each point represents a sample, and colors indicate different seizure states. The clear separation between epileptic states indicates that our dual-modality approach learns discriminative representations capable of distinguishing different types of seizure activity.
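
For readers who wish to reproduce this kind of view, a minimal sketch using the umap-learn package is given below. The `features` and `labels` arrays are placeholder stand-ins for the learned representations and seizure-state annotations, which are not reproduced here; the UMAP hyperparameters are common defaults, not tuned values.

```python
import numpy as np
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt

# Placeholder stand-ins for the learned feature matrix and state labels.
features = np.random.randn(500, 128)   # (n_samples, feature_dim)
labels = np.random.randint(0, 5, 500)  # seizure-state annotations

# Project the high-dimensional features to 2-D for visual inspection.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                      random_state=42).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=8)
plt.title("UMAP of learned dual-modality features")
plt.show()
```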
Figure 6. Class separability matrix for the Bonn dataset showing pairwise distances between different EEG states (Z: healthy eyes open, O: healthy eyes closed, N: interictal zone, F: epileptogenic zone, S: seizure activity). Higher values indicate better class separation.
Figure 7. Class separability matrix for the CHB-MIT dataset showing the distance between ictal and interictal states in the binary classification task.
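
The separability matrices in Figures 6 and 7 can be approximated by a simple pairwise distance between class centroids in the learned feature space. The sketch below uses Euclidean distance between per-class means; this is an assumed proxy metric, not necessarily the exact measure used to produce the figures.

```python
import numpy as np

def class_separability(features: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between class centroids.

    A simple proxy for the separability matrices in Figures 6 and 7;
    the exact metric used for the published figures may differ.
    """
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=-1)  # shape (n_classes, n_classes)
```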
Figure 8. Dimensionality reduction visualization of the Bonn dataset feature space using PCA (left, preserving 58.52% of the variance) and t-SNE (right), showing class clustering patterns.
Figure 9. Feature space visualization of the CHB-MIT dataset using PCA (left, 63.98% of variance explained) and t-SNE (right), illustrating the distribution of ictal and interictal states.
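
The PCA and t-SNE projections in Figures 8 and 9 follow standard usage; a minimal scikit-learn sketch is shown below, again with a placeholder `features` matrix standing in for the learned representations. The printed explained-variance ratio corresponds to the percentages quoted in the captions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

features = np.random.randn(500, 128)  # placeholder learned features

# Linear projection; report how much variance the two components preserve.
pca = PCA(n_components=2).fit(features)
pca_coords = pca.transform(features)
print(f"variance preserved: {pca.explained_variance_ratio_.sum():.2%}")

# Nonlinear embedding for cluster-structure inspection.
tsne_coords = TSNE(n_components=2, perplexity=30,
                   random_state=42).fit_transform(features)
```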
Figure 10. Distribution of key EEG features across different classes in the Bonn dataset, including Signal-to-Noise Ratio (SNR), spectral entropy, dominant frequency, total power, skewness, and kurtosis.
Figure 11. Distribution of EEG features between ictal and interictal states in the CHB-MIT dataset, demonstrating the feature characteristics of seizure versus non-seizure periods.
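
Figures 10 and 11 summarize standard descriptive features; one conventional way to compute several of them from a single EEG window with SciPy is sketched below. The sampling rate is an illustrative assumption, and SNR is omitted because its definition varies across studies.

```python
import numpy as np
from scipy import signal, stats

def eeg_features(x: np.ndarray, fs: float = 256.0) -> dict:
    """Common descriptive features for one EEG window (illustrative definitions)."""
    # Welch power spectral density estimate.
    freqs, psd = signal.welch(x, fs=fs, nperseg=min(len(x), 256))
    p = psd / psd.sum()  # normalized spectral distribution
    return {
        "spectral_entropy": float(-(p * np.log2(p + 1e-12)).sum()),
        "dominant_frequency": float(freqs[np.argmax(psd)]),
        # Welch bins are evenly spaced, so a rectangle sum integrates the PSD.
        "total_power": float(psd.sum() * (freqs[1] - freqs[0])),
        "skewness": float(stats.skew(x)),
        "kurtosis": float(stats.kurtosis(x)),
    }
```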
Table 1. Performance comparison on both datasets.

Dataset | Method | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC (%)
CHB-MIT | Conv1D+LSTM [51] | 91.2 | 90.8 | 91.5 | 91.3
CHB-MIT | Transformer [52] | 93.4 | 92.9 | 93.8 | 93.5
CHB-MIT | FedTransformer [53] | 94.1 | 93.7 | 94.4 | 94.2
CHB-MIT | HViT-DUL [54] | 94.0 | 93.8 | 94.0 | 93.6
CHB-MIT | TGCNN [55] | 92.8 | 92.2 | 93.7 | 93.5
CHB-MIT | EEGWaveNet [55] | 98.3 | 91.2 | 98.0 | 98.6
CHB-MIT | PCNN-RNN [56] | 96.9 | 92.2 | 97.6 | 97.1
CHB-MIT | ADML (Ours) | 98.7 | 98.3 | 99.0 | 98.8
Bonn | Conv1D+LSTM [51] | 92.5 | 91.8 | 93.1 | 92.7
Bonn | Transformer [52] | 94.8 | 94.2 | 95.3 | 94.9
Bonn | FedTransformer [53] | 95.6 | 95.1 | 96.0 | 95.7
Bonn | HViT-DUL [54] | 94.9 | 94.1 | 94.3 | 95.1
Bonn | TGCNN [55] | 94.5 | 93.3 | 92.0 | 94.9
Bonn | EEGWaveNet [55] | 97.9 | 92.9 | 96.0 | 97.7
Bonn | PCNN-RNN [56] | 99.0 | 95.2 | 96.6 | 98.9
Bonn | ADML (Ours) | 99.7 | 98.9 | 99.4 | 99.3

Note: All comparison models were reproduced from the corresponding papers or their open-source code, and all models were tested on the same computing device. Bold values indicate the best performance for each metric.
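
For completeness, the sketch below shows how the four metrics reported in Table 1 are conventionally computed with scikit-learn from binary predictions and scores; it mirrors the standard definitions rather than our exact evaluation script, and the label and score arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0])               # placeholder labels (1 = ictal)
y_score = np.array([0.1, 0.9, 0.7, 0.3, 0.8, 0.2])  # placeholder model scores
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)   # recall on the seizure class
specificity = tn / (tn + fp)   # recall on the non-seizure class
auc = roc_auc_score(y_true, y_score)
```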
Table 2. Ablation study results on CHB-MIT.

Model Variant | Accuracy (%) | Sensitivity (%) | AUC (%)
ADML (EEG time-series only) | 93.5 | 92.8 | 93.4
ADML (MTF image only) | 92.9 | 92.6 | 92.2
ADML (Dual-modal) | 98.7 | 98.3 | 98.8
ADML (Cross-attention fusion) | 97.2 | 96.8 | 97.1
ADML (Gated fusion) | 95.7 | 95.1 | 95.6
Student Model (EEG time-series) | 94.2 | 93.8 | 94.1
Student Model (MTF image) | 96.8 | 96.5 | 96.7

Note: Bold values indicate the best performance achieved for each evaluation metric.
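
To make the fusion variants in Table 2 concrete, the sketch below shows generic implementations of the two alternatives we ablate against: a gated fusion that learns a soft mixing weight, and a cross-attention fusion in which one modality attends to the other. The layer sizes are placeholder assumptions, and neither block is the SYNI module itself.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Soft gate mixing two same-dimensional modality embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([a, b], dim=-1))  # (B, dim), values in [0, 1]
        return g * a + (1 - g) * b

class CrossAttentionFusion(nn.Module):
    """One modality queries the other via multi-head attention."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (B, seq, dim); temporal tokens attend to spatial tokens.
        fused, _ = self.attn(query=a, key=b, value=b)
        return fused.mean(dim=1)  # pooled fused representation
```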
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
