Article

Research on Personalized Course Resource Recommendation Method Based on GEMRec

1 Post Big Data Technology and Application Engineering Research Center of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2 Post Industry Technology Research and Development Center of the State Posts Bureau (Internet of Things Technology), Nanjing University of Posts and Telecommunications, Nanjing 210003, China
3 Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1075; https://doi.org/10.3390/app15031075
Submission received: 19 December 2024 / Revised: 19 January 2025 / Accepted: 20 January 2025 / Published: 22 January 2025

Abstract

With the rapid growth of online educational resources, existing personalized course recommendation systems face challenges in multimodal feature integration and limited recommendation interpretability when dealing with complex and diverse instructional content. This paper proposes a graph-enhanced multimodal recommendation method (GEMRec), which effectively integrates text, video, and audio features through a graph attention network and differentiable pooling. Innovatively, GEMRec introduces graph edit distance into the recommendation system to measure the structural similarity between a learner’s knowledge state and course content at the knowledge graph level. Additionally, it combines SHAP (SHapley Additive exPlanations) value computation with large language models to generate reliable and personalized recommendation explanations. Experiments on the MOOCCubeX dataset demonstrate that the GEMRec model exhibits strong convergence and generalization during training. Compared with existing methods, GEMRec achieves 0.267, 0.265, and 0.297 on the Precision@10, Recall@10, and NDCG@10 metrics, respectively, significantly outperforming traditional collaborative filtering and other deep learning models. These results validate the effectiveness of multimodal feature integration and knowledge graph enhancement in improving recommendation performance.

1. Introduction

With the advancement of online education and digital learning, course resources have grown exponentially, leading to information overload that hinders learners from efficiently identifying suitable content, thus impacting learning outcomes and motivation. Therefore, precisely recommending relevant learning content has become a critical challenge in educational technology. Traditional recommendation methods, such as collaborative filtering and content-based recommendation systems, reveal distinct limitations when dealing with high-dimensional sparse data and complex nonlinear relationships. Collaborative filtering is prone to issues of cold start and data sparsity, while content-based methods are limited in capturing users’ underlying interests. Hybrid recommendation approaches have made some improvements, yet they remain insufficient in handling multimodal data.
In recent years, deep learning has brought significant breakthroughs to recommendation systems. Neural collaborative filtering and attention mechanism-based models have improved the ability to capture complex nonlinear interactions through deep feature extraction. However, single-modality recommendation systems struggle to effectively integrate multimodal data, such as text, video, and audio, resulting in inadequate cross-modal information fusion and user interest understanding. Multimodal deep learning, through the integration of various feature types, provides a more comprehensive understanding of course content, thereby enhancing recommendation effectiveness. Nonetheless, existing multimodal recommendation systems still face challenges in modality feature fusion and weight allocation, and they lack interpretability, which limits system transparency and user trust.
The application of knowledge graphs in recommendation systems offers unique advantages, as they enable structured pathways that link users to learning resources, aiding in understanding the knowledge structure of courses. However, integrating knowledge graphs with deep learning to enhance a system’s ability to understand course content, capture knowledge relationships, and provide interpretable recommendations remains a pressing challenge.
To address these challenges, this paper proposes a multimodal deep learning recommendation framework, GEMRec, enhanced by heterogeneous knowledge graphs. This approach includes the following theoretical innovations: (1) A cross-modal feature fusion mechanism based on a graph attention network, which employs differentiable pooling and multi-head attention to achieve hierarchical representation learning and dynamic weight allocation for heterogeneous data such as text, video, and audio. (2) A bidirectional enhancement learning paradigm that combines knowledge graphs and graph neural networks, leveraging graph edit distance-based similarity calculations and meta-path-based sequential representations to capture the intrinsic associations within course knowledge structures. (3) A graph-semantic-enhanced interpretability framework based on SHAP (SHapley Additive exPlanations) values and large language models, which quantifies feature importance and generates dynamic templates to provide interpretable recommendations. Experimental results demonstrate significant improvements in recommendation accuracy, knowledge representation, and system transparency. This research offers a novel theoretical paradigm and technical approach to addressing the challenges of multimodal feature fusion, knowledge representation learning, and interpretability in educational recommendation systems.

2. Related Work

2.1. Recommendation Models

In the field of recommendation systems, traditional methods include collaborative filtering, content-based recommendations, and hybrid recommendations. Collaborative filtering [1], as one of the most common recommendation algorithms, relies on analyzing relationships between users and items, recommending based on similar users’ behaviors. However, this method is limited by cold-start and data sparsity issues. Content-based recommendation systems [2] primarily recommend similar items by analyzing item features, but they struggle to capture users’ latent interests. Although Kristian Wahyudi et al. [3] improved recommendation systems’ content understanding in specific domains by incorporating multiple classifications and scoring mechanisms, limitations remain in broader applications. To overcome these limitations, hybrid recommendation systems [4] combine collaborative filtering with content-based methods to enhance recommendation accuracy. For instance, Venkata et al. [5] developed a hybrid recommendation system capable of dynamically adapting to changing user preferences, suitable for online learning platforms, while Sunny et al. [6] addressed cold-start and data sparsity issues through hybrid filtering algorithms.
In recent years, deep learning has achieved significant advances in recommendation systems, particularly in neural collaborative filtering (NCF), attention mechanisms, and sequential recommendation models. NCF enhances recommendation accuracy by mining complex nonlinear interactions between users and items through deep neural networks. Liu et al. [7] improved a music recommendation system by combining traditional collaborative filtering with NCF, significantly boosting recommendation accuracy. Furthermore, the application of attention mechanisms in recommendation systems has gained widespread recognition, as it dynamically adjusts weights between users and items to improve model precision. Hekmatfar et al. [8] proposed an attention model based on graph neural networks (GNNs) that effectively captures complex user–item relationships, while Zhao et al. [9] introduced a variational self-attention mechanism for a new sequential recommendation model, addressing the issue of uncertainty in user preferences. Sequential recommendation models focus on the dynamic evolution of user interests; for instance, Xu et al. [10] combined long-term and short-term preferences in their Long-Short Term Self-Attention Network (LSSA), significantly improving recommendation accuracy. Li et al. [11] proposed the Time Interval Aware Self-Attention Model (TiSASRec), further enhancing the real-time accuracy of recommendations.
Although hybrid recommendation and deep learning methods have improved recommendation performance to some extent, single-modality approaches remain insufficient to fully capture users’ multidimensional interests and needs. Multimodal learning therefore presents a promising way to improve recommendation performance in complex scenarios.

2.2. Multimodal Learning

Multimodal data (such as text, images, and audio) provide systems with rich features, and feature fusion is a core task in multimodal learning, aiming to effectively integrate diverse modal data to acquire comprehensive feature representations. Zhang [12] proposed a multimodal pretraining framework that combines user behavioral sequences with item multimodal content via contrastive learning, significantly improving recommendation system performance in cold-start and cross-domain recommendations. Malitesta [13] developed a unified framework, Ducho, to standardize multimodal feature extraction methods and streamline the integration process.
Cross-modal learning focuses on enabling knowledge transfer and representation fusion across different modalities. Liu [14] proposed a decoupled multimodal representation learning model to address weight allocation issues among different modalities, effectively capturing user preferences and enhancing recommendation performance. Wang [15] developed a model that enhances recommendation accuracy and interpretability by learning complementary and common information across modalities.
Multimodal recommendations provide more precise recommendations by fusing multimodal data. Yang [16] introduced a modality-aware contrastive learning method that, through data augmentation and modality awareness, significantly improved recommendation effectiveness on short-video and e-commerce platforms, addressing data sparsity and noise issues. Mu and Wu [17] designed a multimodal deep learning-based movie recommendation system that alleviates cold-start and data sparsity issues by analyzing multimodal features of movies.
While multimodal learning enhances recommendation accuracy and performance by integrating multi-source data, it still faces challenges in capturing complex relationships between data and providing interpretability. Knowledge graphs, as structured information representation tools, can further enhance the system’s understanding of multimodal data by explicitly representing semantic associations and logical paths between users and items.

2.3. Knowledge Graph Recommendations

Knowledge graphs introduce rich semantic information and structured knowledge to recommendation systems, not only supplementing implicit relationships within multimodal data but also visualizing the connections between user preferences and item features through graph structures, thereby significantly enhancing recommendation accuracy and interpretability. The KGCL framework proposed by Yang [18] improves recommendation performance by suppressing noise within the knowledge graph, making it suitable for sparse interaction data scenarios. Huang [19] developed a path-enhanced recursive network model (PeRN), which enhances recommendation interpretability and addresses cold-start issues by mining multi-hop relational paths. Liu [20] proposed a multi-level aggregation-enhanced model that integrates interaction information and knowledge graph connectivity information, improving higher-order connectivity and expressiveness.
Knowledge-enhanced sequential recommendations combine knowledge graphs with sequential models to capture users’ long-term interests and short-term preferences more accurately. Hou [21] reviewed knowledge graph-based recommendation systems, discussing the role of knowledge graphs as auxiliary information. Sun [22] introduced a unified framework, MUKG, based on multi-task learning and knowledge graphs, which significantly improves recommendation performance, especially in multi-task learning scenarios. Chen [23] designed a new knowledge-enhanced graph convolutional network incorporating hypersurface geometry models, effectively modeling complex user–item interactions.
Knowledge graph-based interpretable recommendations improve recommendation transparency and system interpretability by integrating knowledge graphs with recommendation systems. Guo [24] reviewed knowledge graph-based recommendation systems, highlighting how knowledge graphs mitigate data sparsity and cold-start issues while providing more interpretable recommendations. Yang and Dong’s [25] hierarchical attention graph convolutional network (HAGERec) significantly enhances recommendation interpretability and accuracy by incorporating semantic information. He and Ke [26] reviewed recent research on knowledge graph-based recommendation systems, discussing applications in various domains and proposing future research directions.

3. Graph-Enhanced Multimodal Recommendation Method: GEMRec

This section details the algorithm design of the graph-enhanced multimodal recommendation method (GEMRec) and highlights its innovations. GEMRec combines multimodal feature fusion, cross-modal relationship capture, and graph edit distance-based similarity search to deliver a personalized and interpretable recommendation system. Each innovation and its technical details are described below (Figure 1).

3.1. Multimodal Feature Preprocessing

The main goal of text feature extraction is to obtain meaningful linguistic information from course descriptions, titles, subtitles, and other textual data. Initially, a pretrained BERT model is used to extract contextual definitions of words, and these are combined with bag-of-words (BoW) and TF-IDF features. A multilayer perceptron (MLP) network is designed to integrate BERT outputs with BoW and TF-IDF features, thereby capturing both semantic depth and statistical information, as follows:
$h_{\mathrm{BERT}} = H_0, \quad h_{\mathrm{text}} = \mathrm{MLP}([h_{\mathrm{BERT}}; v_{\mathrm{BoW}}; v_{\mathrm{TF\text{-}IDF}}])$
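As a concrete illustration of this fusion step, the following PyTorch sketch combines a BERT sentence embedding with BoW and TF-IDF vectors through an MLP. The class name, vocabulary sizes, and hidden dimensions are illustrative assumptions rather than the exact GEMRec configuration (apart from the 768-dimensional BERT output noted in Section 4.1.2).

```python
import torch
import torch.nn as nn

class TextFeatureFusion(nn.Module):
    """Computes h_text = MLP([h_BERT ; v_BoW ; v_TF-IDF])."""
    def __init__(self, bert_dim=768, bow_dim=5000, tfidf_dim=5000, out_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bert_dim + bow_dim + tfidf_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, out_dim),
        )

    def forward(self, h_bert, v_bow, v_tfidf):
        # Concatenate the semantic (BERT) and statistical (BoW, TF-IDF) views of the text
        return self.mlp(torch.cat([h_bert, v_bow, v_tfidf], dim=-1))
```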
The goal of video feature extraction is to capture the visual and temporal information within course videos. We use a combined approach, first applying a 3D Convolutional Neural Network (C3D) to capture spatial-temporal features $f_{\mathrm{C3D}}$, and then incorporating a 2D CNN with an LSTM to model sequential relationships. The video frames are uniformly sampled to extract a series of key frames $\{I_1, I_2, \ldots, I_m\}$, where each frame $I_i$ undergoes 2D CNN processing to derive spatial features $f_i = \mathrm{CNN}_{2D}(I_i), \; i \in [1, m]$. These spatial features are then fed into an LSTM network to capture temporal dependencies, with the final hidden state $h_m$ representing the sequence.
To merge features from C3D and LSTM, both short-term and long-term information is integrated to enhance the video feature representation, as follows:
$\alpha = \mathrm{softmax}(W[f_{\mathrm{C3D}}; h_m] + b)$
$h_{\mathrm{video}} = \alpha \, f_{\mathrm{C3D}} + (1 - \alpha) \, h_m$
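A minimal sketch of this gated blend is shown below. Reading the softmax as producing the pair $(\alpha, 1 - \alpha)$ from two logits is an interpretation on our part, and the feature dimension is an assumption; both streams are taken to be projected to the same size beforehand.

```python
import torch
import torch.nn as nn

class VideoFeatureFusion(nn.Module):
    """Blends C3D features with the LSTM final hidden state:
    h_video = alpha * f_C3D + (1 - alpha) * h_m."""
    def __init__(self, dim=512):
        super().__init__()
        # Two logits -> softmax yields (alpha, 1 - alpha)
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, f_c3d, h_m):
        """f_c3d, h_m: (B, dim) feature vectors of matching size."""
        w = torch.softmax(self.gate(torch.cat([f_c3d, h_m], dim=-1)), dim=-1)
        alpha = w[:, :1]                      # (B, 1) mixing weight
        return alpha * f_c3d + (1 - alpha) * h_m
```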
Audio feature extraction aims to capture acoustic and speech information in course audio. We use a Convolutional Recurrent Neural Network (CRNN) as the main feature extractor to extract local acoustic features F through convolutional layers, capture temporal dependencies H through recurrent layers, and then aggregate the features through a self-attention mechanism to generate a comprehensive audio representation, as follows:
$A = \mathrm{softmax}(W_a \tanh(W_h H))$
where $W_a$ and $W_h$ are trainable parameter matrices, and $A$ is the attention weight matrix.
To supplement the CRNN output with spectral information, we use Mel-frequency cepstral coefficients (MFCC) features, integrating them through a fully connected layer to further enhance fusion. This enables the model to fully leverage both deep learning and traditional audio features, as follows:
$h_{\mathrm{audio}} = \mathrm{ReLU}(W_1 h_{\mathrm{CRNN}} + b_1) + \mathrm{ReLU}(W_2 m_{\mathrm{MFCC}} + b_2)$
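The following sketch illustrates the attentive pooling over CRNN frame features and the MFCC fusion in PyTorch; the 256-dimensional CRNN output follows Section 4.1.2, while the MFCC and output dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AudioAggregator(nn.Module):
    """Self-attentive pooling A = softmax(W_a tanh(W_h H)), then fusion of the
    pooled CRNN vector with an MFCC vector via two ReLU branches."""
    def __init__(self, crnn_dim=256, mfcc_dim=40, out_dim=256, attn_dim=128):
        super().__init__()
        self.W_h = nn.Linear(crnn_dim, attn_dim, bias=False)
        self.W_a = nn.Linear(attn_dim, 1, bias=False)
        self.fc_crnn = nn.Linear(crnn_dim, out_dim)
        self.fc_mfcc = nn.Linear(mfcc_dim, out_dim)

    def forward(self, H, m_mfcc):
        """H: (B, T, crnn_dim) recurrent-layer outputs; m_mfcc: (B, mfcc_dim)."""
        A = torch.softmax(self.W_a(torch.tanh(self.W_h(H))), dim=1)  # (B, T, 1)
        h_crnn = (A * H).sum(dim=1)                                  # attention-pooled clip vector
        return torch.relu(self.fc_crnn(h_crnn)) + torch.relu(self.fc_mfcc(m_mfcc))
```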

3.2. Multimodal Feature Fusion and Entity Extraction Strategy

In entity relationship extraction tasks, effectively integrating multimodal data is crucial for constructing comprehensive and accurate knowledge graphs. Online course resources typically contain rich text, video, and audio information, each modality describing course content from different dimensions. For instance, visual demonstrations in videos clearly illustrate hierarchical relationships between concepts, instructors’ gestures and expressions convey concept importance, and tonal variations and emphasis in audio highlight key knowledge points. Therefore, this section proposes an end-to-end multimodal entity relationship extraction framework that enhances extraction accuracy and completeness through deep fusion of these complementary information sources.
As illustrated in Figure 2, the process of multimodal entity relationship construction involves multiple steps, from data preprocessing to final knowledge graph generation.

3.2.1. Cross-Modal Preprocessing

For video data processing, this study employs deep learning-based computer vision techniques to extract entity information. The process begins with uniform sampling to obtain video keyframe sequences. Each keyframe is processed using the YOLOv5 object detection model to identify visual entities such as charts and formulas in teaching scenarios. Additionally, Optical Character Recognition (OCR) technology extracts textual information displayed on screens, while scene segmentation networks understand the overall semantic structure of teaching scenarios. These visual features not only supplement concept relationships not explicitly stated in text but also capture the progression of knowledge points through temporal analysis, as defined by
$V_t = f_{\mathrm{visual}}(\mathrm{YOLOv5}(F_t), \mathrm{OCR}(F_t), \mathrm{Scene}(F_t))$
where $V_t$ represents the comprehensive visual features at time $t$, and $F_t$ denotes the keyframe at time $t$.
For audio data, we designed a multi-level processing pipeline. First, Automatic Speech Recognition (ASR) technology converts speech to text, preserving verbal expressions from the lectures. Second, acoustic feature analysis examines volume, speech rate, and pauses to identify emphasized concepts. The acoustic features are encoded as
$A_t = f_{\mathrm{acoustic}}([\mathrm{volume}_t, \mathrm{pitch}_t, \mathrm{pause}_t])$
Furthermore, dialogue behavior analysis technology captures concept associations from teacher–student interactions, particularly evident in question–answer sessions and discussions.
To ensure temporal consistency across different modal features, we established a timestamp-based alignment mechanism. By mapping video frames, audio segments, and textual content onto a unified timeline, we achieve synchronized multimodal data analysis. This alignment mechanism provides the foundation for subsequent feature fusion, enabling the system to comprehensively utilize advantageous information from each modality, formulated as
$M_t = [V_t; T_t; A_t]$
where $M_t$ represents the aligned multimodal features at time $t$, and $T_t$ denotes the textual features.
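A simple timestamp-based alignment can be sketched as follows: each video keyframe time $t$ is matched with the nearest audio segment and subtitle sentence, and the features are concatenated into $M_t$. Nearest-timestamp matching and the NumPy interface are assumptions of this sketch.

```python
import numpy as np

def align_modalities(video_feats, video_ts, audio_feats, audio_ts, text_feats, text_ts):
    """Aligns per-frame video, per-segment audio, and per-sentence text features
    on the video timeline and concatenates them as M_t = [V_t ; T_t ; A_t].
    Each *_feats is an (N_i, d_i) array with a matching (N_i,) timestamp vector."""
    aligned = []
    for t, v in zip(video_ts, video_feats):
        a = audio_feats[np.argmin(np.abs(audio_ts - t))]   # nearest audio segment
        x = text_feats[np.argmin(np.abs(text_ts - t))]     # nearest subtitle sentence
        aligned.append(np.concatenate([v, x, a]))
    return np.stack(aligned)                               # (N_video, d_v + d_t + d_a)
```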

3.2.2. Cross-Modal Relationship Capture

Before feature fusion, it is necessary to address the inconsistency in dimensions and scales of different modal features. We use feature alignment techniques to map text, video, and audio features into the same dimensional space, ensuring these features can be effectively compared and fused in the same semantic space. A specific projection network is designed for each modality, and feature normalization is performed at the application layer to unify the scale and distribution of features. Based on this, a cross-modal attention mechanism is introduced so that each modality can focus on relevant information from other modalities, thereby achieving deep feature fusion.
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$
Here, $h$ represents the number of attention heads, and the output projection matrix $W^O$ linearly transforms the fused features. We perform this cross-modal attention calculation for all modality pairs to obtain enhanced feature representations.
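The cross-modal attention step can be sketched with a standard multi-head attention module, where one modality provides the queries and another provides the keys and values. The 8-head setting follows Section 4.1.2; the shared model dimension is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One direction of cross-modal attention:
    MultiHead(Q, K, V) with Q from the query modality, K/V from the context modality."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, query_feats, context_feats):
        """query_feats: (B, L_q, d); context_feats: (B, L_k, d) from another modality."""
        out, _ = self.attn(query_feats, context_feats, context_feats)
        return out  # query modality enhanced with information from the context modality
```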

3.2.3. Multimodal Feature Fusion Strategy

The effective integration of multimodal features requires careful consideration of both feature alignment and cross-modal interactions. Our fusion strategy addresses these challenges through a comprehensive multi-stage approach that ensures optimal information utilization from each modality while preserving their complementary characteristics.
Initially, we address the dimensional inconsistency among different modal features through a specialized feature alignment network. This network projects features from each modality into a shared semantic space while preserving their essential characteristics. For each modality m, we define a projection function as follows:
$h_m = W_m x_m + b_m$
where $W_m$ and $b_m$ are learnable parameters specific to each modality, and $x_m$ represents the original features.
To capture the complex interactions between different modalities, we implement a cross-modal attention mechanism that allows each modality to focus on relevant information from others. The attention mechanism is formulated as a multi-head structure that can capture different aspects of cross-modal relationships, as follows:
$\mathrm{Attention}_{\mathrm{cross}}(Q_i, K_j, V_j) = \mathrm{softmax}\!\left(\frac{Q_i K_j^{\top}}{\sqrt{d_k}}\right) V_j$
where $Q_i$, $K_j$, and $V_j$ represent the query, key, and value matrices from modalities $i$ and $j$, respectively, and $d_k$ is the dimension of the key vectors.
To adaptively integrate information from different modalities while maintaining their unique contributions, we introduce a gated fusion mechanism. This mechanism learns to weigh the importance of each modality dynamically based on the current context.
$g_m = \sigma(W_g [h_{\mathrm{text}}; h_{\mathrm{video}}; h_{\mathrm{audio}}] + b_g)$
$h_{\mathrm{fused}} = \sum_{m} g_m \odot h_m$
where $g_m$ represents the importance weight for modality $m$, $\sigma$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication.
The fusion process is further enhanced by incorporating residual connections to preserve the original information and facilitate gradient flow during training as follows:
$h_{\mathrm{final}} = \mathrm{LayerNorm}\!\left(h_{\mathrm{fused}} + \alpha \sum_{m} h_m\right)$
where $\alpha$ is a learnable parameter that controls the contribution of the residual connection, and $\mathrm{LayerNorm}$ represents layer normalization to stabilize the learning process.
This sophisticated fusion strategy ensures that the model can effectively leverage complementary information from different modalities while being robust to potential noise or missing data in any single modality. The resulting fused representations capture rich semantic relationships that might be overlooked when considering each modality in isolation.
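The gating, residual connection, and layer normalization described above can be sketched as follows. The scalar-per-modality gate is one reading of the gating equation, and the dimensions are assumptions; all modality vectors are taken to be already projected into the shared space.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """g_m = sigmoid(W_g [h_text ; h_video ; h_audio] + b_g),
    h_fused = sum_m g_m * h_m,
    h_final = LayerNorm(h_fused + alpha * sum_m h_m)."""
    def __init__(self, dim=512, n_modalities=3):
        super().__init__()
        self.gate = nn.Linear(n_modalities * dim, n_modalities)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable residual weight
        self.norm = nn.LayerNorm(dim)

    def forward(self, h_text, h_video, h_audio):
        mods = torch.stack([h_text, h_video, h_audio], dim=1)          # (B, 3, dim)
        g = torch.sigmoid(self.gate(torch.cat([h_text, h_video, h_audio], dim=-1)))
        h_fused = (g.unsqueeze(-1) * mods).sum(dim=1)                  # gated sum over modalities
        return self.norm(h_fused + self.alpha * mods.sum(dim=1))       # residual + LayerNorm
```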

3.2.4. Entity and Relationship Extraction

This section presents MEREF (Multimodal Entity-Relation Extraction Framework), a unified framework for extracting entities and relations from multimodal data. MEREF transcends the limitations of traditional text-only approaches by deeply integrating textual, visual, and audio features, enabling end-to-end optimization of entity and relation extraction. The framework’s innovation lies in leveraging multimodal feature synergy to enhance entity recognition accuracy while establishing more precise entity relationship networks through graph-based reasoning.
In the entity recognition phase, the framework extends the BiLSTM-CRF model with a Multimodal Feature Fusion Gate (MFFG) mechanism. Unlike traditional approaches, MEREF incorporates temporally aligned video and audio features alongside word embeddings. For pretrained word embeddings $w_i$, the framework first maps them to dense vector representations through a feature alignment network, as follows:
$e_i = \mathrm{MLP}([w_i; v_i; a_i])$
where $v_i$ and $a_i$ represent the video and audio feature vectors corresponding to word token $w_i$. This early fusion of multimodal features provides a richer information foundation for subsequent sequence modeling. The feature fusion process is implemented through a dynamic gating mechanism as follows:
$g_t = \mathrm{sigmoid}(W_g [e_i; v_i; a_i] + b_g)$
$h_t = g_t \odot \mathrm{concat}(e_i, v_i, a_i)$
The BiLSTM layer captures long-range dependencies in the sequence through bidirectional processing as follows:
$\overrightarrow{h_i} = \mathrm{LSTM}_f(e_i, \overrightarrow{h_{i-1}})$
$\overleftarrow{h_i} = \mathrm{LSTM}_b(e_i, \overleftarrow{h_{i+1}})$
$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$
The CRF layer considers global constraints in the label sequence, computing the probability of the optimal label sequence as follows:
$s(X, Y) = \sum_{i=1}^{n} \left( W_{y_i} h_i + b_{y_i} + T_{y_{i-1}, y_i} \right)$
where T represents the label transition matrix, modeling dependencies between adjacent labels. This multimodal-enhanced sequence labeling method comprehensively utilizes information from different modalities to improve entity boundary recognition accuracy.
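The sequence-scoring term of the CRF layer can be sketched as below; it evaluates $s(X, Y)$ for a given label sequence from emission scores and the transition matrix $T$. The label inventory, hidden size, and explicit start score are assumptions, and the partition function required for training is omitted for brevity.

```python
import torch
import torch.nn as nn

class CRFScorer(nn.Module):
    """Scores a label sequence: s(X, Y) = sum_i (emission_i[y_i] + T[y_{i-1}, y_i])."""
    def __init__(self, hidden_dim=256, n_labels=9):
        super().__init__()
        self.emission = nn.Linear(hidden_dim, n_labels)              # W_y h_i + b_y
        self.trans = nn.Parameter(torch.zeros(n_labels, n_labels))   # T[y_{i-1}, y_i]
        self.start = nn.Parameter(torch.zeros(n_labels))             # score of the first label

    def score(self, h, y):
        """h: (seq_len, hidden_dim) BiLSTM outputs; y: (seq_len,) gold label ids."""
        emit = self.emission(h)                                      # (seq_len, n_labels)
        s = self.start[y[0]] + emit[0, y[0]]
        for i in range(1, len(y)):
            s = s + emit[i, y[i]] + self.trans[y[i - 1], y[i]]
        return s
```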
The relation classification module employs a multi-level reasoning architecture based on graph attention networks. For identified entity pairs $\langle e_1, e_2 \rangle$, the module first extracts multimodal context features:
$e_1 = e_1 + W_{e_1} h_{\mathrm{final}}$
$e_2 = e_2 + W_{e_2} h_{\mathrm{final}}$
where $h_{\mathrm{final}}$ represents the fused multimodal features. The relationship classification between entity pairs is achieved through an attention-based mechanism. For each entity pair, we first compute an attention weight $\alpha_{ij}$ that captures the importance of their interaction as follows:
$\alpha_{ij} = \mathrm{attention}(e_i, e_j, C_{ij})$
In this formula, $e_i$ and $e_j$ represent the enhanced entity embeddings that incorporate multimodal features, while $C_{ij}$ encodes the contextual information between these entities, including textual context from course descriptions, visual context from video segments, and temporal relationships from the learning sequence. The attention function computes a normalized importance score through learnable parameters, enabling the model to focus on relevant contextual features for relationship determination.
Based on these attention weights, the probability distribution over possible relation types is computed as
$r_{ij} = \mathrm{softmax}(W_r [e_1; e_2; f_{\mathrm{fused}}] + b_r)$
where $W_r$ is a learnable transformation matrix that projects the concatenated entity and context representations into the relation space, and $b_r$ is a bias term. This formulation allows the model to consider both entity-specific features and their contextual interactions when determining relationships. The softmax operation normalizes the output to obtain probability scores for each possible relation type, ensuring that the model can capture various educational relationships such as prerequisites, correlations, and hierarchical structures.
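A minimal sketch of this classification head is given below; the entity and context dimensions and the set of relation types are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Predicts r_ij = softmax(W_r [e1 ; e2 ; f_fused] + b_r) over relation types
    such as prerequisite, correlation, and hierarchy (an assumed inventory)."""
    def __init__(self, ent_dim=256, ctx_dim=512, n_relations=4):
        super().__init__()
        self.proj = nn.Linear(2 * ent_dim + ctx_dim, n_relations)

    def forward(self, e1, e2, f_fused):
        logits = self.proj(torch.cat([e1, e2, f_fused], dim=-1))
        return torch.softmax(logits, dim=-1)  # probability over relation types
```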
To capture complex relationship patterns beyond direct entity pairs, we incorporate a path-aware reasoning mechanism based on graph attention networks. For each entity pair $(e_i, e_j)$, the model aggregates information from multiple possible paths in the knowledge graph as follows:
$h_{\mathrm{path}} = \mathrm{GAT}(e_i, e_j, P_{ij}) = \sum_{k} \alpha_k \cdot f_{\theta}(p_k)$
where $P_{ij}$ represents the set of valid paths between entities $e_i$ and $e_j$, $\alpha_k$ denotes the attention weight for path $p_k$, computed based on path relevance to the current relationship classification task, and $f_{\theta}$ is a learnable transformation function that encodes path semantics.
The final relationship representation combines both direct entity interaction and path-based reasoning. This dual-perspective approach enables the model to leverage both local entity features and global graph structure information.

3.3. Similarity Search Based on Knowledge Graph Edit Distance

This section describes the similarity search over the course knowledge graph constructed in Section 3.2 and its use for recommendation. The core idea is to use a Graph Neural Network (GNN) to model and approximate the graph edit distance. The specific process is depicted in Figure 3.
Before discussing the graph edit distance calculation between the learner’s knowledge state and course content, it is essential to explain how the learner’s knowledge state graph $G_{\mathrm{learner}}$ is constructed. The knowledge state graph represents a learner’s current understanding and mastery level of different concepts through a structured representation.
The construction of $G_{\mathrm{learner}}$ involves analyzing learning interaction sequences and concept relationships. For each knowledge concept $k_i$ in the learning process, we calculate its mastery level through
$s(k_i) = \sum_{j} \beta_j \cdot I(k_i, b_j)$
where $b_j$ represents learning behaviors related to concept $k_i$ (such as video watching duration, exercise completion, and quiz performance), $I(k_i, b_j)$ measures the contribution of learning behavior $b_j$ to the mastery of concept $k_i$, and $\beta_j$ is the weight coefficient determined by behavior importance.
The relationship strength between concepts, $r(k_i, k_j)$, is computed based on both course structure and learning sequence, as follows:
$r(k_i, k_j) = \lambda_1 \cdot c(k_i, k_j) + \lambda_2 \cdot t(k_i, k_j)$
where $c(k_i, k_j)$ indicates the correlation in course structure, $t(k_i, k_j)$ represents temporal correlation strength, and $\lambda_1$, $\lambda_2$ are balancing parameters.
The knowledge state graph $G_{\mathrm{learner}}$ is then formally defined as
$G_{\mathrm{learner}} = (V_l, E_l, W_l)$
where $V_l$ is the set of concept nodes, $E_l$ represents the edge set, and $W_l$ denotes the weight matrices for both nodes and edges.
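The construction of $G_{\mathrm{learner}}$ can be sketched as follows: behavior-level evidence is aggregated into per-concept mastery scores $s(k_i)$, pairwise strengths $r(k_i, k_j)$ are computed from structural and temporal correlations, and edges above a threshold are retained. The thresholding rule and the specific weight values are assumptions of the sketch.

```python
import numpy as np

def build_learner_graph(behavior_scores, behavior_weights,
                        structural_corr, temporal_corr,
                        lam1=0.6, lam2=0.4, edge_threshold=0.2):
    """Builds G_learner = (V_l, E_l, W_l).
    behavior_scores: (n_concepts, n_behaviors) matrix of I(k_i, b_j);
    behavior_weights: (n_behaviors,) vector of beta_j;
    structural_corr / temporal_corr: (n_concepts, n_concepts) matrices of
    c(k_i, k_j) and t(k_i, k_j)."""
    mastery = behavior_scores @ behavior_weights                 # s(k_i) per concept node
    strength = lam1 * structural_corr + lam2 * temporal_corr     # r(k_i, k_j)
    n = strength.shape[0]
    edges = [(i, j, strength[i, j])
             for i in range(n) for j in range(i + 1, n)
             if strength[i, j] >= edge_threshold]                # keep sufficiently strong relations
    return {"node_weights": mastery, "edges": edges}
```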
Given two graphs, $G_1$ and $G_2$, the graph edit distance $\mathrm{GED}(G_1, G_2)$ is defined as the minimum number of operations required to transform $G_1$ into $G_2$. Formally, this can be represented as
$\mathrm{GED}(G_1, G_2) = \min \{ |e| \;:\; e \in E(G_1, G_2) \}$
where $E(G_1, G_2)$ is the set of all possible edit paths from $G_1$ to $G_2$, and $|e|$ denotes the number of operations in the edit path $e$. These operations include adding/deleting nodes and adding/deleting edges.
To solve this NP-hard problem, we designed a multi-layer GNN model. The first layer is a node feature extraction layer, using a graph attention network (GAT) to capture local structural information. For node i, its feature update at layer l can be expressed as
$h_i^{(l+1)} = \sigma\!\left(\sum_{j \in N(i)} \alpha_{ij}^{(l)} W^{(l)} h_j^{(l)}\right)$
where $N(i)$ is the set of neighboring nodes of $i$, $\alpha_{ij}^{(l)}$ is the attention weight, $W^{(l)}$ is a parameter matrix, and $\sigma$ is the activation function. This design allows the model to adaptively aggregate information from neighboring nodes, effectively capturing the local graph structure.
$\alpha_{ij}^{(l)} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top} [W^{(l)} h_i^{(l)}; W^{(l)} h_j^{(l)}]\right)\right)}{\sum_{k \in N(i)} \exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top} [W^{(l)} h_i^{(l)}; W^{(l)} h_k^{(l)}]\right)\right)}$
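A single-head, dense-adjacency version of this attention layer can be sketched as follows; batching, multiple heads, and sparse adjacency handling are omitted, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """h_i' = relu(sum_j alpha_ij W h_j), with alpha_ij from a LeakyReLU-scored
    concatenation of the transformed node pair, masked to neighbors."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, X, adj):
        """X: (n, in_dim) node features; adj: (n, n) 0/1 adjacency including self-loops."""
        H = self.W(X)                                              # (n, out_dim)
        n = H.size(0)
        pairs = torch.cat([H.unsqueeze(1).expand(n, n, -1),
                           H.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.a(pairs).squeeze(-1))           # raw pairwise scores (n, n)
        scores = scores.masked_fill(adj == 0, float("-inf"))       # attend only to neighbors
        alpha = torch.softmax(scores, dim=-1)                      # attention weights alpha_ij
        return torch.relu(alpha @ H)
```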
After node feature extraction, a differentiable pooling operation (DiffPool) is used to assign nodes to different clusters, reducing the graph’s scale while preserving structural information. The pooling process is defined as
$S^{(l)} = \mathrm{softmax}\!\left(\mathrm{GNN}_{\mathrm{pool}}^{(l)}(A^{(l)}, X^{(l)})\right)$
$X^{(l+1)} = S^{(l)\top} X^{(l)}$
$A^{(l+1)} = S^{(l)\top} A^{(l)} S^{(l)}$
where $S^{(l)}$ is the soft assignment matrix, $A^{(l)}$ is the adjacency matrix, and $X^{(l)}$ is the node feature matrix. This pooling operation enables the model to handle graphs of varying sizes and extract hierarchical representations.
Finally, a Multi-Layer Perceptron maps the pooled graph representation to a fixed-dimensional embedding space, compressing the graph’s structural information into a fixed-dimensional vector for subsequent similarity calculations.
Through this GNN model, the graph edit distance calculation can be approximated as a distance measure in the embedding space. This approximation greatly reduces computational complexity, enabling real-time similarity calculation on large-scale knowledge graphs.
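One differentiable pooling step of this pipeline can be sketched as below, with a single linear graph convolution standing in for $\mathrm{GNN}_{\mathrm{pool}}$; the cluster count would follow the 1/4 node-reduction ratio reported in Section 4.1.2.

```python
import torch
import torch.nn as nn

class DiffPoolLayer(nn.Module):
    """One DiffPool step: predict a soft cluster assignment S, then coarsen
    features and adjacency. Dense adjacency, minimal GNN_pool stand-in."""
    def __init__(self, in_dim, n_clusters):
        super().__init__()
        self.gnn_pool = nn.Linear(in_dim, n_clusters)

    def forward(self, A, X):
        """A: (n, n) adjacency; X: (n, in_dim) node features."""
        S = torch.softmax(self.gnn_pool(A @ X), dim=-1)   # S = softmax(GNN_pool(A, X)), (n, c)
        X_next = S.transpose(-2, -1) @ X                  # X^{(l+1)} = S^T X, (c, in_dim)
        A_next = S.transpose(-2, -1) @ A @ S              # A^{(l+1)} = S^T A S, (c, c)
        return S, X_next, A_next
```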
In practical applications, the learner’s knowledge state is represented as a subgraph $G_{\mathrm{learner}}$ of the knowledge graph, and each course’s content is also represented as a subgraph $G_{\mathrm{course}}$. By calculating the approximate graph edit distance between these two subgraphs, we can obtain the similarity between the learner’s knowledge state and the course content:
$\mathrm{Sim}(\mathrm{learner}, \mathrm{course}) = \dfrac{1}{1 + \mathrm{GED}_{\mathrm{approx}}(G_{\mathrm{learner}}, G_{\mathrm{course}})}$

3.4. Interpretable Methods for Graph-Semantic Enhancement

In a personalized course recommendation system, providing interpretable recommendation results not only increases user trust but also helps learners better understand the reasons behind the recommendations. This section discusses two core interpretability methods in detail: SHAP-based feature importance analysis and a recommendation explanation generation algorithm. As shown in Figure 4, the interpretable recommendation framework combines SHAP-based feature analysis with semantic enhancement to provide transparent and meaningful explanations.

3.4.1. SHAP-Based Feature Importance Analysis

SHAP (SHapley Additive exPlanations) is a game theory-based method used to interpret the predictions of machine learning models. In our course recommendation system, SHAP values help us understand the contribution of each feature to the recommendation decision.
We define a recommendation model f, which maps input features x to a recommendation score y, where x may include features such as the student’s historical learning records, interest preferences, and node attributes within the knowledge graph. The core idea of the SHAP method is to calculate the marginal contribution of each feature to the final prediction. For a feature i, its SHAP value ϕ i is defined as follows:
$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \dfrac{|S|! \,(|N| - |S| - 1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]$
where $N$ is the set of all features, $S$ is a subset of features excluding $i$, and $f(S)$ represents the model’s prediction when only using the feature subset $S$.
However, directly computing SHAP values has exponential complexity, making it challenging to apply in large-scale recommendation systems. To address this, we use an approximation algorithm that employs Monte Carlo sampling to estimate SHAP values. For each feature, we randomly select $m$ feature subsets and compute the average marginal contribution as follows:
$\phi_i \approx \dfrac{1}{m} \sum_{j=1}^{m} \left[ f(S_j \cup \{i\}) - f(S_j) \right]$
where $S_j$ is the $j$-th randomly selected feature subset. This allows us to approximate the importance of each feature in the recommendation decision.
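A Monte Carlo estimator of this kind can be sketched as follows. Replacing absent features with baseline values is a common masking convention and an assumption of the sketch, as is the uniform random choice of subsets.

```python
import numpy as np

def approx_shap(f, x, baseline, feature_idx, m=100, rng=None):
    """Estimates phi_i ~ (1/m) * sum_j [f(S_j U {i}) - f(S_j)] for one feature.
    f maps a feature vector to a recommendation score; x and baseline are
    NumPy arrays of the same length."""
    rng = rng or np.random.default_rng(0)
    n = len(x)
    phi = 0.0
    for _ in range(m):
        mask = rng.random(n) < 0.5            # random subset S_j of present features
        mask[feature_idx] = False             # ensure feature i is not in S_j
        without_i = x.copy()
        without_i[~mask] = baseline[~mask]    # only S_j kept, rest set to baseline
        keep = mask.copy()
        keep[feature_idx] = True              # S_j plus feature i
        with_i = x.copy()
        with_i[~keep] = baseline[~keep]
        phi += f(with_i) - f(without_i)       # marginal contribution of feature i
    return phi / m
```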

3.4.2. Graph-Semantic Enhancement with Large Models

After understanding feature importance, generating readable recommendation explanations becomes crucial. To achieve this, we designed a multi-stage recommendation explanation generation algorithm, combining template-based methods with natural language generation techniques to provide explanations for each recommendation. First, features that impact the recommendation decision are ranked based on SHAP values, and the Top-K most important features (typically 3 to 5) are selected as the basis for generating the recommendation explanation, balancing completeness and conciseness.
Once the key features are identified, the system maps each feature to a corresponding text template. For example, for “knowledge graph concept similarity”, a high similarity reason might be: “This course covers concepts that are highly relevant to your current knowledge base”. In cases of low similarity, the explanation focuses on expanding the learner’s knowledge domain. Additionally, by incorporating user background information, the system uses a context-aware mechanism and the user’s learning history to select the most appropriate template, with a pretrained language model used to polish the explanation, ensuring a natural and fluent flow. Based on user preferences, the system also fine-tunes the explanation to maintain personalization and brevity.
To avoid repetitive explanations, the system includes a dynamic template pool that records recently used templates and prioritizes varied expressions. This approach ensures that the final generated recommendation explanations are personalized and evidence-based and that they enhance the interpretability of recommendations.
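The explanation pipeline described above can be sketched as follows: features are ranked by absolute SHAP value, the top-k are mapped to templates, and recently used wordings are skipped. The feature names and template texts here are purely illustrative, and the language-model polishing step is omitted.

```python
import random

# Hypothetical feature-to-template mapping; wording and feature names are illustrative only.
TEMPLATES = {
    "kg_concept_similarity": [
        "This course covers concepts that are highly relevant to your current knowledge base.",
        "Its topics connect directly to what you have already mastered.",
    ],
    "current_learning_goal": ["It matches the learning goal you set recently."],
    "course_rating": ["Learners with a similar background rated this course highly."],
}

def generate_explanation(shap_values, top_k=3, recent=None):
    """Ranks features by |SHAP value|, keeps the top-k, and fills templates,
    skipping wordings used recently (dynamic template pool)."""
    recent = set(recent or [])
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    sentences = []
    for feature, _ in ranked:
        options = [t for t in TEMPLATES.get(feature, []) if t not in recent]
        if options:
            sentences.append(random.choice(options))
    return " ".join(sentences)
```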

4. Experiments and Results Analysis

4.1. Experimental Design

This section provides a detailed description of the experimental design, including the dataset, data processing, experimental setup, and the specific objectives of the experiments.

4.1.1. Dataset Introduction

This experiment utilizes the MOOCCubeX dataset [27] provided by Tsinghua University’s Knowledge Engineering Laboratory. Supported by “XuetangX”, one of China’s largest MOOC platforms, MOOCCubeX is one of the most comprehensive open education datasets in terms of scale and coverage. It encompasses multiple modalities of course resources and student behavior data, providing extensive research material and analytical bases.
The MOOCCubeX dataset mainly comprises two parts:
(1)
Course Resource Data
This includes course information, videos, exercises, and course concepts. The course information section covers 4216 courses, video data contains metadata for 230,263 videos, and exercise data includes 358,265 exercises with over 1.2 million related questions. The course concepts section includes 637,572 fine-grained course concepts, providing links between these concepts and the corresponding videos and exercises.
(2)
Student Behavior Data
This part consists of user profiles for 3,330,294 students, over 296 million video viewing records, and 21 GB of exercise responses. These data provide a robust foundation for modeling and analyzing learner behavior.
Figure 5 illustrates the structure and processing flow of the MOOCCubeX dataset, encompassing the handling of course resources and student behavior data, concept extraction, and knowledge graph construction. Course resources cover videos, exercises, and course concepts, while student behavior data records user learning activities. Through a concept acquisition module, the system associates fine-grained course concepts with learning resources to build a concept graph, which, in conjunction with external resources (e.g., academic papers, technical Q&A), expands the knowledge coverage.

4.1.2. Experimental Setup

For this study, 500 representative computer science courses were selected from the MOOCCubeX dataset based on criteria such as concept coverage, video quality, and student engagement. The video data underwent basic preprocessing of text, video, and audio features, with named entity recognition applied to extract concept entities from course descriptions and subtitles. This multimodal information was integrated to enhance the accuracy of the knowledge graph. Learner behavior data was filtered to retain records related only to computer science courses, and personalized knowledge subgraphs were generated based on historical behavior to support the recommendation model.
The experimental hardware included an Intel Xeon Gold 6248R CPU, an NVIDIA RTX 4090 GPU (24 GB), and 64 GB of memory (Santa Clara, CA, USA), with the software environment comprising Ubuntu 20.04 LTS and Python 3.9. MOOCCubeX, as a publicly available dataset, was processed according to standard protocols, with a focus on multimodal information utilization and knowledge graph construction.
For model implementation, we set the following hyperparameters based on preliminary experiments and model architecture requirements. In the multimodal feature preprocessing stage, we employed a pretrained base model with 768 hidden dimensions for text feature extraction. For video processing, we sampled frames at 2 frames per second and used a C3D network with 512-dimensional output features. The CRNN for audio processing was configured with 256 hidden units. In the multimodal fusion module, we implemented an 8-head attention mechanism with an attention dimension of 64 per head, and the gated fusion unit was set with a hidden dimension of 512. The graph attention network for knowledge graph processing consisted of 3 layers, each with 8 attention heads, and the differentiable pooling layer was configured to reduce the number of nodes to 1/4 of the original count. For model training, we utilized the Adam optimizer with an initial learning rate of 0.001, which was reduced by a factor of 0.1 every 20 epochs. The batch size was set to 256, and the model was trained for 100 epochs with early stopping monitored on validation loss with a patience of 10 epochs. The loss function combined recommendation loss and graph edit distance loss with weights of 1.0 and 0.3, respectively. We applied dropout with a rate of 0.3 and L2 regularization with a coefficient of 1 × 10−5 to prevent overfitting.
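For reference, a minimal sketch of how the optimizer and learning-rate schedule described above could be set up in PyTorch is shown below; the combined loss weighting (1.0 for the recommendation loss and 0.3 for the graph edit distance loss) would be applied inside the training loop, which is not shown.

```python
import torch

def build_optimizer(model):
    """Adam with lr = 1e-3, decayed by 0.1 every 20 epochs, L2 coefficient 1e-5,
    matching the reported training setup."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    return optimizer, scheduler
```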

4.2. Multimodal Knowledge Graph Construction Experiments

4.2.1. Multimodal Feature Fusion Effectiveness

To evaluate the impact of multimodal feature fusion, this section introduces different variables in the experiment to explore the effects of multimodal combinations and fusion strategies, as well as the relative importance of each type of feature. The experiment covers multiple scenarios, from single-modality and dual-modality combinations to full multimodal fusion, comparing various fusion strategies from simple feature concatenation to multimodal transformers. The dataset includes large-scale online course resources spanning text, video, and audio modalities. The evaluation metrics include accuracy, F1 score, and NDCG, which collectively reflect classification accuracy, a balance of precision and recall, and the ranking quality of recommendations.
Figure 6 shows a comparison of modality fusion effects. The results indicate that multimodal fusion has significant advantages in knowledge graph construction. The accuracy of the single-text modality is 0.75, which provides basic semantic representation but cannot capture the rich information contained in video and audio. The text and video combination performs well, achieving an accuracy of 0.82, as video data enhances semantic expression through visual demonstration and dynamic content, offering additional support for understanding complex concepts. In contrast, the text and audio combination achieves an accuracy of 0.75, as audio only supplements information in terms of emotion and tone, making it slightly less effective than the text and video combination. Full multimodal fusion (text, video, and audio) yields the best performance, with an accuracy of 0.88, indicating the strong complementarity of multimodal information. By integrating all modalities, the system can more comprehensively capture complex semantics and relationships.

4.2.2. Entity and Relationship Extraction Results

Through deep learning and distant supervision methods, this study successfully extracted a large number of entities and relationships from extensive course resources. Due to the vast size of the original database, which includes course resources across multiple disciplines, computer science courses were selected as examples to visually demonstrate the experimental results. This choice not only showcases the method’s effectiveness but also reflects current trends in popular technology fields and education.
For entity extraction, based on multimodal features and ranked by importance, three main entity types were identified: course names, knowledge points, and skills. This section does not include the following course entities in the visualizations: “course content”, “difficulty level”, “duration”, “rating”, “interactivity”, “multimedia type”, “teaching style”, “course structure”, “learning objectives”, “prerequisites”, “update frequency”, “language”, “certification type”, “course category”, “subject area”, “practical opportunities”, “assignment difficulty”, and “exam format”. The resulting entity relationship graph is shown in Figure 7.
The extracted relationships and their interconnections within the recommendation system are visualized in Figure 8, demonstrating the complex network of educational concepts and their associations.
Based on the extracted entities and relationships, a knowledge graph of computer science course resources was constructed, as shown in Figure 7. This graph illustrates the complex associations between courses, knowledge points, and skills, highlighting the interdisciplinary nature of the subject and the integration of theory and practice. Analysis reveals that the degree distribution of nodes follows a power law, with a few core concepts (such as “recommendation system” and “machine learning”) exhibiting high connectivity. The network displays a community structure, with subgraphs forming around topics such as machine learning and data structures. Centrality analysis indicates that nodes like “recommendation system” and “machine learning” play a critical role in knowledge transmission.

4.3. Knowledge Graph Similarity Calculation Experiment

This section provides a detailed analysis of the experimental results of the recommendation model based on graph edit distance within the course knowledge graph system. Our experiments cover aspects such as knowledge graph storage structure, cache performance, graph edit distance approximation accuracy, node embedding effectiveness, course similarity analysis, model training process, and a final comparison of recommendation accuracy.
Graph edit distance serves as a core component of our recommendation model. Figure 10 presents the accuracy analysis results of the graph edit distance approximation algorithm.
Figure 10 shows the approximation accuracy of the graph edit distance. The scatter plot indicates that most points are distributed near the ideal fitting line, demonstrating that our approximation algorithm can accurately estimate the actual graph edit distance. The calculated correlation coefficient is 0.9569, suggesting that using an efficient approximation method instead of an exact but time-consuming graph edit distance calculation can improve system response speed while maintaining accuracy. Figure 9 presents the similarity matrix between different courses in the form of a heat map.
The changes in the loss function and validation accuracy during model training are shown in Figure 11.
The curves for training and validation loss and accuracy indicate good model performance. As training progresses, the loss decreases from 0.6 to approximately 0.1, and accuracy increases from 0.4 to nearly 1.0, with validation accuracy reaching 0.9, indicating good generalization capability.
We compared the performance of different recommendation methods using Precision@10, Recall@10, and NDCG@10 as metrics. Precision@10 measures the accuracy of recommendations, Recall@10 reflects the system’s coverage of user preferences, and NDCG@10 evaluates the ranking quality of recommendations, with higher values indicating more accurate, comprehensive, and well-ordered recommendations.
As shown in Table 1, our method achieved a Precision@10 of 0.267, significantly outperforming various recommendation algorithms. This result demonstrates that the feature representation enhanced by multimodal fusion and knowledge graphs enables more accurate recommendations. For Recall@10, our model reached 0.265, exceeding most comparison methods except CAmgr, which also leverages multimodal information fusion and graph structure processing. Our model achieved an NDCG@10 of 0.297, surpassing other algorithms, including CAmgr, which validates that our unique combination of cross-attention mechanism and knowledge graph path inference optimizes recommendation ranking more effectively. While CAmgr demonstrates strong performance due to its similar technical approach of incorporating multimodal fusion and graph processing, our method’s superior results can be attributed to the innovative integration of graph edit distance and differentiable pooling. Traditional models and earlier deep learning approaches performed less effectively, mainly due to their limited ability to process multimodal information and leverage knowledge graph structures.
Notably, in cold-start situations (with fewer than five user interactions), our method performed even better, as shown in Figure 12.
As shown in Figure 12, we specifically evaluated the algorithms’ performance under cold-start conditions, where users have fewer than five interactions with the system. This scenario is particularly challenging yet crucial in educational recommendations, as new learners need accurate course suggestions before accumulating sufficient interaction history. Our GEMRec method achieved an NDCG@10 of 0.372 in these cold-start scenarios, outperforming both KGCN (0.348) and CAmgr (0.364). Most notably, the traditional collaborative filtering methods (user-based CF: 0.185, item-based CF: 0.201) performed significantly worse due to their heavy reliance on user interaction histories. This demonstrates that GEMRec’s approach of leveraging course knowledge structure through graph edit distance and comprehensive content understanding through multimodal features is particularly effective for addressing the cold-start challenge in educational recommendations.

4.4. Recommendation Interpretability Experiment

In this section, we focus on evaluating the interpretability of the recommendation system, including SHAP-based feature importance analysis, quality assessment of generated recommendation explanations, and the impact of interpretability on user experience.
We used SHAP (SHapley Additive exPlanations) values to analyze the contribution of different features to the recommendation results. Figure 13 shows the global feature importance ranking.
From Figure 13, we can see significant differences in the influence of various features on the recommendation system. The x-axis represents feature importance, and the y-axis shows the average impact of each feature. Features in the upper right corner have a substantial impact on the system’s decisions, while those in the lower left have a relatively smaller impact. The size of each dot represents feature usage frequency, with larger dots indicating higher frequency. Among these, features like “knowledge graph coverage”, “current learning goal”, and “course rating” have higher SHAP values, indicating their crucial role in the model’s decision-making. For example, knowledge graph coverage effectively reflects the match between user learning needs and the course, thus enhancing recommendation accuracy.
In contrast, features such as “personalized recommendation score”, “interactivity”, and “available time” have lower SHAP values, indicating that they have a relatively smaller impact on the final result. Although these features may be useful in specific user scenarios, their overall contribution to recommendation effectiveness is limited.
We evaluated the generated recommendation explanations through both automated and manual assessments, as shown in Table 2.
In the automated evaluation, the model-generated explanations show a certain degree of linguistic complexity, with a BLEU-4 score of 0.42, indicating moderate similarity between generated text and reference text in terms of syntax and content. Overall, the recommendation explanations perform well in language clarity and logical coherence, effectively persuading users. However, there is still room for improvement in matching the reference text, suggesting the model could be enhanced in generating more reasonable and cohesive explanations.
To assess the impact of interpretability on user experience, we conducted an A/B test with 93 students. Figure 14 compares the results of recommendations with and without explanations.
According to the A/B test results, adding interpretability significantly improved user experience. The explainable recommendation system (Group A) outperformed in explanation acceptance, need fulfillment, recommendation accuracy, and system trust. Especially in explanation acceptance and system trust, users’ understanding and confidence in the recommendations were significantly enhanced, although the improvement in course click-through rate was limited.

5. Conclusions

This study proposes an innovative personalized course recommendation system that addresses information overload in online education through the integration of multimodal deep learning and knowledge graphs. By effectively combining multimodal data fusion with structured knowledge representation, the system achieves significant improvements in both recommendation accuracy and interpretability.
The research presents three key innovations: a novel GEMRec framework for multimodal feature integration that dynamically combines textual, visual, and audio information; a graph edit distance approach for measuring structural similarities between learner knowledge states and course content; and a graph-semantic-enhanced interpretability framework that generates reliable recommendation explanations through the combination of SHAP values and large language models.
Experiments on the MOOCCubeX dataset demonstrate the system’s superiority, achieving Precision@10, Recall@10, and NDCG@10 scores of 0.267, 0.265, and 0.297, respectively, significantly outperforming existing methods, particularly in cold-start scenarios. The system also demonstrates strong interpretability, generating meaningful explanations that enhance user trust and engagement.
While the computational complexity of multimodal processing and knowledge graph operations presents challenges for large-scale deployment, future research directions include the exploration of optimization techniques, the incorporation of dynamic learning behaviors, and the adaptation of the framework for diverse educational contexts. These developments aim to advance the field of personalized learning systems, working toward more intelligent and adaptive educational platforms that can provide real-time learning path optimization and sophisticated educational feedback.

Author Contributions

Conceptualization, E.W.; Software, E.W.; Validation, E.W.; Resources, Z.S.; Writing—original draft, E.W.; Writing—review & editing, E.W. and Z.S.; Visualization, E.W.; Supervision, Z.S.; Project administration, Z.S.; Funding acquisition, Z.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 61972208 and Grant No. 62272239) and the Jiangsu Agriculture Science and Technology Innovation Fund (JASTIF) (Grant No. CX(22)1007).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in MOOCCubeX at https://github.com/THU-KEG/MOOCCubeX (accessed on 2 October 2024) as described in ref. [27].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Andika, H.G.; Hadinata, M.T.; Huang, W.; Anderies, A.; Iswanto, I.A.I. Systematic Literature Review: Comparison on Collaborative Filtering Algorithms for Recommendation Systems. In Proceedings of the 2022 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT), Solo, Indonesia, 3–5 November 2022; pp. 56–61. [Google Scholar]
  2. Javed, U.; Shaukat, K.; Hameed, I.; Iqbal, F.; Alam, T.M.; Luo, S. A Review of Content-Based and Context-Based Recommendation Systems. Int. J. Emerg. Technol. Learn. 2021, 16, 274–306. [Google Scholar] [CrossRef]
  3. Wahyudi, K.; Latupapua, J.; Chandra, R.; Girsang, A.S. Hotel Content-Based Recommendation System. J. Phys. Conf. Ser. 2020, 1485, 012017. [Google Scholar] [CrossRef]
  4. Parthasarathy, G.; Shanmugam, S.D. Hybrid Recommendation System Based on Collaborative and Content-Based Filtering. Cybern. Syst. 2022, 54, 432–453. [Google Scholar] [CrossRef]
  5. Tolety, V.B.P.; Prasad, E.V. Hybrid Content and Collaborative Filtering Based Recommendation System for E-Learning Platforms. Bull. Electr. Eng. Inform. 2022, 11, 1543–1549. [Google Scholar] [CrossRef]
  6. Sharma, S.; Rana, V.; Malhotra, M. Automatic Recommendation System Based on Hybrid Filtering Algorithm. Educ. Inf. Technol. 2021, 27, 1523–1538. [Google Scholar] [CrossRef]
  7. Liu, H. Implementation and Effectiveness Evaluation of Four Common Algorithms of Recommendation Systems—User Collaboration Filter, Item-based Collaborative Filtering, Matrix Factorization and Neural Collaborative Filtering. In Proceedings of the 2022 International Conference on Cloud Computing, Big Data Applications and Software Engineering (CBASE), Suzhou, China, 23–25 September 2022; pp. 224–227. [Google Scholar] [CrossRef]
  8. Hekmatfar, T.; Haratizadeh, S.; Razban, P.; Goliaei, S. Attention-Based Recommendation on Graphs. arXiv 2022. [Google Scholar] [CrossRef]
  9. Zhao, J.; Zhao, P.; Zhao, L.; Liu, Y.; Sheng, V.; Zhou, X. Variational Self-attention Network for Sequential Recommendation. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 1559–1570. [Google Scholar] [CrossRef]
  10. Xu, C.; Feng, J.; Zhao, P.; Zhuang, F.; Wang, D.; Liu, Y.; Sheng, V. Long- and Short-Term Self-Attention Network for Sequential Recommendation. Neurocomputing 2021, 423, 580–589. [Google Scholar] [CrossRef]
  11. Li, J.; Wang, Y.; McAuley, J. Time Interval Aware Self-Attention for Sequential Recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020. [Google Scholar] [CrossRef]
  12. Zhang, L.; Zhou, X.; Shen, Z. Multimodal Pre-training Framework for Sequential Recommendation via Contrastive Learning. arXiv 2023. [Google Scholar] [CrossRef]
  13. Malitesta, D.; Gassi, G.; Pomo, C.; Noia, T.D. Ducho: A Unified Framework for the Extraction of Multimodal Features in Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar] [CrossRef]
  14. Liu, F.; Cheng, Z.; Chen, H.-M.; Liu, A.-A.; Nie, L.; Kankanhalli, M.S. Disentangled Multimodal Representation Learning for Recommendation. arXiv 2022. [Google Scholar] [CrossRef]
  15. Wang, X.; Chen, H.; Zhu, W. Multimodal Disentangled Representation for Recommendation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  16. Yang, W.; Fang, Z.; Zhang, T.; Wu, S.; Lu, C. Modal-Aware Bias Constrained Contrastive Learning for Multimodal Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar] [CrossRef]
  17. Mu, Y.; Wu, Y. Multimodal Movie Recommendation System Using Deep Learning. Mathematics 2023, 11, 895. [Google Scholar] [CrossRef]
  18. Yang, Y.; Huang, C.; Xia, L.; Li, C. Knowledge Graph Contrastive Learning for Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022. [Google Scholar] [CrossRef]
  19. Huang, Y.; Zhao, F.; Gui, X.; Jin, H. Path-Enhanced Explainable Recommendation with Knowledge Graphs. World Wide Web 2021, 24, 1769–1789. [Google Scholar] [CrossRef]
  20. Liu, X.; Song, R.; Wang, Y.; Xu, H. A Multi-Granular Aggregation-Enhanced Knowledge Graph Representation for Recommendation. Information 2022, 13, 229. [Google Scholar] [CrossRef]
  21. Hou, S.; Wei, D. Research on Knowledge Graph-Based Recommender Systems. In Proceedings of the 2023 3rd International Symposium on Computer Technology and Information Science (ISCTIS), Chengdu, China, 7–9 July 2023; pp. 737–742. [Google Scholar] [CrossRef]
  22. Sun, J.; Shagar, M.M.B. MUKG: Unifying Multi-Task and Knowledge Graph Method for Recommender System. In Proceedings of the 2020 2nd International Conference on Image Processing and Machine Vision, Bangkok, Thailand, 5–7 August 2020. [Google Scholar] [CrossRef]
  23. Chen, Y.; Yang, M.; Zhang, Y.; Zhao, M.; Meng, Z.; Hao, J.; King, I. Modeling Scale-Free Graphs for Knowledge-Aware Recommendation. arXiv 2021. [Google Scholar] [CrossRef]
  24. Guo, Q.; Zhuang, F.; Qin, C.; Zhu, H.; Xie, X.; Xiong, H.; He, Q. A Survey on Knowledge Graph-Based Recommender Systems. IEEE Trans. Knowl. Data Eng. 2020, 34, 3549–3568. [Google Scholar] [CrossRef]
  25. Yang, Z.; Dong, S. HAGERec: Hierarchical Attention Graph Convolutional Network Incorporating Knowledge Graph for Explainable Recommendation. Knowl. Based Syst. 2020, 204, 106194. [Google Scholar] [CrossRef]
  26. He, X.; Ke, X. Research Summary of Recommendation System Based on Knowledge Graph. In Proceedings of the 2021 3rd International Conference on Big Data Engineering, Shanghai, China, 26–28 May 2021. [Google Scholar] [CrossRef]
  27. Fan, Y.; Liu, S.; Li, W.; Zhao, Y.; Zhang, M.; Tang, J. MOOCCubeX: A Large-Scale Data Repository for NLP Applications in MOOCs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 15 November 2021. [Google Scholar]
28. Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Item-based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 285–295. [Google Scholar]
  29. Aggarwal, C.C. Content-based Recommender Systems. In Recommender Systems: The Textbook; Springer: Berlin/Heidelberg, Germany, 2016; pp. 139–166. [Google Scholar]
  30. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction. arXiv 2017, arXiv:1703.04247. [Google Scholar]
  31. He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.-S. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 173–182. [Google Scholar]
  32. Wang, X.; He, X.; Cao, Y.; Liu, M.; Chua, T.-S. KGAT: Knowledge Graph Attention Network for Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 950–958. [Google Scholar]
  33. Wang, H.; Zhao, M.; Xie, X.; Li, W.; Guo, M. Knowledge Graph Convolutional Networks for Recommender Systems. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 3307–3313. [Google Scholar]
  34. Li, K.; Zhang, J.; Liu, Y.; Zhang, H. A Multimodal Graph Recommendation Method Based on Cross-Attention Fusion. Mathematics 2024, 12, 2353. [Google Scholar] [CrossRef]
Figure 1. GEMRec algorithm structure diagram.
Figure 2. Multimodal knowledge graph entity relationship construction process.
Figure 3. Workflow of recommendation method based on graph similarity.
Figure 4. Workflow of interpretable methods for graph-semantic enhancement.
Figure 5. Composition of MOOCCubeX dataset.
Figure 6. Comparison diagram of modal fusion effect.
Figure 7. Entity relationship extraction results for computer courses (partially visualized).
Figure 8. Recommendation system entity relationship extraction results (partially visualized).
Figure 9. Analysis of approximate accuracy of graph edit distance.
Figure 10. Course similarity heatmap.
Figure 11. Training process log.
Figure 12. Performance comparison of various methods under cold-start conditions.
Figure 13. SHAP-based global feature importance.
Figure 14. Comparison of user experience with and without explanations.
Table 1. Comparison of effects of different methods.

Method                Precision@10   Recall@10   NDCG@10
User-based CF [28]        0.142         0.156      0.183
Item-based CF [28]        0.158         0.142      0.201
CBR [29]                  0.173         0.189      0.215
DeepFM [30]               0.231         0.228      0.242
NCF [31]                  0.213         0.229      0.236
KGAT [32]                 0.265         0.257      0.281
KGCN [33]                 0.261         0.259      0.288
CAmgr [34]                0.264         0.262      0.293
GEMRec *                  0.267         0.265      0.297
Table 2. Quality assessment of recommendation explanations.

Evaluation Method   Metric           Score
Automated           Perplexity       15.3
                    BLEU-4           0.42
Manual              Readability      4.2
                    Relevance        4.3
                    Persuasiveness   4.1