Article

UPGCN: User Perception-Guided Graph Convolutional Network for Multimodal Recommendation

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266000, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10187; https://doi.org/10.3390/app142210187
Submission received: 10 October 2024 / Revised: 4 November 2024 / Accepted: 5 November 2024 / Published: 6 November 2024
(This article belongs to the Special Issue AI-Supported Decision Making and Recommender Systems)

Abstract
To tackle the challenges of cold start and data sparsity in recommendation systems, an increasing number of researchers are integrating item features, resulting in the emergence of multimodal recommendation systems. Although graph convolutional network-based approaches have achieved significant success, they still face two limitations: (1) Users have different preferences for various types of features, but existing methods often treat these preferences equally or fail to specifically address this issue. (2) They do not effectively distinguish the similarity between different modality item features, overlook the unique characteristics of each type, and fail to fully exploit their complementarity. To solve these issues, we propose the user perception-guided graph convolutional network for multimodal recommendation (UPGCN). This model consists of two main parts: the user perception-guided representation enhancement module (UPEM) and the multimodal two-step enhanced fusion method, which are designed to capture user preferences for different modalities to enhance user representation. At the same time, by distinguishing the similarity between different modalities, the model filters out noise and fully leverages their complementarity to achieve more accurate item representations. We performed comprehensive experiments on the proposed model, and the results indicate that it outperforms other baseline models in recommendation performance, strongly demonstrating its effectiveness.

1. Introduction

The swift advancement of the internet has facilitated the accessibility of multimedia content in our daily lives. Consequently, the extensive volume of multimedia content, particularly within e-commerce enterprises, has led to the problem of information overload. Recommendation systems are essential for efficiently using the vast array of multimedia content in e-commerce, assisting consumers in identifying products or services of interest among millions of offerings. Collaborative filtering (CF) is a traditional method in recommendation systems that generates personalized suggestions by analyzing the similarities in user behaviors. It has been thoroughly advanced in the domain of recommendations. For instance, Bayesian Personalized Ranking (BPR) [1], a widely recognized collaborative filtering technique that enhances user representation and item representation through matrix factorization, utilizes a BPR loss for optimization. The method of neural graph collaborative filtering (NGCF) [2] integrates user–item interactions into the embedding process, connecting users and items through higher-order connectivity, and utilizes graph convolutional networks (GCNs) to extract valuable information from each layer. MM-Rec [3] combines textual and visual information from news articles to learn multimodal news representations. LightGCN [4] enhances the information aggregation technique of the original GCN, simplifying the model while augmenting recommendation performance.
Collaborative filtering’s advantage rests in its independence from prior knowledge regarding item or user attributes, as it utilizes user behavior information to generate recommendations. However, it has other limitations and challenges, such as the cold start problem and data sparsity. As e-commerce continues to advance rapidly, the accessibility of multimodal item features (including images, text, and videos) has surged. Numerous researchers have integrated these multimodal features into recommendation methods as supplementary information for items, effectively addressing the cold start issue and data sparsity. Consequently, multimodal recommendation methods that incorporate multimodal features into conventional collaborative filtering methods have arisen. Early multimodal recommendation methods [4,5] regarded multimodal features as supplementary information for items, with the objective of mining users’ preferences for those items.
For example, Visual Bayesian Personalized Ranking (VBPR) [5] employs a convolutional neural network (CNN) to derive visual features from images, subsequently integrating these features with item ID embeddings to create a complete item representation. The extensive utilization of graph neural networks (GNNs) in recommendation methods [6,7,8,9,10,11,12] has led to the adoption of GNNs in multimodal recommendation methods to improve the performance of multimodal recommendation. For instance, MMGCN [13] employs three distinct modalities of micro-video information (visual, acoustic, and textual) to construct user–item interaction graphs across various modalities. Graph convolutional networks are then utilized for message passing to derive embedding representations of users and items in these modalities, which are then concatenated to produce the final user preference prediction. GRCN [14] dynamically modifies and enhances the configuration of the user–item interaction graph according to the model training status, thereby removing extraneous edges and acquiring more pertinent information regarding user preferences. EgoGCN [15] uses edgewise modulation (EGO) fusion to adaptively exchange multimodal information along graph edges, addressing the insufficient fusion and noisy information of existing methods. It differs from other graph fusion methods in that it improves cross-modal information transfer by modulating the edge information of local modalities, which preserves intra-modal processing while still allowing information to flow between multiple modalities.
DRAGON [16] learns dual representations of users and items by constructing homogeneous graphs for multimodal recommendation, thereby enhancing the binary user–item relationship. MARIO [17] predicts users’ preferences by considering the individual impact of each modality on each interaction, while obtaining item embeddings that preserve the inherent modality-specific properties. MKGCN [18] builds three aggregators: a multimodal aggregator that aggregates the multimodal features of each item into a multimodal knowledge graph, and a user aggregator and an item aggregator that apply graph convolutional networks over multihop neighboring nodes of the multimodal knowledge graph to model user preferences and high-order item representations, respectively; the recommendation is then made using the aggregated embedding representations. PMGT [19] is a pre-trained model that leverages fused multimodal features and interactions; it is pre-trained with two graph reconstruction tasks, graph structure reconstruction and masked node feature reconstruction, as its learning objectives. TMKG [20] uses multitask learning to integrate, end to end, the side information of both the trust graph and the knowledge graph in order to capture implicit relationships in a more fine-grained manner. DMRL [21] captures the attention users pay to each factor under different modalities in user preference modeling and adopts a disentangled representation technique to ensure that the features of different factors in each modality are independent of each other. MEGCF [7] captures the semantic relevance between user–item interactions and richer multimodal features by constructing a symmetric linear graph convolutional network, thereby synchronously capturing higher-order multimodal semantic relevance and collaborative signals. LUDP [22] constructs modality similarity graphs from modal information to mine potential relationships between items with similar patterns and learns deeper user preferences from the characteristics of the different modalities of the items the user has interacted with in the past.
Furthermore, certain studies construct auxiliary graphs based on user or item relationships. For instance, DualGNN [23] develops a user co-occurrence graph to extract more precise representations of users by leveraging the interrelations across users. LATTICE [24] constructs item–item relation graphs for each modality utilizing multimodal features, subsequently fusing these graphs to derive a latent item–item graph, resulting in a more holistic latent item representation and augmenting recommendation performance. FREEDOM [25] is based on LATTICE, where the latent item–item graph updates are frozen, and a degree-sensitive edge pruning method is applied to reduce the noise in the interaction graph between users and items. Other approaches employed attention mechanisms to extract user preferences for items’ multimodal features, hence enhancing recommendation performance. MGAT [26], built upon the MMGCN framework, uses the standard graph convolutional network (GCN) for aggregation and combines the aggregated results. To improve information flow across different modalities, it introduced a new gated attention mechanism.
Despite the notable achievements of current multimodal recommendation methods, several limitations persist. First, some approaches use multimodal item features directly as supplementary information for evaluating user preferences [5,14,23], even though the influence of these features on user preferences is variable. To address this issue, some methods use multimodal features as node representations. For instance, MMGCN [13] builds a user–item interaction graph for each modality to derive modality-specific representations of users and items, and then merges the GCN outputs from the different modalities to obtain the final representations. Although such methods nominally consider a user’s individual interests across several modalities, they merely run GCNs on parallel interaction graphs and process all neighboring information uniformly. As a result, they fail to adaptively capture user modality preferences, ignore the mutual relationships between modalities, and neglect the similarity and complementarity among item features of different modalities when fusing the embedding vectors.
In this paper, we present a user perception-guided graph convolutional network for multimodal recommendation, referred to as UPGCN. Specifically, in the UPGCN model, we first use the multimodal features of items as input and perform graph convolution on the user–item interaction graph to extract user representations under different modalities. Then, we further use the output user representations and the user–item interaction graph to mine the item representations. To accurately assess the potential impact of item features across various modalities on user preferences, we developed a user perception-guided representation enhancement module (UPEM), which utilizes user representations with modal preferences as guiding signals to augment the user ID embedding. Additionally, to effectively utilize the similarity and complementarity between different modalities of the items’ features, a two-step enhanced fusion method was designed, in which the original modalities of item features and the modalities of item features with user modal preferences were used as enhanced signals to enhance the item ID embedding. Finally, the enhanced user ID embedding and item ID embedding are learned through LightGCN to obtain the final user preference prediction. In addition, we chose graph convolutional networks (GCNs) as the basis of our method because of their effectiveness in handling graph-structured data, especially in modeling user–item interactions in recommendation systems. GCNs can capture complex user–item relationships by aggregating information from neighboring nodes, thereby improving the accuracy of recommendations. The main contributions of our work can be outlined as follows:
  • A user perception-guided graph convolutional network for multimodal recommendation is proposed, which combines the multimodal features of items into the GCN model and utilizes the user–item interaction graph and GCN to extract the modal preferences of users better. It obtains a more accurate user representation and effectively fuses the item representation in different modalities.
  • A user perception-guided representation enhancement module (UPEM) is developed, which first utilizes the different modal features of the item as inputs, performs graph convolution on the user–item interaction graph to obtain user representations in different modalities, and then utilizes the user representation as a guidance signal to augment the embedding of user ID, thus obtaining a more accurate user representation.
  • A multimodal two-step enhanced fusion method is proposed, which extracts useful information from item features with user modality preference and the original item’s multimodal features for enhancing item representation.
  • We performed comprehensive experiments on three public datasets to demonstrate the effectiveness of our proposed model.
The rest of this paper is organized as follows. We introduce related work in Section 2. Section 3 presents a model of the user perception-guided graph convolutional network for multimodal recommendation. In Section 4, we provide a detailed explanation of the experimental design, including the datasets used, evaluation metrics, and parameter settings, along with comparative experiments and results analysis against baseline methods. Additionally, ablation experiments and their results analysis, as well as experiments related to hyperparameters, are also conducted. Section 5 summarizes the model and method we have proposed and points out the future research goals.

2. Related Work

2.1. Deep Learning-Based Multimodal Models

With the popularity of deep learning, many multimodal recommendation approaches based on convolutional neural networks (CNNs) have been developed. For instance, DeepCoNN [27] introduces a novel deep learning architecture that leverages user and item reviews to improve recommendation accuracy. It utilizes two parallel CNNs, one for users and one for items, to extract features from review text, which are then combined through a shared layer to model user–item interactions. This approach addresses the data sparsity issue common in CF by incorporating rich semantic information from reviews. DVBPR [28] utilizes a Siamese CNN framework to extract visual features of items from their images, and these features serve as item representations for user preference prediction. The item representation generated by the CNN architecture can be trained end-to-end with the recommendation system. MRLM [29] addresses the challenges of heterogeneous and multimodal item descriptions in the Internet of Things (IoT) environment. It integrates global feature representation learning (GFRL) and multimodal feature representation learning (MFRL) to jointly capture user–item interactions and multimodal features. Additionally, to capture users’ different preferences for multimodal features, some multimodal recommendation models also use attention mechanisms. For example, VECF [30] addresses the problem that different users may have different focus points on the same item by using the VGG model to pre-segment the item image and employing attention mechanisms to capture the users’ attention on different image regions, which are used to enhance item representation in the image modality. ACF [31] incorporates component-level and item-level attention mechanisms into traditional collaborative filtering models to allocate attentive weights for deducing the underlying user preferences reflected in implicit feedback. Deep learning-based models achieve excellent recommendation performance, but most of them focus on the modal information of items and neglect the rich interaction information between users and items.

2.2. Graph-Based Multimodal Models

Graph neural networks (GNNs) have proven effective in learning representations for graph data across various domains [32,33,34]. The core concept of GNNs is to update the representation of the current node by aggregating the representations of its neighboring nodes during the propagation process. Recently, more and more researchers have introduced GNNs into recommendation systems. In collaborative filtering recommendations, there are only two entities: users and items. The relationships between them can be represented by an interaction graph, allowing for the application of GNNs in this recommendation context. Specifically, graph-based approaches represent users and items by transforming user interaction histories into an interaction graph between users and items.
In order to better utilize multimodal information and improve recommendation performance, multimodal recommendation methods have emerged. MMGCN [13] constructs user–item interaction graphs across multiple modalities, then employs a GNN to combine information from neighboring nodes, fusing the user and item representations across multiple modalities, thereby enhancing the understanding of user preferences. Based on MMGCN, DualGNN [23] obtains user representations under different modalities through a GNN and then establishes a user co-occurrence graph to capture users’ modal preferences across different modalities. By identifying false-positive edges and noisy edges, GRCN [14] has created a graph refining layer that further refines the interaction graph between users and items, therefore preventing the propagation and aggregation of noise in the GNN. However, the aforementioned methods overlook the relationships between items. To more accurately obtain the relationship between items, LATTICE [24] establishes item–item graphs for different modalities and then integrates them to obtain a latent item–item graph. Additionally, during the backpropagation process to optimize parameters, the item–item relationship graph is updated. MICRO [35] is an extension of LATTICE that proposes a contrastive modality fusion method to solve the inconsistency problem in multimodal information fusion by combining contrastive learning and latent structure mining. FREEDOM [25] finds that learning the latent structure is inefficient, so it freezes the updates of the item–item graphs across various modalities, allowing for better item semantic representation and improving the algorithm’s efficiency. BM3 [36] utilizes self-supervised learning methods to solve the negative sample sampling problem and the complexity problem of large-scale graph computation in multimodal recommendation. The model generates a contrastive view by using a simple dropout data augmentation technique, replacing the existing graph structure enhancement methods and thus reducing the computational and memory overhead. Following this line of research, we utilize graph convolutional networks and multimodal features to initially obtain the users’ modality preferences and then adaptively learn user representations based on these preferences to achieve more accurate user preferences and improve recommendation performance.

2.3. Multimodal Fusion

For multimodal recommendation systems, finding a multimodal fusion method that excludes modality similarity and can achieve information complementarity between modalities is crucial for enhancing recommendation performance. Multimodal fusion methods mainly include early fusion, hybrid fusion, and late fusion [37]. Early fusion refers to the process of merging the modal features extracted at the beginning and then inputting this combined feature representation into the model for processing. On the contrary, late fusion involves merging after the model has made decisions for each modality. Hybrid fusion combines these two methods. For example, UVCAN [38] uses an early fusion method based on attention mechanisms. In the late fusion approach, SLMRec [39] presents an innovative self-supervised learning framework for multimedia recommendation, leveraging graph neural networks alongside three novel self-supervised tasks: feature dropping, feature masking, and fine- and coarse-grained feature learning. By implementing these tasks, SLMRec generates multiple views that facilitate contrastive learning, thereby uncovering implicit relationships across different modalities and performing late fusion on the generated item views. CELFT [40] has developed a hybrid fusion method where fusion is performed both before and after modal features are encoded, combining early and late fusion together. However, these fusion methods only concatenate or add the item representations obtained from different modalities, failing to effectively distinguish the similarities among modal features and fully leverage their complementarity. Therefore, we propose a two-step enhanced fusion method that takes a strategy of first enhancing and then fusing item representations to effectively distinguish the similarity between multimodal features and fully utilize their complementarity.

3. Methods

In this section, we first provide some descriptive information about the proposed model and then introduce the model we propose, which consists of three main aspects: (a) the user perception-guided representation enhancement module, (b) multimodal two-step enhanced fusion method, and (c) the loss function used.

3.1. Relevant Description

3.1.1. User–Item Interaction Graph

In our research, we have a user set $U = \{u_1, u_2, \ldots, u_X\}$ and an item set $I = \{i_1, i_2, \ldots, i_Y\}$, where $X$ is the number of users and $Y$ is the number of items. We formalize the user–item interaction graph through the following formula:
$G = \left\{ (u, y_{ui}, i) \mid u \in U,\ i \in I \right\}$
where $U$ denotes the user set and $I$ the item set, while $y_{ui}$ signifies the interaction between a user and an item. The equation defining $y_{ui}$ is presented below:
$y_{ui} = \begin{cases} 1, & \text{if } u \text{ interacted with } i; \\ 0, & \text{otherwise}. \end{cases}$
When a user interacts (such as purchasing, liking, or browsing) with an item, $y_{ui} = 1$; otherwise, $y_{ui} = 0$. Following the mainstream approach [13,24], we also embed the users and items in the interaction graph as ID embeddings $e_u, e_i \in \mathbb{R}^d$, where $e_u$ denotes the user ID embedding, $e_i$ denotes the item ID embedding, and $d$ denotes the dimension of the embedding vector.
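As a concrete illustration of this setup, the following sketch (PyTorch, with made-up toy interactions; tensor names are ours, not from any released code) builds the sparse interaction matrix from observed $(u, i)$ pairs and initializes the ID embeddings $e_u, e_i \in \mathbb{R}^d$:

```python
import torch
import torch.nn as nn

# Toy interaction data: observed (user, item) pairs with y_ui = 1 (hypothetical example).
num_users, num_items, d = 4, 5, 64
interactions = torch.tensor([[0, 1], [0, 3], [1, 0], [2, 4], [3, 2]])  # each row is one (u, i) pair

# Sparse interaction matrix R with R[u, i] = y_ui.
values = torch.ones(interactions.size(0))
R = torch.sparse_coo_tensor(interactions.t(), values, (num_users, num_items))

# ID embeddings e_u, e_i in R^d, Xavier-initialized as described in Section 4.4.
user_emb = nn.Embedding(num_users, d)
item_emb = nn.Embedding(num_items, d)
nn.init.xavier_uniform_(user_emb.weight)
nn.init.xavier_uniform_(item_emb.weight)
```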

3.1.2. Item–Item Graph

Following the approach outlined in [24], we utilize the k-nearest neighbors (kNN) method to construct an item–item graph $S^m$ for modality $m$ based on the original modal features. Initially, we assess the similarity between items by calculating the similarity of their respective original features, which results in the formation of a similarity matrix $S^m \in \mathbb{R}^{N \times N}$ representing the item–item graph. The cosine similarity function is employed to compute the values within the similarity matrix $S^m$:
$S^m_{ab} = \dfrac{(x^m_a)^{\top} x^m_b}{\lVert x^m_a \rVert \, \lVert x^m_b \rVert}$
where $S^m_{ab}$ refers to the similarity between items $a$ and $b$ in modality $m$, $x^m_a$ denotes the original features of item $a$ in modality $m$, and $x^m_b$ denotes the original features of item $b$ in modality $m$. To obtain more precise node representations and more relevant neighboring features when executing the graph convolution procedure, we use the kNN sparsification method [41] to sparsify the original adjacency matrix $S^m$, which means that for each item, only the top-$k$ edges are retained:
$\hat{S}^m_{ab} = \begin{cases} 1, & S^m_{ab} \in \text{top-}k\left(S^m_{a}\right), \\ 0, & \text{otherwise}. \end{cases}$
where $\hat{S}^m_{ab}$ is the corresponding element of the resulting sparse matrix $\hat{S}^m$. Different from [24], we do not use the similarity between items as the element value of the sparse matrix. Instead, we follow the approach of [25] and set the top-$k$ highest-correlated edges to 1, while setting the remaining lower-correlated or unrelated edges to 0. Furthermore, we normalize the discretized matrix $\hat{S}^m$ to avoid the gradient explosion problem:
$\tilde{S}^m = (D^m)^{-\frac{1}{2}} \hat{S}^m (D^m)^{-\frac{1}{2}}$
where $D^m \in \mathbb{R}^{N \times N}$ is the diagonal degree matrix of $\hat{S}^m$ and $D^m_{aa} = \sum_{b} \hat{S}^m_{ab}$. At this point, we have obtained the normalized item–item graph $\tilde{S}^m$ for modality $m$. The next step is to aggregate the item–item graphs across the various modalities to obtain the latent item–item graph. Following [25], we also freeze the updates of the latent item–item graph. To be specific, we have the modality set $M = \{v, t\}$, where $v$ and $t$, respectively, denote the visual modality and the textual modality. The latent item–item graph can be derived from the normalized item–item graphs of the different modalities using the following formula:
$S = \alpha_v \tilde{S}^v + \alpha_t \tilde{S}^t$
where $S \in \mathbb{R}^{N \times N}$, and $\alpha_v$ and $\alpha_t$ are the weight coefficients for the visual modality and the textual modality, respectively. Additionally, we set $\alpha_t = 1 - \alpha_v$ and, based on the findings of [25], we empirically set $\alpha_v$ to $0.1$.
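The graph construction above can be sketched as follows. This is a simplified dense PyTorch version under our own naming (a real implementation for large $N$ would typically use sparse tensors), with the feature dimensions matching the pre-extracted features described in Section 4.1:

```python
import torch
import torch.nn.functional as F

def build_item_graph(features: torch.Tensor, k: int) -> torch.Tensor:
    """kNN item-item graph for one modality: cosine similarity -> binarized top-k -> D^{-1/2} S_hat D^{-1/2}."""
    x = F.normalize(features, dim=1)                           # row-normalize so x @ x.T is cosine similarity
    sim = x @ x.t()                                            # similarity matrix S^m, shape (N, N)
    topk_idx = sim.topk(k, dim=1).indices                      # the k most similar items per row
    s_hat = torch.zeros_like(sim).scatter_(1, topk_idx, 1.0)   # keep top-k edges with weight 1, drop the rest
    deg = s_hat.sum(dim=1)                                     # degrees D^m_aa = sum_b S_hat^m_ab
    d_inv_sqrt = deg.pow(-0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0                  # guard against isolated nodes
    return d_inv_sqrt.unsqueeze(1) * s_hat * d_inv_sqrt.unsqueeze(0)  # normalized graph S~^m

# Fuse the visual and textual graphs into the frozen latent item-item graph with alpha_v = 0.1.
N = 100                                                        # toy number of items
visual_feats, text_feats = torch.randn(N, 4096), torch.randn(N, 384)  # stand-ins for the raw modal features
alpha_v = 0.1
S = alpha_v * build_item_graph(visual_feats, k=10) + (1 - alpha_v) * build_item_graph(text_feats, k=10)
```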

3.2. Overview of the Proposed Model

Figure 1 illustrates the comprehensive structure of the UPGCN model. UPGCN primarily consists of the UPEM and the multimodal two-step enhanced fusion method, which work together to achieve more precise representations of users and items. First, the item ID embedding is enhanced using the transformed multimodal features of the item as enhancing signals, yielding $\bar{e}_v$ and $\bar{e}_t$. Then, the user representations are obtained by performing graph convolution on the user–item interaction graph with the multimodal features of the item as input, yielding $h_u^v$ and $h_u^t$. Subsequently, the item representations are obtained by performing graph convolution on the same interaction graph with the user representations $h_u^v$ and $h_u^t$ as input, yielding $h_i^v$ and $h_i^t$ for the different modalities. These item representations, along with the enhanced item ID embeddings $\bar{e}_v$ and $\bar{e}_t$, are fed into the enhancement module for further processing and fusion, producing $\bar{e}_i$. Next, the item representations $h_i^v$ and $h_i^t$ from the different modalities are used as inputs for graph convolution on the item–item graph of each modality, yielding preliminary disentangled item representations $\tilde{h}_i^v$ and $\tilde{h}_i^t$ with reduced similarity. Finally, $\bar{e}_i$ and the item representations $\tilde{h}_i^v$ and $\tilde{h}_i^t$ obtained for the different modalities are fused, and the final item representation $\tilde{e}_i$ generated by the multimodal two-step enhanced fusion method is output. Furthermore, to effectively extract user preference information across different modalities, the user representations $h_u^v$ and $h_u^t$ derived from these modalities serve as guiding signals for enhancing the user ID embedding. An attention mechanism is employed to quantify the user’s focus on each modality’s information, which is then integrated into the user ID embedding for subsequent embedding learning. Building upon this framework, embedding learning is conducted on the user–item interaction graph to derive final representations of users and items that facilitate accurate predictions of user preferences.

3.3. User Perception-Guided Representation Enhancement Module

To effectively leverage the multimodal features of the item and mine user preferences across various modalities, we developed the user perception-guided representation enhancement module, which enriches user representations and enhances recommendation performance.
First of all, given that the original modal features of items vary in dimensionality, a multilayer perceptron (MLP) is employed to map these different modal features into a unified, low-dimensional space:
$h_m = x_m W_m + b_m$
where $W_m \in \mathbb{R}^{d_m \times d}$ and $b_m \in \mathbb{R}^d$ represent the learnable transformation matrix and bias in the MLP, respectively. Here, $d_m$ denotes the dimensionality of the item features in modality $m$, while $d$ corresponds to the dimension of the embedding vector. After applying the dimensionality transformation, all modal features have the same dimension as the embedding vector.
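For instance, Equation (7) amounts to one linear layer per modality; the sketch below assumes 4096-dimensional visual and 384-dimensional textual inputs (the pre-extracted features used in Section 4.1) and an embedding size of $d = 64$:

```python
import torch
import torch.nn as nn

d = 64  # embedding dimension (assumed)
proj = nn.ModuleDict({
    "v": nn.Linear(4096, d),  # visual features:  h_v = x_v W_v + b_v
    "t": nn.Linear(384, d),   # textual features: h_t = x_t W_t + b_t
})

x = {"v": torch.randn(100, 4096), "t": torch.randn(100, 384)}  # toy raw modal features for 100 items
h = {m: proj[m](x[m]) for m in ("v", "t")}                     # all modal features now lie in R^{N x d}
```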
After acquiring the modal features of the item following dimensionality transformation, we derive the user features of different modalities by aggregating the item modal features on the user–item interaction graph. Analogous to MMGCN [13], we employ the item modal features to conduct GCN on the interaction graph between users and items in order to capture the user features across different modalities. Similarly, upon acquiring the user modal features, these features can be employed as inputs to conduct GCN on the same interaction graph in order to derive the item features of different modalities. This approach effectively leverages user modality information and integrates it with item modality data, thereby facilitating the capture of user preferences across various modalities. Additionally, LightGCN [4] posits that the feature transformation and nonlinear transformation inherent in traditional GCN do not yield significant benefits for collaborative filtering tasks. Building on this premise, it streamlines the graph convolution operation for multimodal recommendation. In alignment with LightGCN [4], we eliminated the circular propagation of self-information, as well as the nonlinear transformation and feature transformation of neighboring nodes during the graph convolution operation. This approach aims to simplify the model while enhancing both inference speed and training efficiency. Consequently, the GCN operation can be defined as follows:
$h_{u,m}^{(l+1)} = \sum_{i \in \mathcal{N}_u} \dfrac{1}{\sqrt{\lvert \mathcal{N}_u \rvert}\sqrt{\lvert \mathcal{N}_i \rvert}} \, h_{i,m}^{(l)}$
$h_{i,m}^{(l+1)} = \sum_{u \in \mathcal{N}_i} \dfrac{1}{\sqrt{\lvert \mathcal{N}_u \rvert}\sqrt{\lvert \mathcal{N}_i \rvert}} \, h_{u,m}^{(l)}$
where $\mathcal{N}_u$ and $\mathcal{N}_i$ denote the neighborhoods of $u$ and $i$ within the interaction graph between users and items, respectively, and $h_{u,m}^{(l)} \in \mathbb{R}^d$ and $h_{i,m}^{(l)} \in \mathbb{R}^d$, respectively, denote the representations of users and items in modality $m$ at the preceding layer. It is crucial to emphasize that $h_{i,m}^{(0)} = h_m$. Following the acquisition of user representations across various modalities, the subsequent task involves inputting these representations into the UPEM to extract users’ preferences for different modalities of information and to incorporate this preference information into the user ID embedding. Initially, based on the user representations for each modality, the attention weights corresponding to the user’s preferences across different modalities are computed using the following formula:
$\gamma_u^m = \mathrm{softmax}\!\left(q_m^{\top} \tanh\!\left(V_m H_m^u + b_m\right)\right)$
where $q_m \in \mathbb{R}^d$ is the attention vector, and $V_m \in \mathbb{R}^{d \times d}$ and $b_m \in \mathbb{R}^d$, respectively, represent the transformation matrix and bias vector; all of these are learnable parameters within a particular modality. Consequently, the attention weights across different modalities remain independent, indicating that the parameters between modalities do not share weights. $H_m^u \in \mathbb{R}^{d \times X}$ is the user representation matrix in modality $m$, which is obtained by concatenating the $h_{u,m}^{(l+1)}$ values for each user as derived from Equation (8). Then, by multiplying with the attention weights $\gamma_u^m$, the user ID embedding $\hat{e}_u^{(0)}$ that incorporates the user’s preference information is obtained:
$\hat{e}_u^{(0)} = \sum_{m \in M} \gamma_u^m \, e_u^{(0)}$
where $M = \{v, t\}$ is the set of modalities, and $v$ and $t$ represent the visual modality and the textual modality, respectively.
Finally, $\hat{e}_u^{(0)}$ is normalized prior to executing the residual connection, and an initial user representation is obtained:
$\bar{e}_u^{(0)} = \dfrac{\hat{e}_u^{(0)}}{\lVert \hat{e}_u^{(0)} \rVert_2} + e_u^{(0)}$
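A condensed PyTorch sketch of the UPEM computation in Equations (8)–(12) is shown below. It reflects our own reading rather than the released implementation: a single propagation layer, a dense normalized adjacency matrix, and the softmax of Equation (10) taken over the user dimension of $H_m^u$; all variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UPEM(nn.Module):
    """User perception-guided enhancement of the user ID embedding (one propagation layer, dense adjacency)."""
    def __init__(self, d: int, modalities=("v", "t")):
        super().__init__()
        self.modalities = modalities
        self.V = nn.ModuleDict({m: nn.Linear(d, d) for m in modalities})                   # V_m and b_m
        self.q = nn.ParameterDict({m: nn.Parameter(torch.randn(d)) for m in modalities})   # attention vectors q_m

    def forward(self, e_u: torch.Tensor, norm_adj: torch.Tensor, h: dict) -> torch.Tensor:
        # e_u: (X, d) user ID embeddings; norm_adj: (X, Y) normalized user-item matrix; h[m]: (Y, d) item modal features.
        h_u = {m: norm_adj @ h[m] for m in self.modalities}            # Eq. (8): user representation per modality
        gamma = {m: F.softmax(torch.tanh(self.V[m](h_u[m])) @ self.q[m], dim=0)
                 for m in self.modalities}                             # Eq. (10): per-user attention weight (assumed softmax axis)
        e_hat = sum(gamma[m].unsqueeze(1) * e_u for m in self.modalities)  # Eq. (11): preference-weighted user ID embedding
        return F.normalize(e_hat, dim=1) + e_u                         # Eq. (12): L2-normalize, then residual connection
```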

3.4. Multimodal Two-Step Enhanced Fusion Method

To effectively differentiate the similarities among multimodal features while fully utilizing their complementarity, we propose the multimodal two-step enhanced fusion method. This approach employs a strategy of pre-enhancement followed by fusion, wherein the item ID embedding is initially enhanced based on modal features across different modalities to separate the similarities between these diverse modal features. The second enhanced signal is derived from Equation (9), which aims to enhance the item ID embedding by utilizing the item modality information $h_{i,m}^{(l+1)}$ after fusing the user modality information. In summary, the enhancement process involves extracting modality-specific information from both the original item modal features and the item modal features that incorporate user modality features, subsequently reflecting this information in the item ID embedding.
Specifically, after deriving the modal feature $h_m$ with dimensions matching those of the ID embedding using Equation (7), we utilize this modality feature $h_m$ as a guiding signal to enhance the initial item ID embedding $e_i^{(0)}$, thereby distinguishing similarities between modalities and emphasizing the unique characteristics inherent to each modality. The calculation equation is as follows:
$\bar{e}_m = e_i^{(0)} \odot \sigma\!\left(h_m W_{id}^{item} + b_{id}^{item}\right)$
where $W_{id}^{item} \in \mathbb{R}^{d \times d}$ and $b_{id}^{item} \in \mathbb{R}^d$ represent the transformation matrix and bias, respectively, both of which are learnable parameters, $\sigma$ denotes the sigmoid function, and $\odot$ represents the element-wise product. Guided by the modal feature $h_m$, the initial item ID embedding $e_i^{(0)}$ is enhanced to yield an improved item ID embedding $\bar{e}_m$.
Following the calculation of the user modal information $h_{u,m}$ using Equation (8), and given that we continue to utilize $h_{u,m}$ as the neighbor-node representation, the item modal information $h_{i,m}^{(l+1)}$ obtained through the GCN aggregation of neighboring nodes contains certain user preference information. To fully leverage this information, we opt to enhance $\bar{e}_m$ using the item modal information $h_{i,m}^{(l+1)}$ after the initial enhancement step. The enhancement process is analogous to the methodologies presented in Equations (10) and (11), wherein $h_{i,m}^{(l+1)}$ serves as the input for computing the attention weights $\gamma_i^m$, followed by a weighted summation of $\bar{e}_m$ to derive $\bar{e}_i$:
$\bar{e}_i = \sum_{m \in M} \gamma_i^m \, \bar{e}_m$
To further distinguish the similarities between items across different modalities and facilitate subsequent multimodal fusion, we apply graph convolution to the item–item graph across various modalities established in Section 3.1.2, thereby obtaining a more precise representation of an item within a single modality. The graph convolution executed on the item–item relationship graphs across various modalities is defined as follows:
$\tilde{h}_{i,m}^{(l+1)} = \sum_{j \in \mathcal{N}_i} \tilde{S}^m_{ij} \, \tilde{h}_{j,m}^{(l)}$
where $\mathcal{N}_i$ denotes the set of all neighboring nodes of item $i$, $\tilde{h}_{j,m}^{(l)}$ signifies the node representation of item $j$ within the graph convolutional network at layer $l$ in modality $m$, and $\tilde{h}_{j,m}^{(0)}$ is initialized with the item modality information $h_{i,m}$ derived from Equation (9).
To fully leverage the complementarity inherent in multimodal features, the subsequent step involves effectively integrating item information from various modalities to derive a unified item representation, thereby facilitating subsequent embedding learning and further uncovering user preferences.
Initially, we integrate the item representations $\tilde{h}_{i,m}$ derived from Equation (15) across different modalities. To achieve this, we employ an attention mechanism to perform self-enhancement on the item representation of each modality, thereby emphasizing the distinctive features of individual modalities and adaptively adjusting the importance of each modality to facilitate a more accurate item representation in subsequent embedding learning. The definition of self-enhancement and the fusion process for multimodal item representations is as follows:
$\varepsilon_i^m = \mathrm{softmax}\!\left(q_{i,m}^{\top} \tanh\!\left(W_{i,m} \tilde{h}_{i,m} + b_{i,m}\right)\right)$
$\tilde{h}_i = \sum_{m \in M} \varepsilon_i^m \, \tilde{h}_{i,m}$
where $W_{i,m} \in \mathbb{R}^{d \times d}$ and $b_{i,m} \in \mathbb{R}^d$ denote the transformation matrix and bias, respectively, $q_{i,m} \in \mathbb{R}^d$ is the attention vector, and $\varepsilon_i^m$ denotes the attention weight for modality $m$, which reflects the importance level of modality $m$ in the multimodal representation of item $i$. We employ Equation (17) along with this attention weight to perform a weighted fusion of the modality representations $\tilde{h}_{i,m}$ of the item. Following the acquisition of the self-enhanced multimodal item representation $\tilde{h}_i$ (Equation (17)) and the enhanced item behavior representation $\bar{e}_i$ (Equation (14)), the subsequent step involves fusing these representations to produce an initial item representation $\tilde{e}_i$ used for embedding learning. The fusion process is defined as follows:
$\tilde{e}_i = \lambda_i \bar{e}_i + (1 - \lambda_i)\, \tilde{h}_i$
where $\lambda_i$ represents the fusion weight, serving as a hyperparameter within UPGCN, and its value determines the balance between $\bar{e}_i$ and $\tilde{h}_i$ during the final fusion process.
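The two-step scheme of Equations (13)–(18) can be summarized as the following sketch, in which the attention weights of Equations (14) and (16) are assumed to be precomputed (e.g., with the same attention form as in the UPEM sketch) and all tensor names are our own:

```python
import torch

def two_step_fusion(e_i, h, h_i, gamma_i, eps_i, S_m, W_id, b_id, lam=0.45):
    """
    e_i: (Y, d) item ID embeddings; h[m]: (Y, d) projected modal features (Eq. (7));
    h_i[m]: (Y, d) item representations carrying user modal information (Eq. (9));
    gamma_i, eps_i: (Y, 2) attention weights of Eqs. (14)/(16), assumed precomputed;
    S_m[m]: (Y, Y) normalized item-item graph of modality m; W_id, b_id: gating parameters of Eq. (13).
    """
    modalities = ("v", "t")
    # Step 1: gate the item ID embedding with each modality's features (Eq. (13)),
    # then fuse the gated copies with the user-aware attention weights (Eq. (14)).
    e_bar = {m: e_i * torch.sigmoid(h[m] @ W_id + b_id) for m in modalities}
    e_bar_i = sum(gamma_i[:, k:k + 1] * e_bar[m] for k, m in enumerate(modalities))
    # Step 2: one graph convolution on each modality's item-item graph (Eq. (15)),
    # then self-enhanced fusion across modalities (Eqs. (16) and (17)).
    h_tilde = {m: S_m[m] @ h_i[m] for m in modalities}
    h_tilde_i = sum(eps_i[:, k:k + 1] * h_tilde[m] for k, m in enumerate(modalities))
    # Final fusion with the hyperparameter lambda_i (Eq. (18)).
    return lam * e_bar_i + (1 - lam) * h_tilde_i
```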

3.5. Model Prediction

Following the completion of module propagation as outlined in Section 3.3 and Section 3.4, we acquire the initial user representation $\bar{e}_u^{(0)}$ (Equation (12)) and the initial item representation $\tilde{e}_i$ (Equation (18)). Subsequently, we utilize the initial user representation $\bar{e}_u^{(0)}$ along with the initial item ID embedding $e_i^{(0)}$ to conduct graph convolution on the interaction graph between users and items, thereby obtaining the final user representation $y_u$ and the transitional item representation $e_i$. We then integrate $e_i$ and $\tilde{e}_i$ (Equation (18)), with the integration process defined as follows:
$\bar{e}_i = \dfrac{\tilde{e}_i}{\lVert \tilde{e}_i \rVert_2} + e_i$
Finally, we utilize $\bar{e}_i$ as input to apply a GCN on the latent item–item graph $S$ constructed in Section 3.1.2, employing the GCN operation defined as follows:
$\hat{e}_i^{(l+1)} = \sum_{j \in \mathcal{N}_i} S_{ij} \, \hat{e}_j^{(l)}$
where $\mathcal{N}_i$ denotes the set of all neighboring nodes of item $i$, $\hat{e}_i^{(0)}$ is initialized as $\bar{e}_i$, and the item representation output by the latent item–item graph $S$ is denoted as $\hat{e}_i$. The final item representation $y_i$ is obtained by fusing $\hat{e}_i$ with the transitional item representation $e_i$ output from the user–item interaction graph. The integration operation is defined as follows:
$y_i = \hat{e}_i + e_i$
At this point, we have obtained the final user representation $y_u$ and the final item representation $y_i$ for prediction. To generate suggestions for users, we first compute the inner product between $y_u$ and $y_i$ to evaluate the affinity between the user and the candidate items:
$s(y_u, y_i) = (y_u)^{\top} y_i$
Subsequently, the candidate items are ordered in descending fashion according to the predicted affinity scores, and the top K items are selected as recommendations for the user.
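In code, the prediction step reduces to a matrix product followed by a top-$K$ selection, as in the minimal example below (toy tensors, $K = 20$):

```python
import torch

num_users, num_items, d, K = 8, 50, 64, 20
y_u = torch.randn(num_users, d)   # final user representations
y_i = torch.randn(num_items, d)   # final item representations

scores = y_u @ y_i.t()                        # s(y_u, y_i) = y_u^T y_i for every user-item pair
top_k_items = scores.topk(K, dim=1).indices   # indices of the K highest-scoring items per user
```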

3.6. Optimization

To optimize the recommendation task parameters in UPGCN, we utilize the BPR loss [1] as the primary optimization objective, which assumes that users are more inclined to select items they have previously interacted with rather than those they have not:
$\mathcal{L}_{bpr} = \sum_{(u,i,j) \in D} -\log \sigma\!\left(y_u^{\top} y_i - y_u^{\top} y_j\right)$
where $D = \{(u, i, j) \mid (u, i) \in P, (u, j) \in N\}$ denotes the triplet instances within the training set, $u$ represents a user, $i$ is an item the user $u$ has interacted with, and $j$ is an item the user $u$ has not interacted with. The set $P$ includes positive samples, while $N$ contains negative samples, and $\sigma$ represents the sigmoid function.
To effectively preserve the modal features that have a significant influence on user preferences, we developed a multimodal BPR loss aimed at optimizing the parameters utilized in the modal feature extraction process (Equation (7)):
$\mathcal{L}_{mmbpr} = \sum_{(u,i,j) \in D} \sum_{m \in M} -\log \sigma\!\left(y_u^{\top} h_m^i - y_u^{\top} h_m^j\right)$
where $h_m^i$ denotes the modal features of the positive sample and $h_m^j$ denotes the modal features of the negative sample. Ultimately, combining the BPR loss, we derive the final loss function:
$\mathcal{L} = \mathcal{L}_{bpr} + \lambda_{reg} \mathcal{L}_{mmbpr}$
where $\lambda_{reg}$ is a hyperparameter in UPGCN that serves to weight the significance of the multimodal BPR loss in $\mathcal{L}$.
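Both loss terms can be computed on a mini-batch of $(u, i, j)$ triplets as in the sketch below, which uses the log-sigmoid form for numerical stability; the tensor names and batch layout are our own assumptions:

```python
import torch.nn.functional as F

def upgcn_loss(y_u, y_i, y_j, h_pos, h_neg, lam_reg=1e-2):
    """
    y_u, y_i, y_j: (B, d) user, positive-item, and negative-item representations for a batch of triplets.
    h_pos[m], h_neg[m]: (B, d) projected modal features of the positive/negative items in modality m.
    """
    # Standard BPR loss: prefer the interacted item i over the non-interacted item j.
    l_bpr = (-F.logsigmoid((y_u * y_i).sum(-1) - (y_u * y_j).sum(-1))).sum()
    # Multimodal BPR loss: the same pairwise preference, scored against the modal features.
    l_mm = sum((-F.logsigmoid((y_u * h_pos[m]).sum(-1) - (y_u * h_neg[m]).sum(-1))).sum()
               for m in ("v", "t"))
    return l_bpr + lam_reg * l_mm  # total loss L = L_bpr + lambda_reg * L_mmbpr
```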

4. Experiments

4.1. Datasets

We select three categories from the Amazon review dataset [42] for our experimental evaluation: (a) Baby, (b) Sports and Outdoors, and (c) Clothing, Shoes and Jewelry, abbreviated as Baby, Sports, and Clothing. These datasets provide both textual and visual modalities for items and are publicly available. The size of the datasets for different item categories varies. The statistical information for these three datasets is presented in Table 1, where data sparsity ρ is defined as follows:
$\rho = \left(1 - \dfrac{\lvert A \rvert}{\lvert U \rvert \times \lvert I \rvert}\right) \times 100\%$
where $\lvert U \rvert$, $\lvert I \rvert$, and $\lvert A \rvert$ denote the number of users, the number of items, and the number of interactions, respectively. Regarding modality features, we utilize the pre-extracted 4096-dimensional visual features and 384-dimensional text features that are available in [24].
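The sparsity formula can be evaluated directly; the numbers below are purely illustrative and are not the dataset statistics from Table 1:

```python
def sparsity(num_users: int, num_items: int, num_interactions: int) -> float:
    """Data sparsity rho = (1 - |A| / (|U| * |I|)) * 100%."""
    return (1 - num_interactions / (num_users * num_items)) * 100

# Purely illustrative counts, not the statistics reported in Table 1.
print(f"{sparsity(20_000, 7_000, 160_000):.2f}%")  # -> 99.89%
```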

4.2. Baselines

To demonstrate the effectiveness of the proposed model, we compare it with several representative baseline recommendation models in two categories. The first category consists of general collaborative filtering (CF) models, which rely solely on historical interaction relationships between users and items to generate recommendations. The second category encompasses several representative multimodal models that leverage both historical interaction information and the multimodal features of items to provide recommendations for users.
1. General CF Models:
  • BPR [1] is a classic collaborative filtering method that optimizes the latent representations of users and items within a matrix factorization (MF) framework through a BPR loss;
  • LightGCN [4] is a simplified graph convolution network that omits feature transformation and nonlinear activation layers, focusing on linear propagation and neighbor aggregation.
2. Multimodal Models:
  • VBPR [5] employs a CNN to obtain visual features from product images, subsequently integrating these visual features with the ID embedding of each item as its representation.
  • MMGCN [13] constructs the user–item interaction graphs for different modalities based on the features of those modalities and user–item interaction information. It subsequently employs graph convolution on these graphs to obtain user and item representations in different modalities. The user and item representations utilized for the final recommendation are derived by concatenating the user and item representations learned in the above steps across different modalities.
  • GRCN [14] designs a graph refining layer that further refines the interaction graph between users and items by detecting false-positive edges and noisy edges, thereby preventing the propagation and aggregation of noise in the GNN.
  • DualGNN [23] initially extracts user representations from the user–item interaction graph across various modalities, subsequently integrates these representations, and constructs an additional user co-occurrence graph.
  • LATTICE [24] constructs distinct modality-specific item–item graphs and obtains a latent item–item graph by integrating item–item graphs from all modalities.
  • SLMRec [39] is a self-supervised learning framework for multimedia recommendation, which designs three different granularity data augmentation methods to build auxiliary tasks for contrastive learning.
  • FREEDOM [25], based on the LATTICE framework, freezes the graph prior to training and employs a degree-sensitive edge pruning method to reduce noise in the interaction graph between users and items.

4.3. Evaluation Metrics

We employed the widely recognized metrics Recall@K and NDCG@K, abbreviated as R@K and N@K, to assess the recommendation effectiveness of UPGCN and baseline models. In the experiments, we present empirical results for K values of 10 and 20 on the test set.
Recall@K is a metric in recommendation systems that evaluates the proportion of actual relevant items in the top-K recommended items. It measures the model’s effectiveness in retrieving items that a user is likely to find interesting among the initial K suggestions. The formula for calculating Recall@K is presented below:
$\mathrm{Recall@}K = \dfrac{\lvert Rel_u \cap Rec_u \rvert}{\lvert Rel_u \rvert}$
where $Rel_u$ denotes the set of items that user $u$ has interacted with, and $Rec_u$ denotes the set of the top-$K$ recommended items for user $u$.
NDCG@K is a metric employed by recommendation systems to assess the quality of the top-K recommended items. The first step involves calculating the Discounted Cumulative Gain (DCG), which is based on the rankings and relevance scores of pertinent items found in the recommendations. The formula is as follows:
$DCG_u@K = \sum_{i=1}^{K} \dfrac{2^{r(i)} - 1}{\log_2(i + 1)}$
where $r(i)$ denotes the relevance score of the $i$-th recommended item: if the $i$-th recommended item is relevant to the user (i.e., the user interacted with it), $r(i)$ is 1; otherwise, it is 0. Subsequently, $IDCG_u@K$ is calculated:
$IDCG_u@K = \sum_{i=1}^{K} \dfrac{1}{\log_2(i + 1)}$
The definition of NDCG@K is as follows:
$NDCG_u@K = \dfrac{DCG_u@K}{IDCG_u@K}$
$NDCG@K = \dfrac{\sum_{u \in U_{te}} NDCG_u@K}{\lvert U_{te} \rvert}$
where $U_{te}$ is the set of all users in the test dataset, and $\lvert U_{te} \rvert$ is the number of users in the test dataset.
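A per-user reference implementation of the two metrics (following the $IDCG_u@K$ definition above, which sums over all $K$ positions) might look like this:

```python
import math

def recall_at_k(recommended: list, relevant: set, k: int) -> float:
    """Recall@K: fraction of the user's relevant items that appear in the top-K recommendations."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended: list, relevant: set, k: int) -> float:
    """NDCG@K with binary relevance r(i), using the IDCG@K definition above (a sum over all K positions)."""
    dcg = sum((2 ** (1 if item in relevant else 0) - 1) / math.log2(pos + 2)
              for pos, item in enumerate(recommended[:k]))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(k))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the user's held-out items are {3, 7}; the model ranks item 3 first and item 7 fourth.
print(recall_at_k([3, 5, 9, 7], {3, 7}, k=4), ndcg_at_k([3, 5, 9, 7], {3, 7}, k=4))
```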

4.4. Parameter Setup

For a fair comparison, the embedding size was fixed at 64 across all models, with the embedding and network parameters initialized using the Xavier method [43], while Adam [44] served as the optimizer. Following the existing methods [24,25], we fixed the number of GCN layers in the user–item interaction graph and the item–item graph at $L_{ui} = 2$ and $L_{ii} = 1$, respectively. Additionally, the optimal hyperparameters were identified through grid searches conducted on the validation data, specifically focusing on the fusion weight $\lambda_i \in \{0.35, 0.40, 0.45, 0.50, 0.55\}$ and the weight of the multimodal BPR loss $\lambda_{reg} \in \{1 \times 10^{-5}, 1 \times 10^{-4}, 1 \times 10^{-3}, 0.01, 0.1\}$. We set the early stopping and total epochs to 20 and 1000, respectively, utilizing Recall@20 on the validation data as the indicator for terminating training.

4.5. Performance Comparison with Baselines

Table 2 compares the recommendation performance of the presented methods with that of our proposed model, UPGCN. Specifically, we used FREEDOM as the baseline method for comparison, and the improvement of UPGCN over FREEDOM is reported as a percentage. The table reveals the following observations:
The UPGCN model outperforms the presented models in terms of recommendation performance on each dataset. Specifically, UPGCN improves FREEDOM by 4.31%, 3.21%, and 2.38% in terms of R@10 on Baby, Sports, and Clothing, respectively. This indicates that our proposed model is well-designed for the multimodal recommendation. Specifically, to capture users’ preference information for different modalities, we adopted a user-perception-guided representation enhancement module to obtain more accurate user representations. Additionally, by using a multimodal two-step enhanced fusion method, we separated the similarity between different modalities and highlighted the features of a single modality, thus avoiding contamination from modality noise and fully utilizing their complementarity to further enrich the item representation.
The integration of multimodal information into the recommendation system has substantially improved the model’s effectiveness in generating recommendations. For example, VBPR, which utilizes visual information of items, achieves a 15.3%, 31.1%, and 36.9% performance improvement in terms of R@20 compared to BPR on the Baby, Sports, and Clothing datasets, respectively. Furthermore, graph-based models (e.g., LightGCN) have also achieved improved recommendation performance by combining multimodal information. For example, DualGNN uses LightGCN to aggregate node information in the interaction graph between users and items and the user co-occurrence graph in each modality to obtain user and item representations, which yield better recommendation performance than LightGCN. Moreover, most graph-based models perform better than general models (e.g., BPR). However, as shown in Table 2, MMGCN, which also builds on graph convolution, has not achieved better performance than LightGCN. We believe the reason for this is that MMGCN directly fuses the user and item representations in different modalities, which summarizes all modalities’ features but ignores the important contribution of a single modality to improving performance. Meanwhile, multimodal information usually has similarities, so distinguishing these similarities to highlight the single-modality features is crucial. Therefore, we designed a two-step enhanced fusion method to distinguish the similarities among multimodal features and fully utilize their complementarity. As a result, the model we proposed, UPGCN, attains the best recommendation performance.
User modal preferences exert a substantial influence on recommendation performance. Building upon MMGCN, DualGNN constructs a user co-occurrence graph to extract user preferences across various modalities, thereby significantly enhancing the model’s performance. However, some models overlook the critical role of user preference information in a specific modality for mining more accurate user representations. LATTICE constructs item–item graphs in different modalities and a latent item–item graph to augment item representations in each modality. FREEDOM introduces improvements to LATTICE by freezing the item–item graph and denoising the user–item interaction graph. These models primarily emphasize methodological enhancements to improve item representations; however, they inadequately account for user preferences in different modalities, resulting in a failure to accurately capture user representations. Consequently, we developed UPEM in response to this issue, aiming to extract users’ preference information across various modalities to enhance their representations. As shown in Table 2, our model exhibits competitive recommendation performance, thereby validating the effectiveness of the proposed method. However, if the extraction of multimodal features from the items is insufficient, the recommendation performance of our proposed UPGCN may be weaker than that of other state-of-the-art approaches. This is because we need to use the modal features to obtain user modal representations across different modalities, thereby capturing users’ preferences for various modalities. If the initially extracted multimodal features are not accurate enough, they may affect the subsequent embedding learning. In contrast, other methods primarily use modal features as auxiliary information, which makes them less susceptible to this issue.

4.6. Ablation Studies

To thoroughly examine the impacts of various factors, we conducted ablation studies on both the modules of UPGCN and their modality features.

4.6.1. Effect of Multimodal Features

We evaluated the recommendation performance of UPGCN by feeding each modal feature into the model sequentially. Specifically, we designed the following UPGCN variants:
  • UPGCNw/o-v&t: In this variant, UPGCN eliminates the input of two modalities, thereby transforming into a general recommendation model;
  • UPGCNw/o-v: This variant denotes that UPGCN utilizes exclusively the textual modality of the items;
  • UPGCNw/o-t: This variant indicates that UPGCN has eliminated the textual modality of the item, thereby utilizing solely its visual modality.
As is shown in Table 3, in all evaluation metrics (R@10, R@20, N@10, and N@20) and datasets (Baby, Sports, and Clothing), UPGCNw/o-v, which only utilizes textual features, achieved higher recommendation performance than UPGCNw/o-t, which only relies on visual features, indicating that the text modality contains more information. Furthermore, the two variants that integrate single-modal features demonstrate superior recommendation performance compared to UPGCNw/o-v&t, which does not incorporate any modal features, thereby further substantiating the significant role of multimodal features. Additionally, the modal features derived from the texts and images of products in the Sports dataset are less informative compared to those obtained from the Baby and Clothing datasets. The evidence is that the average performance improvement of UPGCNw/o-t in the Sports dataset is 7.23%, while it is 9.71% and 21.91% in the Baby and Clothing datasets, respectively. In the Clothing dataset, characterized by the highest level of data sparsity (as is shown in Table 1), the performance improvement achieved through the integration of multimodal features is particularly significant. This further suggests that incorporating multimodal features within recommendation systems can effectively alleviate the data sparsity issue.
Finally, UPGCN utilizing multimodal features demonstrates higher recommendation performance compared to the other two variants employing single-modal features, indicating that the integration of multimodal features can significantly enhance UPGCN’s recommendation performance; thus, our proposed fusion method is essential.

4.6.2. Effect of Modules

To assess the effectiveness and robustness of the proposed modules, we decoupled UPGCN. Specifically, we separated the designed modules and methods from UPGCN to investigate their impact on recommendation performance; consequently, we constructed the following variants:
  • UPGCN-B: This variant removes the UPEM and multimodal enhanced fusion method that we designed;
  • UPGCN-E: This variant only uses UPEM to enhance the user representation, and removes the two-step enhanced fusion method;
  • UPGCN-T: This variant only uses the multimodal two-step enhanced fusion method to enhance and fuse the item representations in different modalities, removing UPEM.
The comparison results for R@20 and N@20 are presented in Figure 2. By comparing the performance of UPGCN-E and UPGCN-B, it is evident that incorporating the UPEM has resulted in significant enhancements in both Recall@20 and NDCG@20 across all datasets. This indicates that UPEM has played a positive role in capturing user preference information and improving recommendation performance. Comparing UPGCN-T with UPGCN-B shows that the multimodal enhanced fusion method has also significantly improved the model’s performance in Recall@20 and NDCG@20. In particular, the multimodal enhanced fusion method has a particularly noticeable impact on improving recommendation performance in the Clothing dataset, further indicating that the method can effectively improve recommendation results on large-scale datasets.
UPGCN outperforms all other variants across all datasets regarding Recall@20 and NDCG@20, demonstrating its superior performance. This indicates that the combination of UPEM and multimodal enhanced fusion methods can synergistically enhance model performance. Moreover, the integration of these two methods is an important component of our model, which helps to better capture user preferences and effectively distinguish the similarity between multimodal features, thereby highlighting the single-modal features while fully utilizing the complementarity of different modalities to obtain a more accurate item representation and ultimately improve recommendation performance. Furthermore, in the Baby and Clothing datasets, UPGCN-T performs slightly better than UPGCN-E, indicating that the multimodal enhanced fusion method has a relatively greater impact on these two datasets. However, in the Sports dataset, UPGCN-T performs slightly worse than UPGCN-E in terms of Recall@20 and NDCG@20. We believe that one potential reason for this is that the Sports dataset extracts less information from product text descriptions and images compared to the other two datasets, and therefore, the two-step enhanced fusion method that utilizes multimodal features does not perform well in this dataset. However, despite the performance gap between UPGCN-T and UPGCN-E in the Sports dataset, their results remain relatively close to each other, further underscoring the significance of UPEM and the robustness of UPGCN.

4.7. Hyperparameter Sensitivity Study

4.7.1. Effects of the Fusion Weight $\lambda_i$

In our multimodal enhanced fusion method, the first enhancement utilizes dimensionally transformed modal features and item modal features that contain user modal preference information as guiding signals to extract the parts of item features that users particularly focus on. Subsequently, the enhanced item representations across different modalities are fused. Before the second fusion, the modal features undergo self-enhancement to highlight the single-modal features, thus distinguishing the similarities between different modalities and removing noise. Then, using Equation (18), the two enhanced results, $\bar{e}_i$ and $\tilde{h}_i$, are fused to fully leverage the complementarity of these two outcomes. Theoretically, this linear combination helps the model make full use of both types of information, thereby improving recommendation performance. As shown in Figure 3, in both datasets Recall@20 and NDCG@20 reach their maximum values when $\lambda_i = 0.45$. This suggests that in the Baby and Sports datasets, a moderate fusion weight can find the optimal balance between the two types of information, thereby maximizing recommendation performance. However, when $\lambda_i = 0.35$ or $\lambda_i = 0.55$, both Recall@20 and NDCG@20 slightly decrease, indicating that excessively high or low weights lead to reduced recommendation performance and indirectly suggesting that the two types of extracted information positively contribute to improving recommendation performance.

4.7.2. Effects of the Weight for Multimodal BPR Loss λ r e g

In the experiments on the hyperparameter λ_reg, we systematically explored the impact of different λ_reg values on Recall@20 and NDCG@20 to study how balancing the multimodal BPR loss against the standard BPR loss affects recommendation performance. As shown in Figure 4, on the Baby and Clothing datasets, as λ_reg increases from 1 × 10−5 to 0.1, Recall@20 and NDCG@20 follow a complex nonlinear trend, reflecting the model’s sensitivity to the weight of the multimodal loss.
In the Baby dataset, Recall@20 and NDCG@20 fluctuate noticeably within a small range of λ_reg. Recall@20 is close to 0.1035 when λ_reg = 1 × 10−5, and both metrics decline slightly as λ_reg increases to 1 × 10−3, indicating that the contribution of the multimodal features has not yet been fully exploited at these weights. When λ_reg reaches 1 × 10−2, Recall@20 and NDCG@20 peak simultaneously at 0.1043 and 0.0451, respectively, showing that a moderately weighted multimodal BPR loss effectively enhances recommendation performance, especially once the user and item features of the different modalities are fully fused. When λ_reg is further increased to 0.1, performance declines again, suggesting that over-emphasizing the optimization of the feature extractors can suppress the multimodal features and prevent the model from fully learning the multimodal information that is useful for predicting user preferences.
In the Clothing dataset, Recall@20 and NDCG@20 show similar trends. At λ_reg = 1 × 10−5, Recall@20 and NDCG@20 are 0.0946 and 0.0423, respectively. As λ_reg increases to 1 × 10−4, Recall@20 rises slightly to 0.0959, its peak on this dataset; NDCG@20 also reaches its optimum of 0.0428 at λ_reg = 1 × 10−4. This indicates that a small λ_reg lets the model balance the standard BPR loss and the multimodal BPR loss, effectively utilizing multimodal features to improve recommendation accuracy. When λ_reg is further increased to 1 × 10−2 and 0.1, both metrics decrease, indicating that excessive optimization of the feature extractors is counterproductive for the utilization of multimodal features.
Overall, the experimental results show that a moderate λ_reg achieves a good balance between the standard BPR loss and the multimodal BPR loss: the model can fully exploit multimodal information to enrich the user and item representations and thus improve recommendation accuracy. When λ_reg is too large, the excessive weight of the multimodal BPR loss suppresses the multimodal features and weakens learning. The results therefore suggest setting λ_reg to a medium value so that the model integrates information across modalities effectively while avoiding the negative effects of over-optimizing the feature extractors.
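To make the role of λ_reg concrete, the sketch below shows one plausible way of combining a standard BPR loss over ID embeddings with per-modality BPR losses weighted by λ_reg. The pairwise BPR form follows [1]; the exact multimodal terms used in UPGCN may differ, and the function names are our own:

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_emb: torch.Tensor,
             pos_emb: torch.Tensor,
             neg_emb: torch.Tensor) -> torch.Tensor:
    """Pairwise BPR loss [1]: prefer each positive item over its sampled negative."""
    pos_scores = (user_emb * pos_emb).sum(dim=-1)
    neg_scores = (user_emb * neg_emb).sum(dim=-1)
    return -F.logsigmoid(pos_scores - neg_scores).mean()

def total_loss(id_triplet, modal_triplets, lambda_reg: float = 1e-2) -> torch.Tensor:
    """ID-embedding BPR loss plus lambda_reg-weighted BPR losses on each modality.

    id_triplet     -- (user, pos_item, neg_item) embeddings from the ID channel
    modal_triplets -- one such triplet per modality (e.g., visual and textual)
    lambda_reg     -- weight of the multimodal BPR term studied in Figure 4
    """
    loss = bpr_loss(*id_triplet)
    for triplet in modal_triplets:
        loss = loss + lambda_reg * bpr_loss(*triplet)
    return loss
```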

5. Conclusions

In this paper, we have proposed a user perception-guided graph convolutional network for multimodal recommendation. Specifically, we developed a user perception-guided representation enhancement module (UPEM) that takes the various modal features of the items as inputs, performs graph convolution on the user–item interaction graph to derive user representations for each modality, and then uses these representations as guidance signals to enhance the user ID embedding, yielding a more accurate user representation. Furthermore, to separate the similarity between different modalities, highlight the unique characteristics of each modality, and still exploit their complementarity, we designed a multimodal two-step enhanced fusion method that adopts a “first enhance, then fuse” strategy to obtain more accurate item representations for recommendation prediction. We conducted extensive experiments on three real-world datasets, and the results show that the proposed model outperforms the baseline models in recommendation performance. We also believe that UPGCN can be extended to other domains, such as short-video recommendation. Current collaborative filtering methods mainly rely on user–item relationships to identify items of interest, yet items typically carry multimodal features: images and reviews for products, or audio, video frames, and comments for short videos. As long as these multimodal item features are first extracted with suitable deep learning techniques, our model can take them, together with the user–item interaction history, as input and produce effective recommendations. This is one of the directions we plan to pursue in future work.
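To illustrate the user perception-guided enhancement summarized above, the following sketch shows one plausible way modality-specific user representations could guide the user ID embedding, here via a learned element-wise gate. The gating form, the class name UserPerceptionGate, and the assumption of two modalities are illustrative choices of ours; the exact formulation used in UPGCN may differ.

```python
import torch
import torch.nn as nn

class UserPerceptionGate(nn.Module):
    """Illustrative gate that injects modality-aware user preferences into the ID embedding."""

    def __init__(self, dim: int, n_modalities: int = 2):
        super().__init__()
        # One gate per modality (e.g., visual and textual); the gating form is assumed.
        self.gates = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(n_modalities)]
        )

    def forward(self, user_id_emb: torch.Tensor, user_modal_embs: list) -> torch.Tensor:
        enhanced = user_id_emb
        for gate, modal_emb in zip(self.gates, user_modal_embs):
            # How strongly this modality's preference signal should influence the user.
            g = torch.sigmoid(gate(torch.cat([enhanced, modal_emb], dim=-1)))
            enhanced = enhanced + g * modal_emb
        return enhanced
```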
While UPGCN uses multimodal features effectively, it assumes that all modalities are available and of high quality; in real-world scenarios, some modalities may be missing or noisy. Future work could address this limitation by developing methods that handle incomplete or unreliable multimodal data, for example through modality dropout or denoising mechanisms. In addition, UPGCN focuses primarily on recommendation accuracy rather than efficiency. Its graph convolutional structure combined with multimodal feature fusion can incur high computational costs, particularly at scale: the model operates on several graphs, including the user–item interaction graph, the item–item graph of each modality, and latent item–item graphs, and the graphs encountered in real-world applications are much larger, making graph convolution on them expensive. Future work could therefore optimize the architecture to improve efficiency without sacrificing recommendation quality, for example through more lightweight graph processing, pruning, or knowledge distillation techniques [45].
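For the missing- or noisy-modality limitation, the sketch below illustrates the kind of modality dropout mechanism mentioned above. It is a hypothetical example, not part of UPGCN: an entire modality's features are randomly zeroed during training so that the model learns not to depend on any single modality being present.

```python
import torch

def modality_dropout(modal_features: dict, p_drop: float = 0.2, training: bool = True) -> dict:
    """Randomly zero out an entire modality during training so that the model
    does not rely on every modality being present at inference time."""
    if not training:
        return modal_features
    dropped = {}
    for name, feat in modal_features.items():
        if torch.rand(1).item() < p_drop:
            dropped[name] = torch.zeros_like(feat)   # simulate a missing/unreliable modality
        else:
            dropped[name] = feat
    return dropped
```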

Author Contributions

Conceptualization, B.Z.; methodology, B.Z.; software, B.Z.; validation, B.Z.; formal analysis, B.Z.; investigation, B.Z.; resources, Y.L.; data curation, Y.L.; writing—original draft preparation, B.Z.; writing—review and editing, B.Z. and Y.L.; visualization, B.Z.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available in a publicly accessible repository. Dataset can be accessed at http://jmcauley.ucsd.edu/data/amazon/links.html (accessed on 4 November 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rendle, S.; Freudenthaler, C.; Gantner, Z.; Schmidt-Thieme, L. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–21 June 2009; pp. 452–461. [Google Scholar]
  2. Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.S. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174. [Google Scholar]
  3. Wu, C.; Wu, F.; Qi, T.; Zhang, C.; Huang, Y.; Xu, T. MM-rec: Visiolinguistic model empowered multimodal news recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 2560–2564. [Google Scholar]
  4. He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; pp. 639–648. [Google Scholar]
  5. He, R.; McAuley, J. VBPR: Visual bayesian personalized ranking from implicit feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  6. Sun, R.; Cao, X.; Zhao, Y.; Wan, J.; Zhou, K.; Zhang, F.; Zheng, K. Multi-modal knowledge graphs for recommender systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, New York, NY, USA, 19–23 October 2020; pp. 1405–1414. [Google Scholar]
  7. Liu, K.; Xue, F.; Guo, D.; Wu, L.; Li, S.; Hong, R. MEGCF: Multimodal entity graph collaborative filtering for personalized recommendation. ACM Trans. Inform. Syst. 2023, 41, 30. [Google Scholar] [CrossRef]
  8. Wei, Y.; Wang, X.; He, X.; Nie, L.; Rui, Y.; Chua, T.S. Hierarchical user intent graph network for multimedia recommendation. IEEE Trans. Multimed. 2021, 24, 2701–2712. [Google Scholar] [CrossRef]
  9. Cai, D.; Qian, S.; Fang, Q.; Hu, J.; Ding, W.; Xu, C. Heterogeneous graph contrastive learning network for personalized micro-video recommendation. IEEE Trans. Multimed. 2022, 25, 2761–2773. [Google Scholar] [CrossRef]
  10. Mu, Z.; Zhuang, Y.; Tan, J.; Xiao, J.; Tang, S. Learning hybrid behavior patterns for multimedia recommendation. In Proceedings of the 30th ACM International Conference on Multimedia, Seattle, WA, USA, 10–14 October 2022; pp. 376–384. [Google Scholar]
  11. Yi, Z.; Wang, X.; Ounis, I.; Macdonald, C. Multi-modal graph contrastive learning for micro-video recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1807–1811. [Google Scholar]
  12. Ye, X.; Cai, G.; Song, Y. Multi-modal Personalized Goods Recommendation based on Graph Enhanced Attention GNN. In Proceedings of the 2022 5th International Conference on Machine Learning and Machine Intelligence, Hangzhou, China, 23–25 September 2022; pp. 146–153. [Google Scholar]
  13. Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1437–1445. [Google Scholar]
  14. Wei, Y.; Wang, X.; Nie, L.; He, X.; Chua, T.S. Graph-refined convolutional network for multimedia recommendation with implicit feedback. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3541–3549. [Google Scholar]
  15. Chen, F.; Wang, J.; Wei, Y.; Zheng, H.T.; Shao, J. Breaking isolation: Multimodal graph fusion for multimedia recommendation by edge-wise modulation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 385–394. [Google Scholar]
  16. Zhou, H.; Zhou, X.; Zhang, L.; Shen, Z. Enhancing Dyadic Relations with Homogeneous Graphs for Multimodal Recommendation. arXiv 2023, arXiv:2301.12097. [Google Scholar]
  17. Kim, T.; Lee, Y.C.; Shin, K.; Kim, S.W. MARIO: Modality-aware attention and modality-preserving decoders for multimedia recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17–21 October 2022; pp. 993–1002. [Google Scholar]
  18. Cui, X.; Qu, X.; Li, D.; Yang, Y.; Li, Y.; Zhang, X. MKGCN: Multi-modal knowledge graph convolutional network for music recommender systems. Electronics 2023, 12, 2688. [Google Scholar] [CrossRef]
  19. Liu, Y.; Yang, S.; Lei, C.; Wang, G.; Tang, H.; Zhang, J.; Sun, A.; Miao, C. Pre-training graph transformer with multimodal side information for recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 2853–2861. [Google Scholar]
  20. Zhou, Y.; Guo, J.; Song, B.; Chen, C.; Chang, J.; Yu, F.R. Trust-aware multi-task knowledge graph for recommendation. IEEE Trans. Knowl. Data Eng. 2022, 35, 8658–8671. [Google Scholar] [CrossRef]
  21. Liu, F.; Chen, H.; Cheng, Z.; Liu, A.; Nie, L.; Kankanhalli, M. Disentangled multimodal representation learning for recommendation. IEEE Trans. Multimed. 2022, 25, 7149–7159. [Google Scholar] [CrossRef]
  22. Lei, F.; Cao, Z.; Yang, Y.; Ding, Y.; Zhang, C. Learning the user’s deeper preferences for multi-modal recommendation systems. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 138. [Google Scholar] [CrossRef]
  23. Wang, Q.; Wei, Y.; Yin, J.; Wu, J.; Song, X.; Nie, L. DualGNN: Dual graph neural network for multimedia recommendation. IEEE Trans. Multimed. 2021, 25, 1074–1084. [Google Scholar] [CrossRef]
  24. Zhang, J.; Zhu, Y.; Liu, Q.; Wu, S.; Wang, S.; Wang, L. Mining latent structures for multimedia recommendation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 3872–3880. [Google Scholar]
  25. Zhou, X.; Shen, Z. A tale of two graphs: Freezing and denoising graph structures for multimodal recommendation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 935–943. [Google Scholar]
  26. Tao, Z.; Wei, Y.; Wang, X.; He, X.; Huang, X.; Chua, T.S. MGAT: Multimodal graph attention network for recommendation. Inform. Process. Manag. 2020, 57, 102277. [Google Scholar] [CrossRef]
  27. Zheng, L.; Noroozi, V.; Yu, P.S. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 425–434. [Google Scholar]
  28. Kang, W.C.; Fang, C.; Wang, Z.; McAuley, J. Visually-aware fashion recommendation and design with generative image models. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 207–216. [Google Scholar]
  29. Huang, Z.; Xu, X.; Ni, J.; Zhu, H.; Wang, C. Multimodal representation learning for recommendation in Internet of Things. IEEE Internet Things J. 2019, 6, 10675–10685. [Google Scholar] [CrossRef]
  30. Chen, X.; Chen, H.; Xu, H.; Zhang, Y.; Cao, Y.; Qin, Z.; Zha, H. Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 765–774. [Google Scholar]
  31. Chen, J.; Zhang, H.; He, X.; Nie, L.; Liu, W.; Chua, T.S. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 335–344. [Google Scholar]
  32. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  33. Guo, Q.; Qiu, X.; Xue, X.; Zhang, Z. Syntax-guided text generation via graph neural network. Sci. China Inf. Sci. 2021, 64, 152102. [Google Scholar] [CrossRef]
  34. Liu, Q.; Yao, E.; Liu, C.; Zhou, X.; Li, Y.; Xu, M. M2GCN: Multi-modal graph convolutional network for modeling polypharmacy side effects. Appl. Intell. 2023, 53, 6814–6825. [Google Scholar] [CrossRef]
  35. Zhang, J.; Zhu, Y.; Liu, Q.; Zhang, M.; Wu, S.; Wang, L. Latent structure mining with contrastive modality fusion for multimedia recommendation. IEEE Trans. Knowl. Data Eng. 2022, 35, 9154–9167. [Google Scholar] [CrossRef]
  36. Zhou, X.; Zhou, H.; Liu, Y.; Zeng, Z.; Miao, C.; Wang, P.; Jiang, F. Bootstrap latent representations for multi-modal recommendation. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 845–854. [Google Scholar]
  37. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
  38. Liu, S.; Chen, Z.; Liu, H.; Hu, X. User-video co-attention network for personalized micro-video recommendation. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 3020–3026. [Google Scholar]
  39. Tao, Z.; Liu, X.; Xia, Y.; Wang, X.; Yang, L.; Huang, X.; Chua, T.S. Self-supervised learning for multimedia recommendation. IEEE Trans. Multimed. 2022, 25, 5107–5116. [Google Scholar] [CrossRef]
  40. Wang, Y.; Xu, X.; Yu, W.; Xu, R.; Cao, Z.; Shen, H.T. Combine early and late fusion together: A hybrid fusion framework for image-text matching. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 10–12 January 2021; pp. 1–6. [Google Scholar]
  41. Chen, J.; Fang, H.R.; Saad, Y. Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection. J. Mach. Learn. Res. 2009, 10, 9. [Google Scholar]
  42. McAuley, J.; Targett, C.; Shi, Q.; Van Den Hengel, A. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 43–52. [Google Scholar]
  43. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  44. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  45. Wu, X.; He, R.; Hu, Y.; Sun, Z. Learning an evolutionary embedding via massive knowledge distillation. Int. J. Comput. Vis. 2020, 128, 2089–2106. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed UPGCN model.
Figure 2. Performance comparison between different variants of UPGCN.
Figure 3. The effects of the fusion weight.
Figure 4. The effects of the weight for the multimodal BPR loss.
Table 1. Statistics of three datasets used in experiments.

Dataset ¹ | Users  | Items  | Interactions | Sparsity
Baby      | 19,445 | 7050   | 160,792      | 99.88%
Sports    | 35,598 | 18,357 | 296,337      | 99.95%
Clothing  | 39,387 | 23,033 | 278,677      | 99.97%

¹ Dataset can be accessed at http://jmcauley.ucsd.edu/data/amazon/links.html (accessed on 4 November 2024).
Table 2. Performance comparison of our UPGCN with baseline models (adapted from ref. [25]). Best results are underlined. The line below LightGCN serves as the boundary between the two types of methods mentioned in Section 4.2.

Dataset  | Model       | R@10   | R@20   | N@10   | N@20
Baby     | BPR         | 0.0357 | 0.0575 | 0.0192 | 0.0249
         | LightGCN    | 0.0479 | 0.0754 | 0.0257 | 0.0328
         | VBPR        | 0.0423 | 0.0663 | 0.0223 | 0.0284
         | MMGCN       | 0.0421 | 0.0660 | 0.0220 | 0.0282
         | GRCN        | 0.0532 | 0.0824 | 0.0282 | 0.0358
         | DualGNN     | 0.0513 | 0.0803 | 0.0278 | 0.0352
         | SLMRec      | 0.0521 | 0.0772 | 0.0289 | 0.0354
         | LATTICE     | 0.0547 | 0.0850 | 0.0292 | 0.0370
         | FREEDOM     | 0.0627 | 0.0992 | 0.0330 | 0.0424
         | UPGCN       | 0.0664 | 0.1043 | 0.0351 | 0.0449
         | Improv. (%) | 4.31   | 4.53   | 5.76   | 5.42
Sports   | BPR         | 0.0432 | 0.0653 | 0.0241 | 0.0298
         | LightGCN    | 0.0569 | 0.0864 | 0.0311 | 0.0387
         | VBPR        | 0.0558 | 0.0856 | 0.0307 | 0.0384
         | MMGCN       | 0.0401 | 0.0636 | 0.0209 | 0.0270
         | GRCN        | 0.0599 | 0.0919 | 0.0330 | 0.0413
         | DualGNN     | 0.0588 | 0.0899 | 0.0324 | 0.0404
         | SLMRec      | 0.0663 | 0.0990 | 0.0335 | 0.0421
         | LATTICE     | 0.0620 | 0.0953 | 0.0335 | 0.0421
         | FREEDOM     | 0.0717 | 0.1089 | 0.0385 | 0.0481
         | UPGCN       | 0.0740 | 0.1114 | 0.0402 | 0.0498
         | Improv. (%) | 3.21   | 2.29   | 4.41   | 3.53
Clothing | BPR         | 0.0206 | 0.0303 | 0.0114 | 0.0138
         | LightGCN    | 0.0361 | 0.0544 | 0.0197 | 0.0243
         | VBPR        | 0.0281 | 0.0415 | 0.0158 | 0.0192
         | MMGCN       | 0.0227 | 0.0361 | 0.0120 | 0.0154
         | GRCN        | 0.0421 | 0.0657 | 0.0224 | 0.0284
         | DualGNN     | 0.0452 | 0.0675 | 0.0242 | 0.0298
         | SLMRec      | 0.0442 | 0.0659 | 0.0241 | 0.0296
         | LATTICE     | 0.0492 | 0.0733 | 0.0268 | 0.0330
         | FREEDOM     | 0.0629 | 0.0941 | 0.0341 | 0.0420
         | UPGCN       | 0.0644 | 0.0959 | 0.0348 | 0.0428
         | Improv. (%) | 2.38   | 1.91   | 2.05   | 1.90
Table 3. Ablation study of UPGCN on multimodal features.

Dataset  | Variants      | R@10   | R@20   | N@10   | N@20
Baby     | UPGCN w/o-v&t | 0.0479 | 0.0754 | 0.0257 | 0.0328
         | UPGCN w/o-t   | 0.0527 | 0.0827 | 0.0282 | 0.0359
         | UPGCN w/o-v   | 0.0616 | 0.0984 | 0.0328 | 0.0422
         | UPGCN         | 0.0664 | 0.1043 | 0.0351 | 0.0449
Sports   | UPGCN w/o-v&t | 0.0569 | 0.0864 | 0.0311 | 0.0387
         | UPGCN w/o-t   | 0.0610 | 0.0931 | 0.0332 | 0.0415
         | UPGCN w/o-v   | 0.0716 | 0.1079 | 0.0391 | 0.0484
         | UPGCN         | 0.0740 | 0.1114 | 0.0402 | 0.0498
Clothing | UPGCN w/o-v&t | 0.0340 | 0.0526 | 0.0188 | 0.0236
         | UPGCN w/o-t   | 0.0430 | 0.0633 | 0.0229 | 0.0281
         | UPGCN w/o-v   | 0.0616 | 0.0927 | 0.0336 | 0.0415
         | UPGCN         | 0.0644 | 0.0959 | 0.0348 | 0.0428
