1. Introduction
The exponential increase in internet data has led to information-overload problems. Recommender systems are widely employed by many internet services (e.g., e-commerce, advertising systems, online reviews) to reduce information overload and recommend the most relevant items to users.
In recent years, the introduction of deep learning models based on graph neural networks has significantly improved recommender systems. Earlier studies such as NCF [1] and DMF [2] used multi-layer perceptrons to model nonlinear interactions. Autoencoder-based approaches [3,4] map high-dimensional, sparse user–item interaction data to dense low-dimensional embedding representations. Later studies such as NGCF [5] used graph neural networks to mine high-order interaction relationships and perform neighborhood-based feature aggregation. However, most of these approaches only model single-category user–item interactions (e.g., views or purchases) and ignore the multiple user behaviors present in real recommendation scenarios. For example, on a typical e-commerce platform, a user may interact with the same item in multiple ways, including viewing, adding to cart, favoriting, and purchasing (as shown in Figure 1). Different interaction behaviors provide different semantic information about the user–item association and may reveal different personalized preferences.
In current research, models based on graph convolutional networks (GCNs) [6] learn user and item embedding representations for recommendation tasks by aggregating feature information from multiple layers of local neighborhoods. For example, NGCF [5] exploits the higher-order connectivity of graphs to encode higher-order information, thereby alleviating the sparsity problem in recommendation tasks. Building on a simplification of the nonlinear components of GCNs, LR-GCN [7] introduces a residual network structure to optimize the model; the results show that it alleviates the over-smoothing problem and substantially improves upon NGCF in recommendation accuracy. Owing to the high computational complexity of NGCF, LightGCN [8] omits feature transformation and nonlinear activation and retains only neighborhood aggregation for collaborative filtering, improving upon the accuracy and efficiency of previous methods. However, LightGCN reaches its peak performance after stacking only a few layers (two, three, or four), and further stacking leads to performance degradation. This degradation arises because current GCN models simply aggregate all neighborhood information to update node embeddings, so nodes become indistinguishable after multiple layers of graph convolution: the over-smoothing problem.
The above-mentioned problems may be summarized as follows. First, in the user–item interaction graph, higher-order neighboring users may have differing or even opposite preferences, and simply performing graph convolution operations over all higher-order information can cause over-smoothing. Second, most methods consider only static user–item interactions; dynamic user–item interactions have received less attention. Finally, in multi-behavior recommendation tasks, treating all edges as a single type ignores the real recommendation scenario. Therefore, we propose a graph transformer collaborative filtering method for multi-behavior recommendation (GTCF4MB).
The contributions of our work are summarized as follows:
A subgraph generation module is proposed that groups users and the items they interact with into distinct subgraphs and performs higher-order graph convolution operations within each subgraph to alleviate the over-smoothing problem caused by stacking too many layers.
A temporal encoding strategy is proposed to capture the influence among different kinds of user–item interactions in dynamic scenarios and to incorporate dynamic behaviors into the overall information aggregation model. In addition, an attention mechanism that relies on node and edge types is introduced to distinguish the importance of different edge types.
Using high-order multi-hop information, nodes (users and items) and edges (behaviors) interact under different weights to achieve multi-behavior recommendation, alleviate the cold-start problem, and increase model interpretability.
3. Method
3.1. Problem Definition
User–item interaction graphs are often the focus of recommender system studies. Let $\mathbf{X} \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{V}| \times |\mathcal{K}|}$ be the behavioral interaction matrix of users and items, where $\mathcal{U}$ is the set of users, $\mathcal{V}$ is the set of items, and $\mathcal{K}$ is the set of behaviors. For a non-zero entry $x_{u,v}^{k}$, user $u$ interacts with item $v$ under behavior $k$; conversely, $x_{u,v}^{k} = 0$ when there is no interaction between user $u$ and item $v$ under that behavior. Based on the user–item interaction matrix, a behavior graph $G = (\mathcal{N}, \mathcal{E}, \mathcal{K})$ can be formed, where $\mathcal{N} = \mathcal{U} \cup \mathcal{V}$ denotes the set of nodes, $\mathcal{E}$ denotes the set of edges, and $\mathcal{K}$ denotes the set of behaviors. For a nonzero entry $x_{u,v}^{k}$, there exists an edge between user $u$ and item $v$ under behavior $k$. Using the above information as the input to the GCN model, the representations of users and items are learned by iteratively aggregating the features of neighboring nodes in the bipartite graph. Let $e_u$ denote the ID embedding of user $u$, $e_v$ the ID embedding of item $v$, and $e_k$ the ID embedding of behavior $k$.
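To make this setup concrete, the following is a minimal NumPy sketch of how the behavioral interaction tensor and the per-behavior adjacency matrices might be represented; the toy triples and variable names are illustrative, not taken from the paper.

```python
import numpy as np

# Toy multi-behavior interaction data: (user, item, behavior) triples.
# Behavior indexes: 0 = view, 1 = add-to-cart, 2 = purchase (illustrative).
num_users, num_items, num_behaviors = 4, 5, 3
triples = [(0, 1, 0), (0, 1, 2), (1, 3, 0), (2, 4, 1), (3, 0, 2)]

# Behavioral interaction tensor X: X[u, v, k] = 1 iff user u interacted
# with item v under behavior k, and 0 otherwise.
X = np.zeros((num_users, num_items, num_behaviors), dtype=np.int8)
for u, v, k in triples:
    X[u, v, k] = 1

# One bipartite adjacency matrix per behavior type; each nonzero entry
# corresponds to an edge of that behavior in the behavior graph G.
adj_per_behavior = [X[:, :, k] for k in range(num_behaviors)]
print(adj_per_behavior[0])  # view-behavior edges
```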
3.2. Subgraph Generation Module
In GCN-based recommendation models on user–item interaction graphs, higher-order neighborhood users and the target user may not share common preferences and may even have opposite preferences. Therefore, simply aggregating all collaborative information from higher-order neighbors is not necessarily beneficial for user embedding. In this study, we partitioned the interaction graph into multiple subgraphs and used the idea of collaborative filtering to group users with similar preferences. We propose an unsupervised subgraph generation module, the structure of which is shown in Figure 2.
Given the input behavior graph $G$, the subgraph generation module partitions $G$ into subgraphs $\{G_1, G_2, \ldots, G_S\}$ with $\bigcup_{s=1}^{S} G_s = G$. In our experiments, the number of subgraphs $S$ was set manually. The purpose is to classify users with similar preferences into the same group [24]. The feature vector combining the graph structure and the ID embedding of each user is computed as:

$$f_u = \sigma\left(\mathbf{W}_f\left[e_u \,\|\, z_u\right] + \mathbf{b}_f\right) \quad (5)$$

where $f_u$ is the user feature obtained through feature fusion, $e_u$ is the ID embedding of user $u$, and $z_u$ is obtained by aggregating the first layer of neighborhoods of user $u$ in the interaction graph. $\mathbf{W}_f$ and $\mathbf{b}_f$ are the weight matrix and bias vector of the feature fusion, respectively. $\sigma$ is the activation function; LeakyReLU [25] was utilized in this method because it can encode both positive signals and small negative signals. To classify different users into subgraphs with different preferences, the obtained user features are converted into prediction vectors using a two-layer neural network as follows:

$$\hat{y}_u = \mathbf{W}_2\,\sigma\left(\mathbf{W}_1 f_u + \mathbf{b}_1\right) + \mathbf{b}_2 \quad (6)$$

where $\hat{y}_u$ denotes the prediction vector by which the user is assigned to a specific subgraph; the position of the maximum value of $\hat{y}_u$ indicates the group/subgraph to which the user belongs. $\sigma$ is the same activation function as in Equation (5), i.e., LeakyReLU. $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_1$, and $\mathbf{b}_2$ are the weight matrices and bias vectors of the two layers. The number of subgraphs $S$ equals the dimensionality of the prediction vector $\hat{y}_u$ and is a hyperparameter set in advance of an experiment. Therefore, when aggregating collaborative information from high-order neighborhoods, the model only needs to learn node embeddings from messages of nodes in the same subgraph, thereby effectively alleviating the over-smoothing problem.
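The following is a minimal NumPy sketch of the subgraph assignment described above, corresponding to Equations (5) and (6). The random weights and the mean-pooled neighborhood feature `z_u` are placeholder assumptions, not the paper's trained parameters.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

d, num_subgraphs = 8, 3
rng = np.random.default_rng(0)

e_u = rng.normal(size=d)   # ID embedding of user u
z_u = rng.normal(size=d)   # stand-in for the mean of first-layer neighbor embeddings

# Feature fusion (cf. Eq. (5)): concatenate, project, activate.
W_f = rng.normal(size=(d, 2 * d)); b_f = np.zeros(d)
f_u = leaky_relu(W_f @ np.concatenate([e_u, z_u]) + b_f)

# Two-layer assignment network (cf. Eq. (6)).
W1 = rng.normal(size=(d, d)); b1 = np.zeros(d)
W2 = rng.normal(size=(num_subgraphs, d)); b2 = np.zeros(num_subgraphs)
y_u = W2 @ leaky_relu(W1 @ f_u + b1) + b2

subgraph_id = int(np.argmax(y_u))  # subgraph to which user u is assigned
```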
3.3. Dynamic Encoding Layer
In practical recommendation scenarios, user–item interactions occur in a dynamic, temporal manner. For this reason, we adapted the positional encoding of the transformer model to incorporate dynamic interaction behaviors into the overall model. For the $k$th interaction behavior between user $u$ and item $v$, the corresponding timestamp is mapped to a time slot $\tau$ and embedded using a sinusoidal function:

$$b_{\tau}(2i) = \sin\left(\frac{\tau}{10000^{2i/d}}\right), \qquad b_{\tau}(2i+1) = \cos\left(\frac{\tau}{10000^{2i/d}}\right)$$

Here, $2i$ and $2i+1$ denote the even and odd position indexes in the dynamic time embedding $b_{\tau}$, respectively, and $d$ is the latent dimensionality.
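A small sketch of this time encoding, assuming a transformer-style sinusoidal form and an hourly slot granularity (the paper's exact slot size is not restated here):

```python
import numpy as np

def time_encoding(timestamp, slot_size, d):
    """Sinusoidal encoding of a timestamp mapped to a discrete time slot.

    slot_size: seconds per time slot (an assumption; the paper does not
    restate its slot granularity here). d: latent dimensionality.
    """
    slot = timestamp // slot_size            # time slot index
    i = np.arange(d // 2)
    freq = 1.0 / (10000 ** (2 * i / d))      # transformer-style frequencies
    enc = np.empty(d)
    enc[0::2] = np.sin(slot * freq)          # even position indexes
    enc[1::2] = np.cos(slot * freq)          # odd position indexes
    return enc

b_t = time_encoding(timestamp=1_650_000_000, slot_size=3600, d=8)
```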
3.4. Information Propagation Layer
In the user–item multi-behavior interaction subgraph, the same user exhibits different item preferences under different behaviors, and different users exhibit different preferences for items under the same behavior. To better exploit multi-behavior information, we propagate user, behavior, and item embeddings through a multi-head attention mechanism [26], as shown in Figure 3.
The behavior-aware messages passed between a user $u$ and an item $v$ under behavior $k$ are defined as:

$$m_{u \leftarrow v}^{k} = \Big\Vert_{h=1}^{H}\, \alpha_{u,v}^{k,h}\, \mathbf{V}^{k,h}\, \tilde{e}_v^{\,k}, \qquad m_{v \leftarrow u}^{k} = \Big\Vert_{h=1}^{H}\, \alpha_{u,v}^{k,h}\, \mathbf{V}^{k,h}\, \tilde{e}_u^{\,k}$$

Here, $G_s$ denotes the subgraph of $G$ containing item $v$. $m_{u \leftarrow v}^{k}$ and $m_{v \leftarrow u}^{k}$ denote the information transferred from item $v$ to user $u$ and from user $u$ to item $v$, respectively. $\mathbf{V}^{k,h}$ is the projection matrix of the $h$-head values for the $k$th behavior. $\tilde{e}_v^{\,k}$ denotes the fusion of the ID embedding of item $v$ and behavior $k$ with their dynamic encoding, and $\tilde{e}_u^{\,k}$ denotes the fusion of the ID embedding of user $u$ and behavior $k$ with their dynamic encoding, which are defined as:

$$\tilde{e}_v^{\,k} = \left(e_v \odot e_k\right) + b_{\tau}, \qquad \tilde{e}_u^{\,k} = \left(e_u \odot e_k\right) + b_{\tau}$$

where $\odot$ denotes the element-wise vector product, whose purpose is to embed the behavior into the model and make it behavior-aware. $\alpha_{u,v}^{k}$ denotes the attention weight learned between user $u$ and item $v$ under behavior $k$, where each head $\alpha_{u,v}^{k,h}$ is calculated as:

$$\alpha_{u,v}^{k,h} = \frac{\exp\left(\left(\mathbf{Q}^{k,h}\tilde{e}_u^{\,k}\right)^{\top}\left(\mathbf{K}^{k,h}\tilde{e}_v^{\,k}\right)/\sqrt{d/H}\right)}{\sum_{v' \in N_u^{k}} \exp\left(\left(\mathbf{Q}^{k,h}\tilde{e}_u^{\,k}\right)^{\top}\left(\mathbf{K}^{k,h}\tilde{e}_{v'}^{\,k}\right)/\sqrt{d/H}\right)}$$

where $\mathbf{Q}^{k,h}$ and $\mathbf{K}^{k,h}$ are the $h$-head query and key matrices under the $k$th behavior, respectively, and $N_u^{k}$ is the set of items neighboring user $u$ under behavior $k$ in the subgraph.
3.5. Information Aggregation Layer
The final operational step of the model is to aggregate the neighborhood information into the target node. The aggregated target node is nonlinearly transformed to obtain the embedding representation of the target node under each behavior type:

$$e_u^{k} = \sigma\Big(\sum_{v \in N_u} m_{u \leftarrow v}^{k}\Big), \qquad e_v^{k} = \sigma\Big(\sum_{u \in N_v} m_{v \leftarrow u}^{k}\Big)$$

where $N_u$ and $N_v$ denote the sets of neighborhood nodes of user $u$ and item $v$ in the user–item interaction subgraph, respectively. $\sigma$ denotes the LeakyReLU activation function. $e_u^{k}$ and $e_v^{k}$ denote the embedding representations obtained under behavior $k$ after aggregation.
Following the idea of iteratively updating embedding representations across layers in the GCN model, we represent the updating process of the multi-behavior higher-order correlations from layer 0 to layer $L$ as:

$$e_u^{(l+1)} = \sigma\Big(\mathbf{W}^{(l)} \sum_{k \in \mathcal{K}} \omega^{(l)}\, e_u^{k,(l)}\Big), \qquad e_v^{(l+1)} = \sigma\Big(\mathbf{W}^{(l)} \sum_{s \in S_v} \sum_{k \in \mathcal{K}} \omega^{(l)}\, e_{v,s}^{k,(l)}\Big)$$

where $S_v$ is the set of subgraphs to which item $v$ belongs, $e_u^{(L)}$ denotes the embedding representation of user $u$ after aggregating all behaviors and iterating over $L$ layers, $e_v^{(L)}$ denotes the embedding representation of item $v$ after aggregating over its set of subgraphs and behaviors and iterating over $L$ layers, $e_k$ denotes the embedding representation of behavior $k$, and $\mathbf{W}^{(l)}$ denotes the parameter matrix corresponding to layer $l$. In addition, a weight value $\omega^{(l)}$ is set for each layer of the embedding process.
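A compact sketch of this layer-wise update, assuming behavior-specific embeddings are summed, weighted per layer, projected, and activated; the random stand-in vectors replace the propagation layer's actual output.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

rng = np.random.default_rng(2)
d, L, num_behaviors = 8, 3, 3

e_u0 = rng.normal(size=d)               # layer-0 user embedding
layer_weights = np.full(L, 1.0 / L)     # per-layer weight values (assumed uniform)
W = [0.1 * rng.normal(size=(d, d)) for _ in range(L)]

layers = [e_u0]
for l in range(L):
    # Behavior-specific aggregated embeddings e_u^{k,(l)} would come from
    # the propagation/aggregation layers; random vectors stand in here.
    e_uk = rng.normal(size=(num_behaviors, d))
    nxt = leaky_relu(W[l] @ (layer_weights[l] * e_uk.sum(axis=0)))
    layers.append(nxt)

# The prediction layer (Section 3.6) combines all layer-wise embeddings.
e_u_final = np.sum(layers, axis=0)
```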
3.6. Multi-Behavior Prediction Layer
In the prediction layer of the model, the final representations of user $u$, item $v$, and behavior $k$ are obtained by combining the embedding representations obtained in each layer, as shown in Equations (18)–(20):

$$e_u = \sum_{l=0}^{L} e_u^{(l)}, \qquad e_v = \sum_{l=0}^{L} e_v^{(l)}, \qquad e_k = \sum_{l=0}^{L} e_k^{(l)}$$

With the learned embeddings of users, items, and behaviors, the prediction of user $u$'s preference for item $v$ under behavior $k$ is calculated via the inner product as:

$$\hat{x}_{u,v}^{k} = e_u^{\top}\, \mathrm{diag}(e_k)\, e_v$$

where $\mathrm{diag}(e_k)$ denotes the diagonal matrix whose diagonal entries are equal to $e_k$. To optimize the model, we use a pairwise loss for learning. Specifically, in mini-batch training, we define the positive interaction instances of user $u$ as the items with observed interactions in $\mathbf{X}$ [13]. To generate negative instances, we randomly sample $S$ non-interacted items of user $u$. For a given sample of positive and negative instances, we define the loss function as:

$$\mathcal{L} = \sum_{u \in \mathcal{U}} \sum_{s=1}^{S} -\ln \sigma\left(\hat{x}_{u,p_s} - \hat{x}_{u,n_s}\right) + \lambda \Vert \Theta \Vert_2^2$$

where $p_s$ and $n_s$ are the $s$th positive and negative instances, respectively, and $\lambda$ and $\Theta$ denote the regularization coefficient and the parameters of the model, respectively. The $L_2$ regularization method is used to prevent overfitting.
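A minimal sketch of the behavior-aware scoring and the pairwise loss, assuming the inner product uses diag(e_k) as reconstructed above and S sampled positive/negative pairs per user; all embeddings here are random placeholders.

```python
import numpy as np

def score(e_u, e_v, e_k):
    """Behavior-aware inner product: e_u^T diag(e_k) e_v."""
    return float(np.sum(e_u * e_k * e_v))

def pairwise_loss(e_u, pos_items, neg_items, e_k, params, lam):
    """Pairwise (BPR-style) loss over S (positive, negative) pairs,
    with L2 regularization over the model parameters."""
    loss = 0.0
    for e_p, e_n in zip(pos_items, neg_items):
        diff = score(e_u, e_p, e_k) - score(e_u, e_n, e_k)
        loss += -np.log(1.0 / (1.0 + np.exp(-diff)))   # -log sigmoid
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return loss + reg

rng = np.random.default_rng(3)
d, S = 8, 2
e_u, e_k = rng.normal(size=d), rng.normal(size=d)
pos = rng.normal(size=(S, d)); neg = rng.normal(size=(S, d))
print(pairwise_loss(e_u, pos, neg, e_k, params=[e_u], lam=0.01))
```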
3.7. Model Complexity Analysis
The time cost of GTCF4MB mainly comes from two aspects. The information propagation layer costs $O(K \times (M+N) \times d^2)$ for the embedding transformation, where $K$, $M$, $N$, and $d$ denote the number of behavior types, the number of users, the number of items, and the number of latent factors, respectively. Additionally, the prediction layer adds only minor computational cost. In conclusion, our GTCF4MB model achieves a time complexity comparable to that of graph-convolutional-network-based multi-behavior recommendation methods.
4. Experiments
In this section, we introduce the datasets, evaluation metrics, and parameter settings, and we conduct experiments on different datasets to evaluate the performance of our GTCF4MB model in comparison with multiple recommendation techniques. An Intel(R) Xeon(R) Gold 6140 CPU @ 2.30 GHz, an NVIDIA Tesla T4 GPU, and Ubuntu 16.04 were used in the experiments. The experiments address the following:
Performance of the GTCF4MB model compared to other models using three publicly available datasets.
Effect of removing the dynamic encoding layer on the GTCF4MB model.
Effect of removing the multi-head attention mechanism on the GTCF4MB model.
Effect of removing the subgraph generation module on the GTCF4MB model.
Effect of varying the number of convolution layers on the GTCF4MB model.
Effect of varying the number of subgraphs on the GTCF4MB model.
4.1. Datasets
To evaluate the effectiveness of the proposed model, experiments were conducted on three datasets in this study. The datasets are described as follows.
Taobao dataset: this dataset was provided by Taobao, one of the largest e-commerce platforms in China, and contains four types of user–item interactions: view, add to cart, favorite, and purchase.
Beibei dataset: this dataset was provided by Beibei.com (accessed on 13 May 2022), the largest online retail site for baby products in China. It contains three types of user–item interactions: view, add to cart, and purchase.
MovieLens dataset: this benchmark dataset is widely used for evaluating the performance of recommender systems. In this study, we distinguished multiple behavior types based on the rating scores $r$. More specifically, $r \le 2$ indicates disliked behavior, $2 < r < 4$ indicates neutral behavior, and $r \ge 4$ indicates liked behavior. In this dataset, the target and auxiliary behaviors were set as like ($r \ge 4$) and neutral/dislike ($r < 4$), respectively.
Following the setup of previous studies, each dataset was randomly divided into training, validation, and test sets at a ratio of 8:1:1. For each group of experiments, we report results averaged over 5 runs.
4.2. Evaluation Metrics
For the performance evaluation, all experiments used two representative metrics, namely the hit ratio (HR) and normalized discounted cumulative gain (NDCG). These two metrics have been widely used in Top-N recommendation tasks [5,27], where higher HR and NDCG scores represent better recommendation results. To evaluate the models effectively and fairly, each positive instance was paired with 99 negative instances sampled from the user's non-interacted items. HR and NDCG are calculated as follows:

$$\mathrm{HR@}K = \frac{N_{hit}}{N}, \qquad \mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}, \qquad \mathrm{DCG@}K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$

where $N_{hit}$ indicates the number of users whose test-set items appear in the Top-K recommendation list and $N$ indicates the total number of users in the test set. DCG@K indicates the discounted cumulative gain, and IDCG@K is the maximum attainable DCG@K, i.e., that of the ideal recommendation list for a user. $K$ is the size of the recommendation list, and $rel_i$ is the relevance of the $i$th position of the recommendation list.
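A small sketch of how HR@K and NDCG@K can be computed per user under the 1-positive-plus-99-negatives protocol; with a single relevant item, IDCG@K equals 1, so NDCG@K reduces to a reciprocal log of the positive item's rank.

```python
import numpy as np

def hr_ndcg_at_k(scores, pos_index, k=10):
    """HR@K and NDCG@K for one user's list of 1 positive + 99 negatives.

    scores: predicted scores for the 100 candidate items;
    pos_index: position of the held-out positive item in `scores`.
    """
    rank = int(np.argsort(-scores).tolist().index(pos_index))  # 0-based rank
    hit = 1.0 if rank < k else 0.0
    # With one relevant item, IDCG@K = 1/log2(2) = 1, so NDCG@K becomes
    # 1/log2(rank + 2) when the item appears in the Top-K list.
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hit, ndcg

rng = np.random.default_rng(4)
scores = rng.normal(size=100)
print(hr_ndcg_at_k(scores, pos_index=0, k=10))
```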
4.3. Parameter Setting
In this experiment, the proposed GTCF4MB model was implemented using the TensorFlow framework and optimized with the Adam optimizer in the training phase. The default learning rate was $1 \times 10^{-3}$, the decay rate was 0.96, the number of attention heads was fixed at 2, the regularization coefficient $\lambda$ was chosen from {0.1, 0.05, 0.01, 0.005, 0.001}, and the batch size was chosen from {32, 128, 256}.
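For illustration, a hypothetical enumeration of the hyperparameter grid implied by these settings (the variable names are ours, not the paper's):

```python
from itertools import product

# Fixed settings and search grid mirroring the description above.
base = dict(learning_rate=1e-3, decay_rate=0.96, num_heads=2)
grid = dict(reg_lambda=[0.1, 0.05, 0.01, 0.005, 0.001],
            batch_size=[32, 128, 256])

configs = [dict(base, reg_lambda=lam, batch_size=bs)
           for lam, bs in product(grid["reg_lambda"], grid["batch_size"])]
print(len(configs), "candidate configurations")  # 15
```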
4.4. Baselines
To comprehensively assess model performance, GTCF4MB was compared experimentally with benchmark methods spanning different research objectives, as described below.
NCF [1]: this approach uses neural networks to enhance the recommendation performance of collaborative filtering. In this study, we considered three NCF variants with different modeling approaches: (1) combining matrix factorization and a multi-layer perceptron (NCF-N); (2) modeling user–item interactions using a multi-layer perceptron (NCF-M); and (3) taking a fixed element-wise product of user and item embeddings (NCF-G).
NGCF [5]: this method is a graph neural collaborative filtering model that maps users and items into latent representations using an information propagation model with higher-order neighborhood connectivity relations.
AutoRec [3]: this method maps user–item interaction features to latent low-dimensional representations by stacking multiple layers of perceptrons.
NGCF-M [5]: this method combines multi-behavior relational learning with a graph collaborative filtering model under graph neural networks.
NMTR [28]: this approach is a multitask recommendation framework that designs a shared embedding layer for different types of behaviors.
MBGCN [10]: this method models multiple user–item interaction behaviors and performs behavior-aware embedding propagation through graph convolutional networks.
MB-GMN [11]: this approach extracts user-personalized information through meta-learning and injects it into a graph-based transfer learning framework, enabling more accurate recommendations for users' complex interests.
GNMR [13]: this method models different types of user–item interactions under the message-passing structure of graph neural networks and designs a relational aggregation network for embedding propagation over the interaction graph.
4.5. Experimental Results and Analysis
The performance results of each model are shown in Table 1. We first analyze the performance of the GTCF4MB model compared with the other models on the three datasets. The results show that the GTCF4MB model outperformed the other models in terms of both the HR@10 and NDCG@10 metrics. We also observe that the graph-neural-network-based models performed better than the autoencoder-based model, indicating the rationality of mining higher-order collaborative information on user–item relationships.
In comparing the multi-behavior recommendation models based on the Taobao dataset, the HR@10 metrics of the GTCF4MB model were improved by 32.96%, 9.81%, and 22.22% over the latest baseline models MBGCN, MB-GMN, and GNMR, respectively. The NDCG@10 metrics of the GTCF4MB model improved by 32.71%, 8.64%, and 27.46% over those of the latest baseline models MBGCN, MB-GMN, and GNMR, respectively.
In the Beibei dataset, the HR@10 metrics of the GTCF4MB model improved by 12.01%, 5.24%, and 8.56% over the latest baseline models MBGCN, MB-GMN, and GNMR, respectively. The NDCG@10 metrics of the GTCF4MB model improved by 12.08%, 4.50%, and 6.16% over those of the latest baseline models MBGCN, MB-GMN, and GNMR, respectively.
In the MovieLens dataset, the HR@10 metrics of the GTCF4MB model improved by 10.29%, 2.21%, and 6.31% over the latest baseline models, MBGCN, MB-GMN, and GNMR, respectively. The NDCG@10 metrics of the GTCF4MB model improved by 12.21%, 2.93%, and 7.00% over the latest baseline models, MBGCN, MB-GMN, and GNMR, respectively.
As seen in Table 1, the experiments on the Taobao dataset show that the HR@10 metrics were generally lower for all models. This is probably because the Taobao dataset is sparser than the other two datasets; this metric was also generally low in previous studies.
It can also be seen from Table 2 that the GTCF4MB model achieved improved performance under different @K values on the Beibei dataset. These results indicate the feasibility of introducing a multi-head attention mechanism: compared with traditional graph convolutional networks based on the collaborative information of higher-order neighborhoods, it models behavior-specific user–item interaction features better, adaptively determines the preferences underlying different user behaviors, and improves the recommendation performance.
4.6. Ablation Study
To verify the validity of the GTCF4MB model's submodules, the model was evaluated on the three datasets with the following components removed, based on HR@10 and NDCG@10:
GTCF4MB-Ti: This variant model of GTCF4MB removes the dynamic encoding layer.
GTCF4MB-At: This variant model of GTCF4MB removes the multi-head attention mechanism.
GTCF4MB-SG: This variant model of GTCF4MB removes the sub-graph generation module.
As shown in Table 3, the full GTCF4MB model performed better than its ablated variants. Therefore, the following conclusions may be drawn: (1) compared with static fixed processing, adding the dynamic encoding layer better captured the dynamic multi-behavior information of users and positively affected the model; (2) the multi-head attention mechanism had a large impact on performance, and removing it led to a significant decrease, indicating the importance of this module, which adaptively models user–item interactions under specific behaviors; and (3) the subgraph generation module mainly affected the number of layers of iterative graph convolution, which is analyzed in detail in Section 4.7.
4.7. Hyperparameter Analysis
To investigate the effectiveness of the GTCF4MB model in deeper hierarchical structures, we increased the depth of the model. To verify whether the number of convolutional layers $l$ and the number of generated subgraphs $s$ can mitigate the over-smoothing problem, experiments were conducted with $l = 2, 3, 4, 5, 6, 7$ and $s = 2, 3, 4$. The experimental results are shown in Figure 4.
First, the effect of the number of graph convolution layers was analyzed. As shown in Figure 4, the GTCF4MB model with different numbers of subgraphs exhibited improved HR@10 values even after iterating beyond three or four layers. In contrast, the performance of the previously developed models evaluated in this study degraded owing to the over-smoothing problem once the number of iterative graph convolution layers exceeded three or four. This indicates that the proposed model continues to improve slightly under iterative multi-layer convolution, especially with six or seven layers. Beyond seven layers, a node aggregates almost all higher-order information; therefore, the maximum number of layers was set to seven in the experiments, as continuing to stack layers did not significantly affect the results. These results show that the developed model can effectively mitigate the over-smoothing problem.
Next, the effect of the number of generated subgraphs was analyzed. The results in Figure 4 show that, for both datasets, the GTCF4MB_2 model performed better when the number of layers did not exceed three; beyond three layers, both the GTCF4MB_3 and GTCF4MB_4 models outperformed the GTCF4MB_2 model. This occurred because, with more than three layers, the amount of information that a node is exposed to during propagation increases dramatically, so the GTCF4MB_3 and GTCF4MB_4 models with more subgraphs achieved higher HR values. As more layers were stacked, the GTCF4MB_4 model showed no further improvement, possibly because the increased number of subgraphs caused the model to focus more on extracting higher-order neighborhood information while ignoring information from lower-order neighborhoods. This may cause the performance of the GTCF4MB_4 model to plateau as the number of layers increases.
5. Conclusions
In this study, we proposed a graph transformer collaborative filtering method for multi-behavior recommendation to achieve improved recommendation performance on multi-behavior data. The system classifies users with similar preferences into the same group by utilizing an unsupervised subgraph generation module and propagates node information within each subgraph. This effectively alleviates the over-smoothing problem caused by increasing the number of graph convolution layers. Simultaneously, a dynamic encoding layer and a multi-head self-attention mechanism were introduced to dynamically embed user behavior and item representations during embedding propagation. This permits the proposed method to explore, at a deeper level, the user preferences exhibited under different behaviors. On all three datasets, the GTCF4MB model achieved better performance than all other evaluated models.
In summary, we argued that classifying users with similar preferences in the user–item interaction graph into the same subgraph not only avoids the noisy information introduced by users without similar preferences, but also improves the embedding representation of users. We then introduced a multi-head attention mechanism with the aim of mapping the embedding representations into the attention feature space through query, key, and value vectors (i.e., Q, K, V). The results show that the embedding representations can be effectively enhanced by this method.
Considering that most current recommendation models are supervised models that utilize user–item interactions as supervised signals, it is difficult to obtain good recommendation results when the user–item interaction data is sparse. Therefore, in future work, methods such as self-supervised graph learning should be incorporated into recommendation systems to address the problem of sparse data caused by relying on supervised signals.