Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations

Zhu, Wenhao; Xie, Yujun; Huang, Qun; Zheng, Zehua; Fang, Xiaozhao; Huang, Yonghui; Sun, Weijun

doi:10.3390/math10162956

Open AccessArticle

Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations

by

Wenhao Zhu

,

Yujun Xie

,

Qun Huang

,

Zehua Zheng

,

Xiaozhao Fang

,

Yonghui Huang

and

Weijun Sun

^*

School of Automation, Guangdong University of Technology, Guangzhou 510006, China

^*

Author to whom correspondence should be addressed.

Mathematics 2022, 10(16), 2956; https://doi.org/10.3390/math10162956

Submission received: 26 July 2022 / Revised: 11 August 2022 / Accepted: 14 August 2022 / Published: 16 August 2022

(This article belongs to the Special Issue Deep Learning and Adaptive Control)

Download

Browse Figures

Versions Notes

Abstract

:

Graph convolutional networks are widely used in recommendation tasks owing to their ability to learn user and item embeddings using collaborative signals from high-order neighborhoods. Most of the graph convolutional recommendation tasks in existing studies have specialized in modeling a single type of user–item interaction preference. Meanwhile, graph-convolution-network-based recommendation models are prone to over-smoothing problems when stacking increased numbers of layers. Therefore, in this study we propose a multi-behavior recommendation method based on graph transformer collaborative filtering. This method utilizes an unsupervised subgraph generation model that divides users with similar preferences and their interaction items into subgraphs. Furthermore, it fuses multi-headed attention layers with temporal coding strategies based on the user–item interaction graphs in the subgraphs such that the learned embeddings can reflect multiple user–item relationships and the potential for dynamic interactions. Finally, multi-behavior recommendation is performed by uniting multi-layer embedding representations. The experimental results on two real-world datasets show that the proposed method performs better than previously developed systems.

Keywords:

recommendation system; graph convolutional network; subgraph; transformer; multi-behavior recommendation

MSC:

68T07

1. Introduction

The exponential increase in internet data has led to information-overload problems. Recommender systems are widely employed by many internet services (e.g., e-commerce, advertising systems, online reviews) to reduce information overload and recommend the most relevant items to users.

In recent years, the introduction of deep learning models based on graph neural networks has significantly improved recommendation systems. Earlier studies such as NCF [1] and DMF [2] used multi-layer perceptrons to handle nonlinear interactions. In addition, autoencoder-based [3,4] approaches map high-dimensional sparse user–item interaction graphs to increase the density of low-dimensional embedding representations. Later studies such as those considering NCF and NGCF [5] used graph neural networks to mine high-order interaction relationships and conduct neighborhood-based feature aggregation. However, most of these approaches only model single-category user–item interactions (e.g., views and purchases) and ignore the characteristics of multiple user behaviors in real recommendation scenarios. For example, in a typical e-commerce platform, the interaction relationship between the same user and an item may be of multiple categories, including multiple interactions such as viewing, adding to cart, favorites, and purchasing (as shown in Figure 1). Different interaction behaviors provide different semantic information about the user–item association, which may generate different personalized preferences under multiple user behaviors.

In current research, models based on graph convolution networks (GCNs) [6] can learn user and item embedding representations in recommendation tasks by aggregating feature information from multiple layers of local neighborhoods. For example, NGCF [5] can exploit the higher-order connectivity of graphs to encode higher-order information, thereby alleviating the sparsity problem in recommendation tasks. Based on simplifying the nonlinear features in GCNs, LR-GCN [7] introduces a residual network structure to optimize the model. The results show that the model alleviates the over-smoothing problem, and greatly improves NGCF in terms of recommendation accuracy. Owing to the high computational complexity of the NGCF model, LightGCN [8] omits feature transformation and nonlinear activation and solely retains neighborhood aggregation for collaborative filtering, improving upon the accuracy and efficiency of previous methods. However, the LightGCN model reaches its peak performance after stacking several layers (two, three, or four layers), and further stacking of layers leads to performance degradation. Therefore, current GCN models simply aggregate all neighborhood information to update the node embedding representation. The nodes then become indistinguishable after the convolution of the multi-layer graph, which represents an over-smoothing problem.

The above-mentioned problems may be summarized as follows. First, in the user–item interaction graph, higher-order neighboring users may have differing or even opposite preferences, and the over-smoothing problem may occur by simply performing graph convolution operations on all higher-order information. Second, only static user–item interactions are considered, and less research has been performed on dynamic user–item interactions. Finally, in multi-behavior recommendation tasks, only single-type edges are considered important, and the real recommendation scenario is ignored. Therefore, we propose a graph transformer collaborative filtering method for multi-behavior recommendation (GTCF4MB) based on graph transformer collaborative filtering.

The contributions of our work are summarized as follows:

A subgraph generation model is proposed that groups users and the items they interact with into completely different subgraphs and performs higher-order graph convolution operations within the subgraphs to alleviate the over-smoothing problem caused by stacking too many layers.
A temporal encoding strategy is proposed to capture the impact between different kinds of user–item interactions in dynamic scenarios and incorporate dynamic behaviors into the overall information aggregation model. In addition, an attention mechanism that relies on node and edge types is introduced to distinguish the importance of different edge types.
Using high-order multi-hop information, nodes (users and items) and edges (behaviors) under different weights are interacted with to achieve multi-behavior recommendations, alleviate the model cold-start problem, and increase model interpretability.

2. Related Work

2.1. Recommendation Methods for Multi-Behavior Models

Many previously developed recommender systems are used to characterize not only the user, but also the different types of relationships from the user or item side [9]. For example, the MBGCN [10] model is a multi-behavioral model of user–item interactions that uses graph convolutional networks to capture the multi-behavioral interactions between users and items and propagates behavior-aware embeddings through graph convolutional networks (GCNs). The MB-GMN [11] model utilizes multi-behavioral pattern encoding, meta-graph neural networks, and meta-learning migration networks, thus enabling complex user interests for learning and making more accurate recommendations. The GHCF [12] model not only utilizes user–item multi-behavior relations for higher-order embedding representations of nodes and edges but also considers higher-order heterogeneous relations. The GNMR [13] model achieves the modeling of different types of user–item interactions under the message passing structure of graph neural networks by designing a relational aggregation network.

External data can also be used in multi-behavioral recommendations. For example, to address the problem of sparse item data, many social recommendation systems have been proposed to improve model prediction performance through user–user social relationships and user–item interactions [14,15]. Meanwhile, another form of multi-behavioral recommendation model uses external knowledge graph information to construct different structural relationships between a collection of entities or items [16,17].

2.2. Graph Convolution Network

A GCN [6] is a typical model of a graph neural network, whose basic idea is to aggregate neighborhood features into a representation of the target node by iteratively performing graph convolution. Consider a behavior graph

G = (V, E, R)

, where

V

denotes the set of nodes,

E

denotes the set of edges, and

R

denotes the set of behaviors. The information transfer layer of the node representation is given as:

E^{(l + 1)} = σ ({\hat{D}}^{- \frac{1}{2}} \hat{A} {\hat{D}}^{- \frac{1}{2}} E^{(l)} W_{r}^{(l)})

(1)

where

\hat{A} = A + I

and

\hat{D} = D + I

.

A

,

D

, and

I

are the adjacency, diagonal node degree, and unit matrices, respectively. l is used to integrate the self-loop connections on the nodes.

E^{(l)}

and

W_{r}^{(l)}

denote the feature matrix of layer

l

and weight matrix of a particular behavior

r

, respectively, and

σ (\cdot)

denotes the nonlinear activation function (e.g., ReLU).

Recent graph convolution studies [7,8,18,19] have found that the proper simplification of GCNs can not only reduce computational complexity but also further improve the performance of collaborative filtering (CF) tasks. For example, LightGCN [8] simplified a GCN model on NGCF [5] by omitting the feature transform

W^{(l)}

and nonlinear activation function

σ

. The matrix form of its information transfer layer can be expressed as:

E^{(l + 1)} = ({\hat{D}}^{- \frac{1}{2}} \hat{A} {\hat{D}}^{- \frac{1}{2}}) E^{(l)}

(2)

In the user–item interaction bipartite graph, the matrix form expressions of its information transfer layer can also be abstractly represented as:

e_{u}^{(l + 1)} = \sum_{v \in N_{u}} \frac{1}{\sqrt{|N_{u}|} \sqrt{|N_{v}|}} e_{v}^{(l)}

(3)

e_{v}^{(l + 1)} = \sum_{u \in N_{v}} \frac{1}{\sqrt{|N_{v}|} \sqrt{|N_{u}|}} e_{u}^{(l)}

(4)

where

e_{u}^{(l)}

and

e_{v}^{(l)}

denote the embedding representations of user

u

and item

v

, respectively, after

l

-layer propagation;

N_{u}

denotes the set of items interacting with user

u

; and

N_{v}

denotes the set of users interacting with item

v

. The LightGCN model reaches its peak performance after stacking four layers of neighborhood information, and the performance degrades when layers continue to be stacked.

2.3. Transformer

A transformer [20] is an end-to-end model based on the multi-head attention mechanism and feed-forward networks proposed by the Google team in recent years. Transformer models have been widely used in the field of natural language processing and have shown excellent performance in recent years. For example, Graphormer [21] applied a transformer model to graph data and utilized spatial, centrality, and edge encoding on top of the existing data to help the transformer model capture the structural information of the graph, which made the graph nodes have a stronger representation. GTN [22] used the transformer model to learn meta-paths that convert a heterogeneous input graph into a meta-path graph that is useful for each task and learn node representations on the graph in an end-to-end manner. HGT [23] utilizes a heterogeneous mutual-attention mechanism for the heterogeneous graph meta-relational triad, using meta-relations to parameterize the weight matrix to calculate the attention on each edge and introduce a relative time encoding strategy to enable HGT to learn dynamic representations of heterogeneous graphs.

3. Method

3.1. Problem Definition

User–item interaction graphs are often used as the focus of recommendation system studies. Let

A \in R^{I \times J \times K}

be the behavioral interaction matrix of users and items, where users are

U = u_{1}, \dots, u_{i}, \dots, u_{I}

, items are

V = v_{1}, \dots, v_{j}, \dots, v_{J}

, and behaviors are

R = r_{1}, \dots, r_{k}, \dots, r_{K}

. For a non-zero term

a_{i j}^{k} \in A

, the user

u \in U

interacts with item

v \in V

under the behavior

r \in R

; conversely, the term is zero when there is no interaction between user and item under the behavior. Based on the user–item interaction matrix, a behavior graph

G = (V, E, R)

can be formed, where

V

denotes the set of nodes,

E

denotes the set of edges, and

R

denotes the set of behaviors. For a nonzero term

a_{i j}^{k}

, there exists an edge between user

u_{i}

under behavior

r_{k}

and term

v_{j}

. Using the above information as the input to the GCN model, the representation of users and items is learned by iteratively aggregating the features of neighboring nodes in the bipartite graph. Let

e_{i}^{(0)}

denote the ID embedding of user

u_{i}

,

e_{j}^{(0)}

the ID embedding of item

v_{j}

, and

e_{k}^{(0)}

the ID embedding of behavior

r_{k}

.

3.2. Subgraph Generation Module

GCN-based recommendation models for user–item interaction graphs, higher-order neighborhood users, and target users may not have common preferences or may even have opposite preferences. Therefore, simply aggregating all synergistic information from higher-order neighbors is not necessarily beneficial for user embedding. In this study, we partitioned the interaction graph into multiple subgraphs and used the idea of collaborative filtering to build a model of users with similar preferences. We propose an unsupervised learning subgraph generation module, the model of which is shown in Figure 2.

Given the input behavior graph

G

, this subgraph generation module partitions the behavior graph

G

into subgraphs

G_{s}

with

s \in 1, \dots, N_{s}

. In the experiment, the number of subgraphs was obtained by artificial settings. The purpose was to generate a task that classifies users with similar preferences into the same group [24]. The feature vector consisting of the graph structure and ID embedding of each user is represented as:

F_{i} = σ (W_{1} (e_{i}^{(0)} + e_{i}^{(1)}) + b_{1})

(5)

where

F_{i}

is the user feature obtained through feature fusion,

e_{i}^{(0)}

is the ID embedding of user

u_{i}

, and

e_{i}^{(1)}

is obtained by aggregating the first layer of neighborhoods in the interaction graph of the user representation.

W_{1} \in R^{d \times d}

and

b_{1} \in R^{1 \times d}

are the weight matrix and bias vector of feature fusion, respectively.

σ

is the activation function, such as LeakyReLU. LeakyReLU [25] was utilized in this method as it can encode either positive signals or small negative signals. To classify different users into subgraphs with different preferences, the obtained user features are converted into prediction vectors using a two-layer neural network as follows:

U_{o} = W_{3} \cdot σ (W_{2} F_{i} + b_{2}) + b_{3}

(6)

where

U_{o}

denotes the prediction vector to which the user is assigned to a specific subgraph. The position of the maximum value of

U_{o}

indicates the group/subgraph to which the user belongs.

σ

is the same as the activation function in Equation (5), i.e., LeakyReLU.

W_{2} \in R^{d \times d}

,

W_{3} \in R^{d \times N_{s}}

,

b_{2} \in R^{1 \times d}

, and

b_{3} \in R^{1 \times N_{s}}

are the weight matrices and bias vectors of the two layers. The number of subgraphs is the same as the number of dimensions of the prediction vector

U_{o}

, which is a hyperparameter that can be set in advance of an experiment. Therefore, in aggregating the synergistic information from high-order neighborhoods, the model only needs to learn the embedding of nodes using messages from nodes in the same subgraph thereby effectively alleviating the over-smoothing problem.

3.3. Dynamic Encoding Layer

In practical recommendation scenarios, user–item interactions are often performed in a dynamic and temporal manner. For this reason, we considered positional encoding in the transformer model to incorporate dynamic interaction behaviors into the overall model developed in this study. For the kth interaction behavior

r_{i, j}^{k}

of user

u_{i}

and item

v_{j}

, the corresponding timestamp

t_{i, j}^{k}

is mapped to the time slot

τ (t_{i, j}^{k})

and embedded in

T_{i, j}^{k}

using a sinusoidal function.

T_{(i, j)}^{k, (2 l)} = s i n (\frac{τ (t_{i, j}^{k})}{10000^{\frac{2 l}{d}}})

(7)

T_{(i, j)}^{k, (2 l + 1)} = s i n (\frac{τ (t_{i, j}^{k})}{10000^{\frac{2 l + 1}{d}}})

(8)

Here,

(2 l)

and

(2 l + 1)

are denoted as the even and odd position indexes in the dynamic time embedding, respectively, and

d

is the potential dimensionality.

3.4. Information Propagation Layer

In the user–item multi-behavior interaction subgraph, the same user produces different item preferences for different behaviors, and different users produce different preferences for items under the same behavior. To better utilize multi-behavior information modeling, we propagated the multiheaded attention mechanism for user, behavior, and item embedding [26], as shown in Figure 3.

m_{i \leftarrow j s}^{k} = \prod_{h = 1}^{H} ω_{i, j s, k}^{h} \cdot V_{k}^{h} \cdot P_{j s}

(9)

m_{j s \leftarrow i}^{k} = \prod_{h = 1}^{H} ω_{j s, i, k}^{h} \cdot V_{k}^{h} \cdot P_{i}

(10)

Here,

j s

denotes the subgraph of

s \in 1, \dots, N_{s}

of item

v_{j s}

in

G_{s}

.

m_{i \leftarrow j s}^{k}

and

m_{j s \leftarrow i}^{k}

denote the information transfer from item

v_{j s}

to user

u_{i}

and from user

u_{i}

to item

v_{j s}

, respectively.

V_{k}^{h}

is the projection matrix of the h-head values for the kth behavior.

P_{j s}

denotes the fusion of the ID embedding of item

v_{j s}

and behavior

r_{k}

with their dynamic encoding function, and

P_{i}

denotes the fusion of the ID embedding of user

u_{i}

and behavior

r_{k}

with their dynamic encoding function, which are defined as:

P_{j s} = ϕ (e_{i}, e_{k}) \oplus T_{i, j s}^{k}

(11)

P_{i} = ϕ (e_{j s}, e_{k}) \oplus T_{i, j s}^{k}

(12)

where

ϕ (\cdot)

denotes the vector element product with the purpose of embedding the behavior into the model to make it behavior-aware.

ω_{i, j s, k}^{h}

denotes the attention weight learned by user

u_{i}

under behavior

u_{i}

with item

v_{j s}

, where

ω_{j s, i, k}^{h}

is calculated similarly as:

{\bar{ω}}_{i, j s, k}^{h} = \frac{{(Q_{k}^{h} p_{i})}^{⊤} (K_{k}^{h} p_{j s})}{\sqrt{d / H}}

(13)

ω_{i, j s, k}^{h} = s o f t m a x ({\bar{ω}}_{i, j s, k}^{h}) = \frac{e x p^{({\bar{ω}}_{i, j s, k}^{h})}}{Σ_{e_{i, j s^{'}}^{k} \in E_{u}} e x p^{({\bar{ω}}_{i, j s^{'}, k}^{h})}}

(14)

where

Q_{k}^{h}

and

K_{k}^{h}

are the h-head query matrix and key matrix under the kth behavior, respectively.

3.5. Information Aggregation Layer

The final operation step of the model is to aggregate the neighborhood information into the target node. The aggregated target node is nonlinearly transformed to obtain the embedding representation of the target node under the behavior type, which may be represented as:

e_{i}^{k} = f (\sum_{v_{j s} \in N_{i}} m_{i \leftarrow j s}^{k})

(15)

e_{j s}^{k} = f (\sum_{u_{i} \in N_{j s}} m_{j s \leftarrow i}^{k})

(16)

where

N_{i}

and

N_{j s}

denote the set of neighborhood nodes of user

u_{i}

and item

v_{j s}

in the user–item interaction subgraph, respectively.

f (\cdot)

denotes the LeakyReLu activation function.

e_{i}^{k}

and

e_{j s}^{k}

denote the embedding representation obtained under behavior k after aggregation.

Based on the idea of an iterative multi-layer update embedding representation of the GCN model, we represent the updating process of the multi-behavior higher-order co-relation from layer 0 to layer L as:

e_{i} = \frac{1}{L + 1} \sum_{l = 0}^{L} \sum_{k = 1}^{K} e_{i}^{k, l}

(17)

e_{j} = \frac{1}{L + 1} \sum_{s \in S} \sum_{l = 0}^{L} \sum_{k = 1}^{K} e_{j s}^{k, l}

(18)

e_{k} = \sum_{l = 0}^{L} w_{r}^{l} e_{k}^{l}

(19)

where

S

is the set of subgraphs to which item

v_{j}

belongs,

e_{i}

denotes the embedding representation of user

u_{i}

after aggregating all behaviors and iterating over L layers,

e_{j}

denotes the embedding representation of item

v_{j}

after aggregating the set of subgraphs and behaviors and iterating over L layers,

e_{k}

denotes the embedding representation of behavior

r_{k}

, and

w_{k}^{l}

denotes the parameter matrix corresponding to layer

l

. In addition, a weight value

\frac{1}{L + 1}

is set for each layer embedding process.

3.6. Multi-Behavior Prediction Layer

In the prediction layer of this model, the final representations of user

u_{i}

, item

v_{j}

, and behavior

r_{k}

are obtained by combining the embedding representations obtained in each layer, as shown in Equations (18)–(20). With the learned embeddings of users

e_{i}

, items

e_{j}

, and behaviors

e_{k}

, the prediction of user preferences for items is calculated via the inner product as:

{\hat{y}}_{i, j}^{(k)} = e_{i}^{T} \cdot d i a g (e_{k}) \cdot e_{j}

(20)

where

d i a g (e_{k})

denotes the diagonal matrix with diagonal entries all equal to

e_{k}

. To optimize the model, we use pairwise loss for learning. Specifically, in small-batch training, we define the positive interaction terms of user

u_{i}

in

S

(for example,

v_{p_{1}}, v_{p_{2}}, \dots, v_{p_{S}}

) [13]. In the negative instances of the generation process, we randomly sampled

S

non-interactive terms

v_{n_{1}}, v_{n_{2}}, \dots, v_{n_{S}}

of user

u_{i}

. For a given sample of positive and negative instances, we define the loss function as:

L = \sum_{i = 1}^{I} \sum_{s = 1}^{S} \sum_{k = 1}^{K} m a x (0, 1 - {\hat{y}}_{i, p_{s}}^{(k)} + {\hat{y}}_{i, n_{s}}^{(k)}) + λ {‖Θ‖}_{F}^{2}

(21)

where

p_{s}

and

n_{s}

are the sth positive and negative instances, respectively.

λ

and

Θ

denote the regularization coefficients and parameters of the model, respectively. The 𝐿2 regularization method is used to prevent overfitting.

3.7. Model Complexity Analysis

The time cost of GTCF4MB mainly comes from two aspects. The information propagation layer costs

O = (K \times (I + J) \times d^{2})

for embedding transformation, where

K

,

I

,

J

, and

d

denote the behavior types, the number of users, the number of items, and the latent factors. Additionally, the prediction layer costs a minor

O = (S \times d^{2})

computational complexity. In conclusion, our GTCF4MB model could achieve a comparable model time complexity with graph-convolution-network-based multi-behavior recommendation methods.

4. Experiments

In this section, we introduce the dataset, evaluation metrics, and parameter settings and conduct experiments on different datasets to evaluate the performance of our GTCF4MB model via comparison with multiple recommendation techniques. An Intel(R) Xeon(R) Gold 6140 CPU @2.30GHz, NVIDIA Tesla T4 GPU, and Ubuntu 16.4 were used in the experiments. Experiments were conducted considering the following:

Performance of the GTCF4MB model compared to other models using three publicly available datasets.
Effect of removing the dynamic coding layer on the GTCF4MB model.
Effect of removing the multi-head attention mechanism on the GTCF4MB model.
Effect of removing the subgraph generation module on the GTCF4MB model.
Effect of varying the number of convolution layers on the GTCF4MB model.
Effect of varying the number of subgraphs on the GTCF4MB model.

4.1. Datasets

To evaluate the effectiveness of the proposed model, experiments were conducted on three datasets in this study. The datasets are described as follows.

Taobao dataset: this dataset was provided by Taobao, one of the largest e-commerce platforms in China, and contained data from four types of user–item relationships: view, add to cart, favorite, and purchase.

Beibei dataset: this dataset was provided by Beibei.com (accessed on 13 May 2022), the largest online retail site for baby products in China. It involved three types of user–item interactions: view, add to cart, and purchase.

MovieLens dataset: this dataset is a benchmark dataset widely used for evaluating the performance strengths and weaknesses of recommender systems. In this study we distinguished multiple behavior types based on evaluation scores of

r = 1, \dots, 5

. More specifically,

1 \leq r \leq 2

indicates disliked behavior,

2 < r < 4

indicates neutral behavior, and

r \geq 4

indicates liked behavior. In this dataset, the target and auxiliary behaviors were set as (like) and (neutral, dislike), respectively.

Based on the previous study setup, for each dataset, we randomly divided it into training, validation, and test sets in the ratio of 8:1:1. For each group of experiments, we averaged the results after 5 times.

4.2. Evaluation Metrics

For the performance evaluation, all experiments used two representative metrics, namely the hit ratio (HR) and normalized discounted cumulative gain (NDCG). These two metrics have been widely used in Top-N recommendation tasks [5,27], where higher or lower HR and NDCG scores can represent good or bad recommendation results. To evaluate the model effectively and fairly, each positive instance was sampled from the user’s interactive and non-interactive items with 99 negative instances. HR and NDCG are calculated as follows.

H R @ K = \frac{H i t s @ K}{U s e r s}

(22)

\{\begin{array}{l} D C G @ K = \sum_{i = 1}^{k} \frac{2^{r e l_{i}} - 1}{\log_{2} (i + 1)} \\ N D C G @ K = \frac{D C G @ K}{I D C G @ K} \end{array}

(23)

where

H i t @ K

indicates the number of users whose items in the test set appear in the Top-K recommendation list.

U s e r s

indicates the total number of users in the test set. DCG@K indicates the discounted cumulative gain and is also a parameter. IDCG@K is the maximum value obtained in DCG@K, which is a list of the best recommendations returned by a user.

k

is the size of the recommendation list.

r e l_{i}

is the relevance of the ith position of the recommendation list.

4.3. Parameter Setting

In this experiment, the proposed GTCF4MB model was implemented using the TensorFlow framework and the model was optimized using the Adam optimizer in the training phase. The default learning rate of the model was 1 × 10⁻³, the decay rate was 0.96, the multi-headed attention was fixed at 2, the regularization coefficients λ were chosen from the set 0.1, 0.05, 0.01, 0.005, 0.001, and batch size was chosen from the set 32, 128, 256.

4.4. Baselines

To comprehensively assess the model performance, the GTCF4MB model was compared with other benchmark research methods with different research objectives in this study via experimentation, as described below.

Neural collaborative filtering methods

NCF [1]: this approach uses neural networks to enhance the recommendation performance of collaborative filtering. In this study, we considered three variants of different modeling approaches for NCF: (1) combining matrix decomposition and multi-layer perceptron methods (i.e., NCF-N); (2) modeling user–item interactions using a multi-layer perceptron (i.e., NCF-M); and (3) performing a fixed product of elements for user and item embeddings (i.e., NCF-G).

Graph neural network collaborative filtering method

NGCF [5]: this method is a graph neural collaborative filtering model that maps user items into potential representations using a model of information transfer with higher-order neighborhood connectivity relations.

Collaborative filtering model based on an autoencoder

AutoRec [3]: this method maps user-object interaction features to potential low-dimensional representations by stacking multiple layers of perceptrons.

Recommendation models for multi-behavior relationships

NGCF_M [5]: this method combines multi-behavioral relational learning under graph neural networks with a graph collaborative filtering model.

NMTR [28]: this approach is a multitask recommendation framework designed as a shared embedding layer for different types of behaviors.

MBGCN [10]: this method utilizes a multiple user–item interaction behavior model and performs behavior-aware embedding propagation through graph convolutional networks.

MB-GMN [11]: this approach extracts user-personalized information through meta-learning and injects it into a graph migration learning-based framework, which allows learning and provides more accurate recommendations for users’ complex interests.

GNMR [13]: this method models different types of user–item interactions under the message-passing structure of graph neural networks and designs a relational aggregation network for embedding propagation over the interaction graph.

4.5. Experimental Results and Analysis

After the experiments, the performance results of each model are shown in Table 1. We first analyzed the performance of the GTCF4MB model compared with the other considered models for the three datasets in Table 1. The results show that the GTCF4MB model outperformed the other models in terms of both the HR@10 and NDCG@10 metrics. It is observed that the performance of the graph-neural-network-based model was better than that of the self-encoder-based model, indicating the rationality of mining higher-order collaborative information on user–item relationships.

In comparing the multi-behavior recommendation models based on the Taobao dataset, the HR@10 metrics of the GTCF4MB model were improved by 32.96%, 9.81%, and 22.22% over the latest baseline models MBGCN, MB-GMN, and GNMR, respectively. The NDCG@10 metrics of the GTCF4MB model improved by 32.71%, 8.64%, and 27.46% over those of the latest baseline models MBGCN, MB-GMN, and GNMR, respectively.

In the Beibei dataset, the HR@10 metrics of the GTCF4MB model improved by 12.01%, 5.24%, and 8.56% over the latest baseline models MBGCN, MB-GMN, and GNMR, respectively. The NDCG@10 metrics of the GTCF4MB model improved by 12.08%, 4.50%, and 6.16% over those of the latest baseline models MBGCN, MB-GMN, and GNMR, respectively.

In the MovieLens dataset, the HR@10 metrics of the GTCF4MB model improved by 10.29%, 2.21%, and 6.31% over the latest baseline models, MBGCN, MB-GMN, and GNMR, respectively. The NDCG@10 metrics of the GTCF4MB model improved by 12.21%, 2.93%, and 7.00% over the latest baseline models, MBGCN, MB-GMN, and GNMR, respectively.

As seen in Table 1, the results obtained from the experiments conducted on the Taobao dataset show that the HR@10 performance metrics were generally lower for all the models. This is probably due to the fact that the Taobao dataset was sparse compared to the other two datasets, and also the metric was generally low in previous studies.

It can also be seen from Table 2 that the GTCF4MB model achieved a relatively improved performance under different @K values with the Beibei dataset. The above results indicate the feasibility of introducing a multi-headed attention mechanism that can model behavior-specific user–item interaction features better than the traditional graph convolutional network based on the synergistic information of higher-order neighborhoods, which can deeply and adaptively determine the preferences of different user behaviors and improve the recommendation performance.

4.6. Ablation Study

To verify the validity of the GTCF4MB model submodule for the experiment, this model was evaluated with the removal of several variables for the three datasets based on HR@10 and NDCG@10:

GTCF4MB-Ti: This variant model of GTCF4MB removes the dynamic encoding layer.
GTCF4MB-At: This variant model of GTCF4MB removes the multi-head attention mechanism.
GTCF4MB-SG: This variant model of GTCF4MB removes the sub-graph generation module.

As shown in Table 3, the GTCF4MB model performed better than the GTCF4MB models with removed variables. Therefore, the following conclusions may be drawn: (1) adding a dynamic coding layer could better capture the dynamic multi-behavior information of users and positively affected the model compared to static fixed processing; (2) the multi-headed attention mechanism had a large impact on the model performance, and removing the multi-headed attention mechanism lead to a significant decrease in the model performance, indicating the importance of this module that adaptively models user items under specific behaviors; and (3) the subgraph generation module mainly affected the number of layers of iterative graph convolution, which is analyzed in detail in Section 4.7.

4.7. Hyperparameter Analysis

To investigate the effectiveness of the GTCF4MB model in deeper hierarchical structures, we increased the depth of the model. Therefore, to verify whether the investigated number of convolutional layers

l

and number of subgraphs

s

that are generated can mitigate the over-smoothing problem, the following experiments were conducted by setting the number of convolutional layers l = 2,3,4,5,6,7 and the number of subgraphs s = 2,3,4. The experimental results are shown in Figure 4.

First, the effect of the number of layers of graph convolution was analyzed. As shown in Figure 4, setting the GTCF4MB model to different subgraph numbers exhibited improved HR@10 values after iterating with more than three or four layers. In contrast, considering the previously developed models evaluated in this study, their performances degraded owing to the over-smoothing problem after the number of iterative graph convolution layers exceeded three or four layers. This indicates that the model developed in this study exhibits a slightly improved performance after undergoing iterative multi-layer convolution, especially when six or seven layers are utilized. However, if the number of layers is increased beyond seven, a node aggregates almost all higher-order information; therefore, in the experiment, the maximum number of layers was set to seven, and continuing to stack layers did not significantly affect the experimental results. The above results show that the developed model can effectively mitigate the over-smoothing problem.

Next, the effect of the number of subgraph generations was analyzed. The results in Figure 4 show that for both datasets, the GTCF4MB_2 model exhibited a better performance when the number of layers did not exceed three; when the number of layers exceeded three, both the GTCF4MB_3 model and GTCF4MB_4 model exhibited improved performances compared to the GTCF4MB_2 model. This occured because when more than three layers are utilized, the amount of information from nodes that other nodes are exposed to during information propagation increases dramatically, and therefore the GTCF4MB_3 and GTCF4MB_4 models with higher numbers of subgraph sets exhibited higher HR values. As the number of layers continued to stack, the GTCF4MB_4 model did not exhibit an improved performance, which may be due to the increased number of subgraphs causing the model to focus more on the extraction of higher-order neighborhood information and ignore the information from lower-order neighborhoods. This may cause the performance of the GTCF4MB_4 model to plateau as number of layers is increased.

5. Conclusions

In this study, we proposed a graph transformer collaborative filtering method for multi-behavior recommendation to achieve an improved recommendation performance for multi-behavior data. This system classifies users with similar preferences into the same group by utilizing a subgraph generation module in an unsupervised manner to propagate node information embedded within the subgraph. This method effectively alleviates the over-smoothing problem caused by increasing the number of graph convolution layers. Simultaneously, dynamic coding layers and a multiheaded self-attention mechanism were introduced to dynamically embed user behavior and item representations during embedding propagation. This permits the proposed method to explore user preferences obtained under different behaviors at a deeper level. In all three datasets, the GTCF4MB model achieved a better performance than all other evaluated models.

In summary, we argued that classifying users with similar preferences in the user–item interaction graph into a subgraph not only avoids the noisy information generated by other users who do not have similar preferences or for other reasons, but also improves the embedding representation of users. Then, we introduced a multi-headed attention mechanism with the aim of mapping the embedding representation into the attention feature space by query vectors, key vectors, and value vectors (i.e., Q, K, V). The results show that the embedding representation can be effectively enhanced by this method.

Considering that most current recommendation models are supervised models that utilize user–item interactions as supervised signals, it is difficult to obtain good recommendation results when the user–item interaction data is sparse. Therefore, in future work, methods such as self-supervised graph learning should be incorporated into recommendation systems to address the problem of sparse data caused by relying on supervised signals.

Author Contributions

Conceptualization, W.Z. and Z.Z.; Investigation, W.Z. and Q.H.; Methodology, W.Z. and Y.H.; Validation, W.Z. and Y.X.; Writing—original draft W.S. and X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Guangdong Science and Technology Planning Project under Grant 2019B010121001 and 2019B010118001, National Key Research and Development Project under Grant 2018YFB1802400.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

These research datasets can be found at https://tianchi.aliyun.com/dataset/dataDetail?dataId=649, https://www.beibei.com and https://grouplens.org/datasets/movielens/ (accessed on 13 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; Chua, T.-S. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 May 2017; pp. 173–182. [Google Scholar]
Xue, H.-J.; Dai, X.; Zhang, J.; Huang, S.; Chen, J. Deep Matrix Factorization Models for Recommender Systems. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, Melbourne, Australia, 19–25 August 2017; Volume 17, pp. 3203–3209. [Google Scholar]
Sedhain, S.; Menon, A.K.; Sanner, S.; Xie, L. Autorec: Autoencoders Meet Collaborative Filtering. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 111–112. [Google Scholar]
Wu, Y.; DuBois, C.; Zheng, A.X.; Ester, M. Collaborative Denoising Auto-encoders for Top-n Recommender Systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, 22–25 February 2016; pp. 153–162. [Google Scholar]
Wang, X.; He, X.; Wang, M.; Feng, F.; Chua, T.-S. Neural Graph Collaborative Filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 165–174. [Google Scholar]
Welling, M.; Kipf, T.N. Semi-supervised Classification With Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
Chen, L.; Wu, L.; Hong, R.; Zhang, K.; Wang, M. Revisiting Graph Based Collaborative Filtering: A Linear Residual Graph Convolutional Network Approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 27–34. [Google Scholar] [CrossRef]
He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; Wang, M. Lightgcn: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 639–648. [Google Scholar]
Yu, L.; Zhang, C.; Pei, S.; Sun, G.; Zhang, X. Walkranker: A Unified Pairwise Ranking Model With Multiple Relations for Item Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
Jin, B.; Gao, C.; He, X.; Jin, D.; Li, Y. Multi-behavior Recommendation With Graph Convolutional Networks. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 659–668. [Google Scholar]
Xia, L.; Xu, Y.; Huang, C.; Dai, P.; Bo, L. Graph Meta Network for Multi-behavior Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Online, 11–15 July 2021; pp. 757–766. [Google Scholar]
Chen, C.; Ma, W.; Zhang, M.; Wang, Z.; He, X.; Wang, C.; Liu, Y.; Ma, S. Graph Heterogeneous Multi-relational Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; 35, pp. 3958–3966. [Google Scholar]
Xia, L.; Huang, C.; Xu, Y.; Dai, P.; Lu, M.; Bo, L. Multi-Behavior Enhanced Recommendation with Cross-Interaction Collaborative Relation Modeling. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 1931–1936. [Google Scholar]
Huang, C.; Xu, H.; Xu, Y.; Dai, P.; Xia, L.; Lu, M.; Bo, L.; Xing, H.; Lai, X.; Ye, Y. Knowledge-Aware Coupled Graph Neural Network for Social Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 4115–4122. [Google Scholar]
Fan, W.; Ma, Y.; Li, Q.; He, Y.; Zhao, E.; Tang, J.; Yin, D. Graph Neural Networks for Social Recommendation. In Proceedings of the International World Wide Web Conferences, Francisco, CA, USA, 13–17 May 2019; pp. 417–426. [Google Scholar]
Wang, H.; Zhang, F.; Zhang, M.; Leskovec, J.; Zhao, M.; Li, W.; Wang, Z. Knowledge-Aware Graph Neural Networks With Label Smoothness Regularization for Recommender Systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 968–977. [Google Scholar]
Wang, X.; He, X.; Cao, Y.; Liu, M.; Chua, T.-S. Kgat: Knowledge Graph Attention Network for Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 950–958. [Google Scholar]
Liu, K.; Xue, F.; Hong, R. RGCF: Refined Graph Convolution Collaborative Filtering With Concise and Expressive Embedding. Intell. Data Anal. 2022, 26, 427–445. [Google Scholar] [CrossRef]
Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying Graph Convolutional Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6861–6871. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; Liu, T.-Y. Do Transformers Really Perform Badly for Graph Representation? Adv. Neural Inf. Process. Syst. 2021, 34, 28877–28888. [Google Scholar]
Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph Transformer Networks. Adv. Neural Inf. Process. Syst. 2019, 32, 11983–11993. [Google Scholar]
Hu, Z.; Dong, Y.; Wang, K.; Sun, Y. Heterogeneous Graph Transformer. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2704–2710. [Google Scholar]
Hu, Y.; Zhan, P.; Xu, Y.; Zhao, J.; Li, Y.; Li, X. Temporal Representation Learning for Time Series Classification. Neural Comput. Appl. 2021, 33, 3169–3182. [Google Scholar] [CrossRef]
Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 30, p. 3. [Google Scholar]
Xia, L.; Huang, C.; Xu, Y.; Dai, P.; Zhang, X.; Yang, H.; Pei, J.; Bo, L. Knowledge-Enhanced Hierarchical Graph Transformer Network for Multi-behavior Recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; 35, pp. 4486–4493. [Google Scholar]
Bai, T.; Zou, L.; Zhao, W.X.; Du, P.; Liu, W.; Nie, J.-Y.; Wen, J.-R. CTrec: A Long-Short Demands Evolution Model for Continuous-Time Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 675–684. [Google Scholar]
Gao, C.; He, X.; Gan, D.; Chen, X.; Feng, F.; Li, Y.; Chua, T.-S.; Jin, D. Neural Multi-task Recommendation From Multi-behavior Data. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macau, China, 8–11 April 2019; pp. 1554–1557. [Google Scholar]

Figure 1. Multi-behavior interaction graph.

Figure 2. Subgraph generation module.

Figure 3. Information embedding propagation module.

Figure 4. Effects of different hyperparameters: (a) performance comparison using different numbers of layers and subgraphs on Beibei; (b) performance comparison using different numbers of layers and subgraphs on Taobao.

Table 1. Performance comparison of the models on the Taobao, Beibei, and MovieLens datasets.

Dataset	Taobao		Beibei		MovieLens
Model	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
NCF-N	0.325	0.201	0.601	0.336	0.801	0.518
NGCF	0.302	0.185	0.611	0.375	0.790	0.508
NGCFM	0.374	0.221	0.634	0.372	0.825	0.546
NMTR	0.332	0.179	0.613	0.349	0.808	0.531
AutoRec	0.313	0.190	0.607	0.341	0.658	0.392
MBGCN	0.362	0.218	0.637	0.371	0.810	0.539
MB-GMN	0.487	0.296	0.686	0.403	0.883	0.596
GNMR	0.420	0.235	0.662	0.396	0.846	0.571
GTCF4MB	0.540	0.324	0.724	0.422	0.903	0.614

Table 2. Performance comparison of all models for different HR@K and NDCG@K on Beibei.

Model	@1		@3		@5		@7
Model	HR	NDCG	HR	NDCG	HR	NDCG	HR	NDCG
NCF-N	0.123	0.123	0.317	0.232	0.447	0.283	0.530	0.315
NGCFM	0.167	0.167	0.376	0.286	0.496	0.337	0.569	0.361
MBGCN	0.167	0.167	0.374	0.284	0.498	0.337	0.579	0.363
MB-GMN	0.183	0.183	0.401	0.306	0.527	0.359	0.608	0.389
GNMR	0.175	0.175	0.384	0.295	0.490	0.341	0.582	0.376
GTCF4MB	0.186	0.186	0.415	0.317	0.545	0.370	0.631	0.402

Table 3. Performance comparison of the GTCF4MB models with different variables.

Model	Taobao		Beibei		MovieLens
Model	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
GTCF4MB-Ti	0.528	0.317	0.711	0.409	0.851	0.572
GTCF4MB-At	0.387	0.236	0.612	0.370	0.816	0.518
GTCF4MB-SG	0.468	0.274	0.621	0.362	0.836	0.557
GTCF4MB	0.540	0.324	0.724	0.422	0.903	0.614

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhu, W.; Xie, Y.; Huang, Q.; Zheng, Z.; Fang, X.; Huang, Y.; Sun, W. Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations. Mathematics 2022, 10, 2956. https://doi.org/10.3390/math10162956

AMA Style

Zhu W, Xie Y, Huang Q, Zheng Z, Fang X, Huang Y, Sun W. Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations. Mathematics. 2022; 10(16):2956. https://doi.org/10.3390/math10162956

Chicago/Turabian Style

Zhu, Wenhao, Yujun Xie, Qun Huang, Zehua Zheng, Xiaozhao Fang, Yonghui Huang, and Weijun Sun. 2022. "Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations" Mathematics 10, no. 16: 2956. https://doi.org/10.3390/math10162956

APA Style

Zhu, W., Xie, Y., Huang, Q., Zheng, Z., Fang, X., Huang, Y., & Sun, W. (2022). Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations. Mathematics, 10(16), 2956. https://doi.org/10.3390/math10162956

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Graph Transformer Collaborative Filtering Method for Multi-Behavior Recommendations

Abstract

1. Introduction

2. Related Work

2.1. Recommendation Methods for Multi-Behavior Models

2.2. Graph Convolution Network

2.3. Transformer

3. Method

3.1. Problem Definition

3.2. Subgraph Generation Module

3.3. Dynamic Encoding Layer

3.4. Information Propagation Layer

3.5. Information Aggregation Layer

3.6. Multi-Behavior Prediction Layer

3.7. Model Complexity Analysis

4. Experiments

4.1. Datasets

4.2. Evaluation Metrics

4.3. Parameter Setting

4.4. Baselines

4.5. Experimental Results and Analysis

4.6. Ablation Study

4.7. Hyperparameter Analysis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI