First, the algorithm uses a distributed web crawler to collect item images and user data from the item pages of e-commerce websites and stores them in the database as the source data of the data pool. The images in the data pool are then preprocessed and input into the MCNN to extract their multi-dimensional features. When a user inputs an item image, the MCNN is likewise used to extract the multi-dimensional features of the target image, yielding its semantic feature description and image feature map. The style dimension of the product image is used as the input of the cross-domain recommendations. At the same time, the image feature map is input into the target alternating attention module, where pixel contextual information is aggregated by image correlation to enhance the item information.
The user social records and historical purchase preferences are fused to obtain the users’ cross-platform data. The long-term and short-term preferences of the user are extracted, and the fusion preferences of the user are obtained by combining the long-term and short-term preferences. The affiliated users are mined from user historical preference items, and the user affiliation network is constructed based on the user social network. When the user interacts with an item, the algorithm automatically determines which affiliated user the item conforms to and adds it to the corresponding part of the user affiliation network to obtain the user multi-dimensional preferences. User preferences and item features are taken as the basis for cross-domain recommendations, and the implicit cross-fusion data between users and items are found through common dimension characteristics in the user affiliation network and image features. Based on the style dimension, this paper migrates the style features to more domains.
3.1. Multi-Channel Convolutional Neural Network Architecture
Traditional convolutional neural networks cannot simultaneously extract features from the multiple dimensions contained in e-commerce images. To address this problem, the network designed in this paper has multiple channels, each of which corresponds to the learning task of a different dimension, improving the ability to extract features from multi-dimensional images.
Each user has their own color or style preferences when shopping (extracted by attention in the user preference extraction section below), so users pay attention not only to the category of an item but also to its color, style, and other information. A multi-attribute image is an e-commerce image that contains multiple dimensions. For example, a skirt can be described as a black preppy skirt, a dress as a white French dress, a T-shirt as a white leisure T-shirt, and a handbag as a white ins handbag. Examples are shown in Table 1. Each dimension describes e-commerce images from a different angle and level [19].
For the three dimensions of an e-commerce image (category, color, and style), the extracted features contribute to the classification of each dimension. Some prior distributions or model parameters can be shared among the learning tasks of these three dimensions and transferred during training. Fine-tuning is a common method in transfer learning: existing parameters are used to initialize the parameters of a new network, so that knowledge learned by the pre-trained model is transferred to other tasks. In this way, the network starts learning from a good initial point, which can greatly reduce the time needed to train new tasks.
The MCNN is shown in Figure 3. The front part of the network, like a traditional network, has four convolutional layers, each followed by a pooling layer. The three dimensions share parameters in these first four convolutional layers. After the fourth pooling layer, the network splits into three channels. Each channel consists of two convolutional layers, one pooling layer, three fully connected layers, and a final Softmax classifier. The first channel is trained and classified according to category, the second according to color, and the third according to style. Because max pooling can capture the texture structure of the image, all pooling layers in the network use max pooling. In the latter part of the network, the three channels use the same network configuration, but they differ in the output of the last fully connected layer: the first channel outputs a 12-dimensional vector, corresponding to the 12 classes of the category dimension; the second channel outputs a 3-dimensional vector, corresponding to the RGB value of the color dimension; and the third channel outputs a 12-dimensional vector, corresponding to the 12 classes of the style dimension. These three vectors are input into three Softmax classifiers for classification; the higher an output value, the greater the probability that the image belongs to the corresponding class.
When an image is input, the three learning tasks are carried out simultaneously; that is, the category, color, and style of the image are predicted at the same time by the three classifiers.
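The three-channel output stage can be illustrated with a minimal sketch. It assumes the shared convolutional trunk has already produced a flat feature vector, and it replaces each channel (two convolutional layers, one pooling layer, three fully connected layers) with a single linear layer for brevity. The class name `ThreeHeadClassifier`, the feature size of 64, and the random weights are hypothetical; only the head sizes (12, 3, 12) come from the text.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class ThreeHeadClassifier:
    """Shared feature vector -> three task-specific heads.

    Stands in for the three MCNN channels after the shared trunk.
    The convolutional layers are omitted (assumption for brevity).
    """

    # Output sizes per channel, as stated in the text.
    HEAD_SIZES = {"category": 12, "color": 3, "style": 12}

    def __init__(self, feature_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per head; untrained, for shape illustration only.
        self.weights = {
            name: rng.normal(scale=0.1, size=(feature_dim, size))
            for name, size in self.HEAD_SIZES.items()
        }

    def predict(self, shared_features):
        """Return one probability vector per task for a single image."""
        return {
            name: softmax(shared_features @ w)
            for name, w in self.weights.items()
        }


model = ThreeHeadClassifier(feature_dim=64)
features = np.random.default_rng(1).normal(size=64)  # hypothetical trunk output
probs = model.predict(features)
predicted = {name: int(p.argmax()) for name, p in probs.items()}
```

All three predictions come from one forward pass over the same shared features, which is the point of sharing the first four convolutional layers.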
This paper focuses on clothing, digital, and other items on JD, Taobao, Amazon, and other platforms. One-hot coding is carried out for the major categories.
The training set is represented as $(x_1, k_1, c_1, s_1), (x_2, k_2, c_2, s_2), (x_3, k_3, c_3, s_3), \ldots, (x_N, k_N, c_N, s_N)$, where $x_i \in \mathbb{R}^{200 \times 200}$, $k_i \in \mathbb{R}^{12}$, $c_i \in \mathbb{R}^{3}$, and $s_i \in \mathbb{R}^{12}$. $x_i$ is a third-order tensor with a size of 200 × 200 for each image. $k_i$ is a 12-dimensional binary vector in which each dimension corresponds to one of the 12 image categories listed in Table 2, including women’s wear, ornaments, etc.; “others” represents item categories that do not belong to the other 11 categories. $c_i$ uses the RGB color mode. $s_i$ is a 12-dimensional binary vector in which each dimension corresponds to one of the 12 image styles listed in Table 3, including simple, retro, etc.; “others” represents item styles that do not belong to the other 11 styles. The one-hot encodings of $k_i$ and $s_i$ are shown in Table 2 and Table 3. $k_i$, $c_i$, and $s_i$ represent the category, color, and style of the image, respectively. For example, when $k_i = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)$, $c_i = (0, 0, 0)$, and $s_i = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)$, the image is a black Korean-style women’s wear image.
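The label layout described above can be sketched in a few lines. The helper names `one_hot` and `make_label` are hypothetical, and only class indices are used, because the full class lists live in Table 2 and Table 3. The example reproduces the black Korean-style women’s wear label from the text.

```python
NUM_CATEGORIES = 12  # classes of the category dimension (Table 2)
NUM_STYLES = 12      # classes of the style dimension (Table 3)


def one_hot(index, size):
    """Return a binary tuple of length `size` with a single 1 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside 0..{size - 1}")
    return tuple(1 if i == index else 0 for i in range(size))


def make_label(category_index, rgb, style_index):
    """Bundle the three label parts: category one-hot, RGB triple, style one-hot."""
    return (
        one_hot(category_index, NUM_CATEGORIES),
        tuple(rgb),
        one_hot(style_index, NUM_STYLES),
    )


# The example from the text: category index 0, RGB (0, 0, 0) = black, style index 4.
category, color, style = make_label(category_index=0, rgb=(0, 0, 0), style_index=4)
```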
All images in the image library are classified by category, color, and style using the MCNN, and the corresponding images in the database are labeled. When a user uploads an image of an item of interest, the MCNN determines the category, color, and style features of the image.
3.2. Target Alternating Attention
To aggregate more complete image information, this paper proposes target alternating attention (TAA). For each central pixel, TAA aggregates the pixel contextual information of the surrounding layers according to the target, which enhances item features, captures important information more efficiently, and improves the integrity and reliability of the semantic feature description.
Target alternating attention collects pixel contextual information in a hierarchical direction to enhance the pixel-level representation ability. A frame diagram of target alternating attention is shown in Figure 4. Each pixel in the image is traversed from top to bottom and from left to right. For each pixel, treated as the center, the contextual information of the surrounding pixels is mined from the inside outward according to the target until the outermost pixels are reached, which supports more accurate recommendation results.
The input image is processed by the MCNN to generate the feature map $X$ with spatial size $h \times w$. Given $X$, we first apply a convolutional layer to obtain the reduced-dimension feature map $H^{(0)}$ and then send this feature map into the target alternating attention. After the first loop, a new feature map $H^{(1)}$ is generated, which aggregates the information of the four pixels at the upper left, upper right, lower left, and lower right. To obtain richer and denser pixel contextual information, we send the feature map $H^{(1)}$ into the target alternating attention again and output the feature map $H^{(2)}$. This process is repeated $R$ times, until the boundary pixels of the image are reached, so that each position in the feature map collects information from all pixels.

Target alternating attention collects information about four pixels per loop. Given the local feature map $H$, this module first applies two convolutional layers with 1 × 1 filters to $H$ to generate two feature maps, $Q$ and $K$, where $Q, K \in \mathbb{R}^{C' \times h \times w}$. $C'$ is the number of channels, which is less than $C$ for dimension reduction.

After obtaining the feature maps $Q$ and $K$, the attention map $M$ is generated through the Affinity operation. At each position $u$ in the spatial dimension of feature map $Q$, the vector $Q_u \in \mathbb{R}^{C'}$ is obtained. At the same time, the set $\Omega_u$ is obtained by extracting four feature vectors from $K$: upper left, upper right, lower left, and lower right, or up, down, left, and right, on each ring from position $u$ outward. $\Omega_{i,u}$ is the $i$th element of $\Omega_u$. The Affinity operation is defined as follows:

$$d_{i,u} = Q_u \Omega_{i,u}^{\top},$$

where $d_{i,u} \in D$ is the feature correlation between $Q_u$ and $\Omega_{i,u}$, $i = [1, \ldots, |\Omega_u|]$. Then, we apply the softmax layer on the channel dimension of $D$ to calculate the attention map $M$.

Another convolutional layer with a 1 × 1 filter is applied to $H^{(0)}$ to generate $V \in \mathbb{R}^{C \times h \times w}$ for feature adaptation. At each position $u$ in the spatial dimension of the feature map $V$, we obtain the vector $V_u \in \mathbb{R}^{C}$ and a set $\Phi_u$. The set $\Phi_u$ is a set of four feature vectors: upper left, upper right, lower left, and lower right, or up, down, left, and right, on each outward loop of $V$. The pixel contextual information is collected by the Aggregation operation:

$$H_u^{(j)} = \sum_{i=1}^{|\Phi_u|} M_{i,u}\, \Phi_{i,u} + H_u^{(j-1)},$$

where $H_u^{(j)}$ is the output feature map $H^{(j)}$ of the $j$th loop at position $u$, and $M_{i,u}$ is the scalar value in $M$ at channel $i$ and position $u$. Adding pixel contextual information to the local feature $H$ enhances the representation of local features and pixels.
Target alternating attention can be unrolled into $R$ loops. In the first loop, it takes the feature map $H^{(0)}$ extracted by the MCNN as input and outputs the feature map $H^{(1)}$, where $H^{(0)}$ and $H^{(1)}$ have the same shape. In the $j$th loop, it takes the feature map $H^{(j-1)}$ as input and outputs the feature map $H^{(j)}$. As shown in Figure 4, target alternating attention is equipped with $R$ loops, which gather pixel contextual information from all pixels to generate new feature maps with dense and rich pixel contextual information.
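One way to read the Affinity and Aggregation steps, together with the R-loop unrolling, is the following sketch. It rests on several assumptions that the text does not state: each 1 × 1 convolution is written as a per-pixel linear map, loop j looks at the ring at distance j from the center, odd loops use the four diagonal neighbors and even loops the four axis neighbors, and neighbors that fall outside the image are skipped. The function names `taa_loop` and `taa` and all weight matrices are hypothetical.

```python
import numpy as np

# Neighbor directions as (row offset, column offset).
DIAGONAL = [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # upper left, upper right, lower left, lower right
AXIS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right


def taa_loop(h, wq, wk, wv, directions, ring):
    """One attention loop: every position attends to up to four pixels on ring `ring`.

    h: feature map of shape (C, height, width); wq, wk: (C', C); wv: (C, C).
    """
    _, height, width = h.shape
    # A 1x1 convolution is a linear map applied independently at each pixel.
    q = np.einsum("dc,chw->dhw", wq, h)
    k = np.einsum("dc,chw->dhw", wk, h)
    v = np.einsum("dc,chw->dhw", wv, h)
    out = h.copy()  # residual term: start from the input feature map
    for y in range(height):
        for x in range(width):
            neighbours = [
                (y + dy * ring, x + dx * ring)
                for dy, dx in directions
                if 0 <= y + dy * ring < height and 0 <= x + dx * ring < width
            ]
            if not neighbours:
                continue  # the ring lies entirely outside the image here
            # Affinity: dot product between the query and each neighbor key.
            scores = np.array([q[:, y, x] @ k[:, ny, nx] for ny, nx in neighbours])
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()  # softmax over the gathered neighbors
            # Aggregation: attention-weighted sum of the neighbor values.
            for weight, (ny, nx) in zip(weights, neighbours):
                out[:, y, x] += weight * v[:, ny, nx]
    return out


def taa(h0, wq, wk, wv, loops):
    """Apply `loops` attention loops, alternating the neighbor pattern (assumption)."""
    h = h0
    for j in range(1, loops + 1):
        directions = DIAGONAL if j % 2 == 1 else AXIS
        h = taa_loop(h, wq, wk, wv, directions, ring=j)
    return h


rng = np.random.default_rng(0)
C, C_REDUCED, SIZE = 4, 2, 5                 # hypothetical sizes
h0 = rng.normal(size=(C, SIZE, SIZE))        # stand-in for the MCNN feature map
wq = rng.normal(size=(C_REDUCED, C))
wk = rng.normal(size=(C_REDUCED, C))
wv = rng.normal(size=(C, C))
h_out = taa(h0, wq, wk, wv, loops=3)
```

The output keeps the input shape in every loop, so the loops can be chained freely. The explicit Python loops are for readability; a practical implementation would vectorize the gather.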
$M^{(j)}$ denotes the attention map of the $j$th loop. For any pixel $(x, y)$ in the image, the mapping function from the position $(x', y')$ to the weight $M_{i,x,y}$ of the pixel is defined as $M_{i,x,y} = f(M, x, y, x', y')$. In loop $j$, for any position $u$ in the feature map $H^{(j)}$, the propagation path of pixel contextual information in the spatial dimension is:

$$H_u^{(j)} \leftarrow f\big(M^{(j)}, u_x, u_y, \theta_x, \theta_y\big) \cdot H_\theta^{(j-1)},$$

where $\leftarrow$ is the add-to operation and $\theta$ is the pixel from which information is collected in each loop of the target.
In e-commerce images, different colors often have specific styles, so after the pixel contextual information of the image is aggregated through the alternating attention, the image style is associated with color and category channels to a certain extent, which is used to enhance the extraction and representation of image style features.
3.3. User Affiliation Network
To make full use of the implicit social relationships between users, the user affiliation network is defined to link users across more dimensions and more efficiently with the help of additional media. By constructing a cross-domain user affiliation network, the algorithm can automatically identify which affiliated user an item belongs to, distinguish affiliated users from the real user, enrich the user information, and produce more accurate and comprehensive recommendations. For users, recommendations from multiple angles and aspects not only improve the diversity of e-commerce recommendations but also help users break out of the information cocoon. For businesses, the accuracy and diversity of the recommender system play a crucial role in users’ e-commerce behavior and determine whether an item can be found by users.
3.3.1. User Preference Combination Framework
Many user interests may change over time and may be triggered by specific contexts or time requirements. The long-term preference sequence of users is very rich and relatively stable, reflecting the overall trend of user interests, but there is a lot of redundant information. The short-term preference sequence of users can more accurately reflect the changes in user interests in a short period, which plays a major role in the prediction, but it can also be easily affected by a single item. To make the recommender system accurately predict user interests, this paper combines long-term and short-term preferences, which can not only accurately grasp the overall trend of user preferences but also effectively reflect the evolution of user demands.
For the item set I interacting with users, GRUs are used to extract user short-term preferences, and different weights are assigned to user long-term preferences and immediate interests for item attributes through attention. Weighted calculation is used to model the final results of short-term interests and long-term preferences at the same time to obtain the vector representation of user preferences. When a user purchases an item, the system adds corresponding descriptions of category, color, and style to the user preference record according to the text description of the item in the item library and updates the user preference in time.
Among all the historical behaviors of a user, only some effectively affect the predicted attributes of the currently recommended items, and each historical behavior has a different impact on and contribution to user preferences. Therefore, attention is adopted to extract the importance of each historical behavior to the current recommendation, that is, the difference in weight.
For example, during festivals, people buy items that fit the festive atmosphere but are rarely purchased at ordinary times, and these items may differ in category, color, and style from the items the user usually purchases. The attributes of these items have a negative impact on the accuracy of predicting user preferences. Therefore, when attention is used to extract user long-term preferences, less attention (that is, less weight) is given to items with festive significance during non-festive periods. During festive periods, more attention and greater weight are given to items that fit the occasion. Similarly, since many products are only used during certain seasons, the attention assigns different weights depending on the season and item information.
$F_u(t)$ represents the festive and seasonal attributes. Different users have different purchase preferences in different festivals and seasons, and $F_u(t)$ is personalized for each user to indicate whether the user has a special purchase preference at a certain time $t$, such as a birthday. Then, the probability $p_{u,i}$ of user $u$ buying a certain item $i$ is a function of time $t$:

$$p_{u,i}(t) = f\big(F_u(t),\, n_i(t)\big),$$

where $n_i(t)$ is the quantity of item $i$ purchased by all users at time $t$; the larger $n_i(t)$ is, the more likely users are to buy similar items again at time $t$.

Taking the user short-term preference $h$ extracted in chronological order as the input of the attention, the influence weight $\alpha_i$ of each item in the historical behavior on the current recommendation is a function of time $t$:

$$\alpha_i(t) = g(h, i, t).$$

The user long-term preference $l$ is a function of the user’s historical item behaviors and short-term preferences:

$$l = \psi(I, h).$$

The user long-term preference $l$ and the user short-term preference $h$ are combined in a non-linear manner to obtain the fused long-term and short-term preference $z$:

$$z = \sigma(l, h),$$

where $\sigma$ is a non-linear function.
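The weighting and fusion steps can be sketched as follows. This is an illustration under stated assumptions rather than the paper’s exact formulation: the GRU is omitted and its final state is replaced by a given vector, attention scores are dot products between each history item and the short-term preference, the festive and seasonal effect is an additive bias on the scores, and the non-linear fusion is a tanh over a linear map of the concatenated vectors. All names, sizes, and weights are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def long_term_preference(item_embeddings, short_term, season_bias):
    """Attention-weighted summary of the interaction history.

    item_embeddings: (T, d) history of interacted items.
    short_term: (d,) short-term preference, used as the attention query.
    season_bias: (T,) additive score; negative for out-of-season items.
    """
    scores = item_embeddings @ short_term + season_bias
    weights = softmax(scores)
    return weights @ item_embeddings, weights


def fuse(long_term, short_term, w, b):
    """Non-linear fusion: tanh of a linear map of the concatenated preferences."""
    return np.tanh(w @ np.concatenate([long_term, short_term]) + b)


rng = np.random.default_rng(0)
history = rng.normal(size=(5, 4))      # five past items, embedding size 4
short_term = rng.normal(size=4)        # stand-in for the final GRU state
# The third item is festive and the current period is non-festive: lower its score.
season_bias = np.array([0.0, 0.0, -5.0, 0.0, 0.0])

long_term, weights = long_term_preference(history, short_term, season_bias)
fused = fuse(long_term, short_term, rng.normal(size=(4, 8)), np.zeros(4))
```

An additive bias keeps the weights a proper distribution while still letting a festive item regain influence when its bias is raised during the festival.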
Since the user long-term preference reflects the user’s preferred style, this style can be identified while extracting the long-term preference and compared with the style of the item image during recommendation. The cross-fusion data of users and items are extracted below for cross-domain recommendations.
3.3.2. User Affiliation Network
The schematic diagram of the affiliation network is shown in Figure 5. For each user, the affiliated users include both indirect and virtual users. In Figure 5, the pink figure denotes the real user, and the gray figures denote the affiliated users. If a user buys an item that does not fit their own age profile, a virtual user is built for them. A preference matrix of affiliated users is constructed for each real user node. When the user purchases an item, the algorithm automatically identifies which node the item belongs to according to the item information. When making recommendations, in addition to the user’s own favorite items, the user can also be recommended items that match the preferences of their affiliated users. For example, if a middle-aged woman buys a children’s dress, an affiliated user representing a daughter is added to her affiliation network. The preference information of this affiliated user is recorded according to the color and style of the dress, and children’s items with the same or a similar style are recommended to the user.
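The assignment of a purchase to the user or to a virtual affiliated user can be sketched as below. The rule used here is a deliberate simplification and an assumption: each item carries a hypothetical `target_group` tag, and a purchase whose tag differs from the owner’s own group is recorded under an affiliated node named after that tag. The tag values, class names, and example items are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    target_group: str  # hypothetical demographic tag, e.g. "adult_female"
    color: str
    style: str


@dataclass
class AffiliationNetwork:
    owner_group: str
    # Maps a node name ("self" or an affiliated identity) to its
    # recorded (color, style) preferences.
    preferences: dict = field(default_factory=lambda: {"self": []})

    def record_purchase(self, item):
        """Attach the purchase to the matching node, creating the node if needed."""
        node = "self" if item.target_group == self.owner_group else item.target_group
        self.preferences.setdefault(node, []).append((item.color, item.style))
        return node


# The example from the text: a middle-aged woman buys a children's dress.
network = AffiliationNetwork(owner_group="adult_female")
node = network.record_purchase(Item("children's dress", "child_female", "pink", "sweet"))
network.record_purchase(Item("white French dress", "adult_female", "white", "French"))
```

After these two purchases the network holds one affiliated node for the daughter alongside the user’s own node, and each node accumulates its own color and style preferences.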
Definition 1. Virtual users. If a real user purchases an item that does not belong to him or her, a corresponding affiliated user is constructed, such as father, mother, child, friend, etc. Let $v$ represent the affiliated user node; then, the affiliated user node is characterized by:

$$I_v = E\big(I_u \setminus I_u^{\mathrm{id}}\big),$$

where $E$ is the extraction operation, $I_u$ is the set of all items purchased by user $u$, $I_u^{\mathrm{id}}$ is the set of items purchased by user $u$ that conform to the user’s identity information, and $I_v$ is the set of items selected for the same affiliated user.

Definition 2. Indirect user. For user $u$, if there is another user $u'$ with the same or similar preferences as $u$ in the category, color, and style of items, $u'$ is the indirect user of $u$, represented by ind; then, the indirect user node ind of user $u$ is:

$$\mathrm{ind}_u = \{\, u' \mid P_{u'} \approx P_u \,\},$$

where $P_u$ denotes the preferences of user $u$ and $\approx$ indicates similar preferences.

The preference relationships of the indirect and virtual users relative to the user are the affiliated relationships. The affiliated relationships of indirect users follow the idea of collaborative filtering: if users have similar preferences and no social relationship, they are indirect users of each other. The affiliation of a virtual user is the information about what the user has purchased for them. When making recommendations, the algorithm considers not only the user but also the user’s affiliated relationships as part of the user data, and makes recommendations for them separately from the user’s own preferences.
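Definition 2 can be illustrated with a small sketch. The similarity measure and the threshold are assumptions: Jaccard similarity over each user’s set of preferred category, color, and style tags stands in for “similar preferences”, and users who already have a social relationship are excluded, following the collaborative filtering remark above. The function names and the example data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def indirect_users(user, preferences, friends, threshold=0.5):
    """Users with similar preferences and no social relationship to `user`."""
    own = preferences[user]
    return sorted(
        other
        for other in preferences
        if other != user
        and other not in friends.get(user, set())
        and jaccard(own, preferences[other]) >= threshold
    )


preferences = {
    "u1": {"dress", "white", "French"},
    "u2": {"dress", "white", "preppy"},    # shares two of four distinct tags with u1
    "u3": {"handbag", "black", "retro"},   # shares nothing with u1
    "u4": {"dress", "white", "French"},    # identical to u1, but already a friend
}
friends = {"u1": {"u4"}}

result = indirect_users("u1", preferences, friends)
```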
Definition 3. Increasing preference. The increasing preference tracks changes in the interests of affiliated users: it recommends items that they continue to like according to their historical preferences and tests their interest with a small number of novel items that have never appeared before, which is the key to interacting with and influencing the surrounding users.
Whether between users or between users and affiliated users, purchase behavior is crucial to the generation of and changes in the relationship chain between users. The relationship chain means that real users have the same preferences, real users buy items for affiliated users, and the identity of affiliated users is consistent with the algorithm’s judgment.
Through the items purchased by users for different affiliated users, the preference information of affiliated users is obtained and, based on this, the user affiliation network is constructed to enhance the user preferences.
Definition 4. Affiliated user preference attributes. For each item in the set of items purchased by the user for others, the algorithm judges which affiliated user the item belongs to according to the item features and adds the item to the corresponding attribute; the preference attribute of an affiliated user is then formed from the features of the items assigned to that user.

In the user affiliation network, the interaction between users and the purchase of items for affiliated users are explicit relationships, while the relationships between users with the same preferences and between affiliated users with the same identity are implicit.
Definition 5. User affiliation network. The user affiliation network, with nodes such as self, parents, friends, and children, is constructed according to the age and purchase information of real users. Each affiliated user is a node of the network, where $G$ stands for the affiliation network and $u$ stands for the real user node. Then, the user affiliation network is:

$$G_u = (g_{ab}), \quad g_{ab} \in \{0, 1, 2\},$$

where 1 represents a solid line in the figure, which means that the nodes have a social relationship; 2 represents a dotted line in the figure, which means that the node is an indirect or affiliated user of the user; and 0 means that there is no edge in the figure.

The preferences of affiliated users also affect the purchase demand of users. The user knowledge is enhanced according to the preferences of affiliated users in order to predict the preferences of other users with the same affiliated users. Items purchased by the user are recommended to those other users, while items that match the identity of, or complement the items of, the affiliated users are recommended to the user.
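The edge coding of Definition 5 can be sketched as an adjacency matrix. The symmetric layout and the rule that a social edge takes precedence over an indirect or affiliated edge are assumptions. The function name and the node names are hypothetical.

```python
def build_affiliation_matrix(nodes, social_edges, affiliation_edges):
    """Adjacency matrix with 1 = social edge, 2 = indirect/affiliated edge, 0 = no edge."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in affiliation_edges:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = 2
    # Social edges are written last so that they take precedence (assumption).
    for a, b in social_edges:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = 1
    return matrix


nodes = ["user", "friend", "daughter", "indirect"]
matrix = build_affiliation_matrix(
    nodes,
    social_edges=[("user", "friend")],
    affiliation_edges=[("user", "daughter"), ("user", "indirect")],
)
```

Row 0 of the result lists the real user’s relationships: a solid edge to the friend and dotted edges to the virtual and indirect users.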
Based on the user and item dimension characteristics, the category, color, and style characteristics extracted above are predicted. In the affiliation network, the affiliated user is considered to be an independent entity in the recommendations, and user preferences are mined from the dimension perspective to make user preferences clear, which is conducive to more accurate discovery of user interests.
Definition 6. User linkage relationship. $v_1$ is the affiliated user of $u_1$, $v_2$ is the affiliated user of $u_2$, and $v_1$, $v_2$ are affiliated users of the same identity. Then, when $u_1$ buys item $i$ for $v_1$, the algorithm automatically recommends $i$ to $u_2$ and recommends the matching item to $u_1$. For example, if $u_1$ is a 30-year-old female, $v_1$ and $v_2$ are children, and $u_1$ buys ice skates for $v_1$, sports clothes are recommended to $u_1$ and ice skates to $u_2$:

$$u_2 \leftarrow f(i), \qquad u_1 \leftarrow g(i),$$

where $f$ and $g$ are non-linear relations.

3.3.3. User Item Joint Recommendations
The phenomenon of information cocoon is quite common in recommender systems. The main reason is that individuals pursue personalized subjective needs, and the development of algorithm recommendation technology makes it more obvious.
User IDs and item IDs are encoded as one-hot, and the attribute data of the item are encoded as multi-hot, meaning that an item may correspond to multiple dimensions.
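The difference between the two encodings can be shown in a short sketch. The vocabularies below are hypothetical and far smaller than a real ID or attribute space; the item uses the “white ins handbag” description from Section 3.1.

```python
def one_hot(value, vocabulary):
    """Exactly one position is set: the position of `value` in `vocabulary`."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if v == value else 0 for v in vocabulary]


def multi_hot(values, vocabulary):
    """One position is set for every attribute the item carries."""
    present = set(values)
    unknown = present - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown values: {sorted(unknown)}")
    return [1 if v in present else 0 for v in vocabulary]


user_ids = ["u1", "u2", "u3"]                                             # hypothetical
attributes = ["skirt", "handbag", "black", "white", "preppy", "ins"]      # hypothetical

user_vector = one_hot("u2", user_ids)
item_vector = multi_hot(["white", "ins", "handbag"], attributes)
```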
Definition 7. User item joint recommendations. Similar attributes between the target and source users and items are mined, and items with similar style attributes are recommended to users with the same or similar attributes, as shown in Figure 6. For example, if the target user is a 30-year-old female whose historical preference is a simple bag, and there is another 30-year-old female in the user database who likes a simple dress, the simple dress is recommended to the target user:

$$u_t \leftarrow i_s(c_s, s_s), \quad \text{if } u_t \approx u_s \text{ and } s_t \approx s_s,$$

where $u_s$ is the source user, $c_s$ is the category preference of the source user, $s_s$ is the style preference of the source user, $u_t$ is the target user, $s_t$ is the style preference of the target user, and $i_s$ is the source item. Jaccard distance is used to calculate the similarity between the target user and the source user and between the target item and the source item, and the Top-n recommendation is generated according to the similarity.
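A minimal sketch of this joint recommendation follows, using the example from the definition. It assumes that users are described by sets of attribute tags, that Jaccard similarity (one minus the Jaccard distance) compares users, and that style matching between items is exact. The function names, the `min_similarity` threshold, and the data are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets; equals 1 minus the Jaccard distance."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(target, users, n=3, min_similarity=0.3):
    """Top-n items from similar source users whose style matches the target's styles.

    users maps a user id to {"attrs": set of tags, "items": [(name, style), ...]}.
    """
    target_styles = {style for _, style in users[target]["items"]}
    owned = {name for name, _ in users[target]["items"]}
    scores = {}
    for uid, data in users.items():
        if uid == target:
            continue
        similarity = jaccard_similarity(users[target]["attrs"], data["attrs"])
        if similarity < min_similarity:
            continue  # the source user is too different from the target user
        for name, style in data["items"]:
            if style in target_styles and name not in owned:
                scores[name] = max(scores.get(name, 0.0), similarity)
    # Highest similarity first; ties are broken alphabetically.
    return sorted(scores, key=lambda name: (-scores[name], name))[:n]


users = {
    "target": {"attrs": {"female", "age_30"}, "items": [("simple bag", "simple")]},
    "source": {
        "attrs": {"female", "age_30"},
        "items": [("simple dress", "simple"), ("retro coat", "retro")],
    },
    "other": {"attrs": {"male", "age_50"}, "items": [("simple watch", "simple")]},
}

top_items = recommend("target", users)
```

The retro coat is dropped because its style does not match, and the simple watch is dropped because its owner shares no attributes with the target user.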