1. Introduction
Information overload is a major issue that users experience when choosing and purchasing products, as it prevents them from easily discovering items that match their preferences. The role of recommender systems (RS) in supporting users to overcome this problem is unquestionable. These systems provide relevant recommendations to users and create better exposure for items. The usefulness of RS is evidenced by their increasing incorporation in multiple areas, including e-commerce platforms, social networks, lifestyle apps, and so on [1,2,3,4,5].
Much effort has been devoted to addressing the problems affecting RS and improving the performance of recommendations. At the same time, researchers are focusing on mitigating different types of biases, which generally results in a decrease in performance. Bias shortcomings, commonly observed in machine learning models, lead to various forms of discrimination. Bias and fairness issues are among the most critical problems that RS can face. The underlying imbalances and inequalities in the data can introduce bias into RS while they learn patterns from these historical data [6,7]. This translates into biased recommendations that influence users' consumption decisions. In addition, the items consumed as a result of these recommendations are incorporated into the data used to generate new models, which makes the data increasingly biased and worsens the problem of unfair recommendations. However, the data are not the only cause of these inequalities; the design of the algorithms used can also result in bias and automatic discrimination in decisions [7,8].
Due to the widespread adoption of machine learning technologies in society, the negative repercussions of biased models cover a wide range of aspects, including economic, legal, ethical, and security issues, which are detrimental to companies [6,9,10,11,12]. Moreover, bias can lead to user dissatisfaction [13]. Apart from these reasons, some international regulations, such as those of the European Union, include the obligation to minimize the risk of biased results in artificial intelligence systems that support decision-making in critical areas [9]. Decisions based on these results often have ethical implications, in the Aristotelian sense of fairness, and may involve discriminatory treatment of minority or marginalized groups. For instance, some studies indicate that in the music recommendation field, female and older users receive poorer-quality recommendations [14]. The segments of the population that suffer the negative consequences of bias are the so-called protected groups, on which mitigation strategies should focus. All of these facts have driven the research currently being carried out in this field [5,15,16,17].
Deep learning techniques in the RS area, which are used to improve performance [18,19,20], have led to bias amplification in recommendations due to the greater propensity of these methods to magnify this problem. Within this approach, methods based on graph neural networks (GNNs) perform well in this application domain, although they are still affected by bias. Some studies have shown that graph structures and the message-passing mechanism used inside GNNs promote the amplification of unfairness and other social biases [21,22]. Moreover, in most social networks with a graph structure, nodes with similar sensitive attributes are more likely to be connected than nodes with different sensitive attributes (e.g., young individuals are more likely to become friends with each other on social networks). This phenomenon creates an environment in which nodes with similar sensitive features receive similar representations from the aggregation of neighbors' features in the GNN, while different representations are produced for nodes with different sensitive features, leading to significant bias in the decision-making process [22].
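To make this mechanism concrete, the following minimal sketch (our own illustration, not taken from the cited studies) propagates random node features over a small homophilous toy graph using simple mean aggregation; the graph, feature dimension, and group labels are all assumptions chosen for illustration. After a few propagation steps, nodes belonging to the same densely connected group tend to end up with more similar representations than nodes from different groups, which is the effect described above.

```python
# Illustrative sketch only: mean-aggregation message passing on a toy
# friendship graph with homophily w.r.t. a sensitive attribute.
import numpy as np

# Toy graph: nodes 0-2 form one group, nodes 3-5 another; edges mostly
# connect nodes within the same group, with a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
adj = np.zeros((n, n))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0
adj += np.eye(n)                                   # self-loops, as in GCN-style layers
norm_adj = adj / adj.sum(axis=1, keepdims=True)    # row-normalized: mean aggregation

rng = np.random.default_rng(0)
h = rng.normal(size=(n, 4))                        # random initial node features

for _ in range(3):                                 # three propagation steps
    h = norm_adj @ h                               # each node averages its neighborhood

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity tends to be higher within a group than across groups.
print("within group A:", cos(h[0], h[1]))
print("within group B:", cos(h[3], h[4]))
print("across groups :", cos(h[0], h[4]))
```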
There are various types of GNN-based RS, e.g., for conventional recommendations and for sequential or session-based recommendations [23]. The most widespread are the graph convolutional network (GCN), graph attention network (GAT), and graph recurrent network (GRN). These approaches have performed better than other machine learning methods in RS [24,25]. However, due to the aforementioned propensity to bias, one important challenge is addressing the multiple types of biases that can dramatically affect RS. This requires a prior identification and evaluation process that depends, among other factors, on the objectives pursued and the area of application. In the RS field, the characteristics of the algorithms and the objectives may be very different from those of other machine learning fields, so many of the commonly used bias evaluation metrics are not suitable for assessing bias in recommendations. Some fairness metrics evaluate whether predictions are similar for protected and unprotected groups. However, in RS, users' preferences for items are predicted, and these preferences need not be similar across groups. For example, the music that young people listen to differs from the music that older people listen to, women may prefer different types of movies than men, and so on. In addition, many of the RS-specific metrics focus on detecting biases in the recommended items rather than unfairness to system users; for example, they evaluate whether songs by male artists are recommended in similar numbers to songs by female artists.
In this study, we focus on user-centered fairness metrics and the behavior of GNN-based RS with respect to them. Our goal is to evaluate which metrics are the most appropriate for recommender systems, in order to help select the best strategy according to the objectives. We consider bias metrics at the individual user level, as well as fairness metrics focused on detecting and quantifying discriminatory treatment of groups of users. Groups formed on the basis of two sensitive attributes, gender and age, were studied. In addition, we examined the extent to which the level of bias is related to the precision of the recommendation models and to other quality measures applicable to the item recommendation lists.
The detection of bias in RS is particularly important due to its influence on user decisions and the progressive worsening of the problem that this causes. As previously indicated, the recommendations provided by these systems lead users to consume certain items. These consumption data become part of the datasets used to generate new recommendation models, which will be increasingly biased if there is bias in the initial models.
Bias evaluation of GNN results has been widely reported in the literature; however, work in the context of GNN-based recommender systems is very scarce. Moreover, most studies do not take into account the suitability of the assessment metrics with respect to the objective. This aspect is especially important in the field of RS because the features used to detect bias in other application domains are precisely features that influence user preferences and therefore produce different results for different groups of users without implying any bias. In addition, the impact of minimizing fairness bias on other dimensions of recommendation quality is omitted in many papers on the subject. Bearing this in mind, the main contributions of this work are as follows:
Review of the state-of-the-art concerning the evaluation and mitigation of bias in the field of machine learning, especially in the areas of recommender systems and GNN-based methods.
Study of the adequacy of the metrics for evaluating fairness bias that affects users of recommender systems regarding discrimination between different groups.
Examination of the behavior of GNN-based recommendation methods against the aforementioned fairness bias and comparison with other recommendation approaches.
Analysis of the relationship between the values of the fairness metrics and the classic metrics for evaluating the quality of the recommendation lists (precision, recall, mean reciprocal rank, etc.) since the mitigation of biases usually results in a worsening of the recommendation quality evaluated by these measures.
This study is intended to answer the following research questions (RQs) about bias amplification in GNN-based approaches.
RQ1: Can the findings reported in the literature in the general context of machine learning be extended to the specific field of recommender systems?
RQ2: Does the performance of GNN-based recommendation methods against biases depend on dataset characteristics and sensitive attributes?
RQ3: Are all bias evaluation metrics appropriate for assessing user-centered fairness in recommender systems in all application domains?
RQ4: Do less bias-prone methods always provide lower-quality recommendations?
The work is organized as follows. After this introductory section, the state-of-the-art is discussed at different levels, namely machine learning (ML), GNN algorithms, RS, and GNN-based RS. In the next section, the experimental study is described, including the methodology, the datasets used, the recommendation methods, and the evaluation metrics, which are explained in further detail. The following sections are devoted to the presentation and discussion of the results. Finally, the last section provides conclusions and future work.
4. Results of the Experimental Study
This section presents the values of the bias metrics obtained by applying the recommendation methods described in the previous section to the three datasets with sensitive attributes, whose exploratory study was presented previously. These real-world datasets are MovieLens 100K, LastFM 100K, and Book Recommendation. The detailed results allow us to test not only whether GNN-based recommendation methods amplify biases in the data compared to more traditional approaches, but also whether their behavior is similar across all types of biases. Since this work is especially focused on user-centered fairness, the main objective is to examine the performance of GNN-based methods with respect to metrics appropriate for this type of bias. An important aspect of the analysis is to determine which models provide a better balance between accuracy and bias, since improving one usually has a negative influence on the other.
First, the results of the metrics applicable to top-K item recommendation lists are shown. They were obtained for different values of K (5, 10, and 15). Subsequently, the results of the fairness metrics applicable to rating prediction are presented.
Table 7, Table 8, and Table 9 contain the results of the list metrics obtained from the application of the eight recommendation methods to the MovieLens, LastFM, and Book Recommendation datasets, respectively.
The results of the fairness metrics based on sensitive attributes are shown below. These metrics are differential fairness (DF), value unfairness (VU), absolute unfairness (AU), and non-parity unfairness. The gender attribute was studied in the MovieLens and LastFM datasets (Table 10 and Table 11) and the age attribute in the three datasets (Table 12, Table 13, and Table 14). The values considered for gender were male and female, and for age, 30 years old or younger and over 30 years old.
To facilitate the comparative analysis of the results and to obtain a better insight into the behavior of the algorithms, the data in the tables are graphically represented in the following figures.
First, the metrics related to the quality of the recommendation lists are shown.
Figure 11 shows the recall values of all algorithms for item recommendation lists of three different sizes (values of K). In the same way, the precision results are displayed in Figure 12. Figure 13 illustrates the values of the MRR metric, Figure 14 shows those of the HIT measure, and Figure 15 shows NDCG.
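For reference, the list-quality metrics reported in these figures can be computed per user as sketched below. This is a generic illustration with binary relevance; the function name and toy data are our own assumptions and not the exact implementation used in the experiments.

```python
# Illustrative sketch: top-K list metrics for one user, given a ranked
# recommendation list and the set of relevant (held-out) items.
import math

def list_metrics(ranked, relevant, k):
    topk = ranked[:k]
    hits = [1 if item in relevant else 0 for item in topk]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    hit = 1.0 if any(hits) else 0.0
    # MRR: reciprocal rank of the first relevant item in the top-K list.
    mrr = 0.0
    for rank, h in enumerate(hits, start=1):
        if h:
            mrr = 1.0 / rank
            break
    # NDCG@K with binary relevance.
    dcg = sum(h / math.log2(rank + 1) for rank, h in enumerate(hits, start=1))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return {"precision": precision, "recall": recall,
            "hit": hit, "mrr": mrr, "ndcg": ndcg}

# Example: items 3 and 7 are relevant; the model ranked item 7 second.
print(list_metrics(ranked=[5, 7, 1, 3, 9], relevant={3, 7}, k=5))
```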
Although this study is more oriented toward measuring biases that may have an impact on the unfair treatment of users, we also analyze biases related to the popularity and diversity of recommendations, since they affect users individually. Within this category, we include the Gini index, which measures the diversity of the distribution of items in the recommendation lists. Likewise, the classic metrics of item coverage and average popularity are considered in this group. The Gini index is shown in Figure 16, item coverage in Figure 17, and average popularity in Figure 18.
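The following sketch shows, under our own simplifying assumptions (a small catalogue, top-K lists as plain Python lists, and item popularity counted from training interactions), one common way of computing these three individual-level metrics; exact definitions and normalizations may differ from the implementation used in the experiments.

```python
# Illustrative sketch: Gini index, item coverage, and average popularity
# computed over a set of top-K recommendation lists.
from collections import Counter

def gini_index(rec_lists, n_items):
    """Gini index of item exposure across all lists; higher values indicate
    a more unequal exposure distribution over the catalogue."""
    counts = Counter(i for lst in rec_lists for i in lst)
    freqs = sorted(counts.get(i, 0) for i in range(n_items))  # zeros included
    total = sum(freqs)
    if total == 0:
        return 0.0
    cum = sum((2 * rank - n_items - 1) * f for rank, f in enumerate(freqs, start=1))
    return cum / (n_items * total)

def item_coverage(rec_lists, n_items):
    """Fraction of the catalogue appearing in at least one recommendation list."""
    return len({i for lst in rec_lists for i in lst}) / n_items

def average_popularity(rec_lists, train_items):
    """Mean popularity (training-set interaction count) of recommended items."""
    pop = Counter(train_items)
    recs = [i for lst in rec_lists for i in lst]
    return sum(pop[i] for i in recs) / len(recs)

# Toy usage: catalogue of 6 items (0-5), three users' top-3 lists,
# and the item column of the training interactions.
lists = [[0, 1, 2], [0, 1, 3], [0, 2, 4]]
train_items = [0, 0, 0, 0, 1, 1, 2, 3, 4, 5]
print(gini_index(lists, n_items=6),
      item_coverage(lists, n_items=6),
      average_popularity(lists, train_items))
```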
Finally, we present the graphs corresponding to the metrics aimed at assessing the fairness of the recommendations across user groups. In our study, the groups are based on the gender and age sensitive attributes. All of these metrics are computed from the ratings predicted by the models; therefore, they do not apply to recommendation lists.
Below, the results of the fairness metrics for the gender attribute are provided. These results correspond to the MovieLens and LastFM datasets, which are the two datasets that contain this attribute.
We start with the visualization of the differential fairness values in Figure 19. Next, Figure 20 illustrates the absolute unfairness and value unfairness results. Finally, Figure 21 shows non-parity unfairness.
After presenting the values of the fairness metrics for the gender sensitive attribute, we move on to the visualization of the results corresponding to the age attribute, whose values were divided into two intervals. This attribute is present in all three datasets studied.
Figure 22 shows the results of differential fairness, Figure 23 shows those corresponding to value unfairness and absolute unfairness, and Figure 24 shows the non-parity unfairness values.
5. Discussion of Results
Recent studies in the literature have addressed the problem of bias and fairness in recommender systems, although the focus on GNN-based methods is more limited. Some studies of GNN applications in other domains have concluded that these approaches increase performance (in terms of prediction quality) but accentuate biases compared to other methods. However, it is not known whether these conclusions can be extended to the field of RS and to all types of biases specific to this area. In this section, we answer these and the other research questions formulated in the introduction to this work, after analyzing the results of the extensive study presented in the previous sections. This provides insight into an issue of great relevance in the context of recommender systems.
We analyze the performance of GNN-based recommendation methods against different types of biases in comparison with other classical methods, as well as the impact on the quality (precision, recall, etc.) of the recommendation lists, since one of the goals of bias mitigation is to keep accuracy as high as possible.
Since our study is mainly aimed at user-centered fairness, we first differentiate between individual-level metrics and group-level metrics. In the former category are the Gini index, item coverage, and average popularity; in the latter are differential fairness (DF), value unfairness (VU), absolute unfairness (AU), and non-parity unfairness. These will be discussed in terms of their appropriateness for the RS field.
One observation of this study is the disparate behavior of the algorithms across the different datasets in terms of quality metrics for item recommendation lists (precision, recall, NDCG). This is to be expected, because it has been shown that the accuracy of GNN-based methods largely depends on the characteristics of the datasets. In our case, the number of records is similar across datasets, but they vary in the number of users and items to be recommended.
Regarding the bias metrics at the individual level, the results vary between datasets, although not as much as for the previous metrics.
In the context of our work, the Gini index should have high values, which express high inequality and therefore great diversity in the recommendations. The results of our study show that an increase in the values of this metric is accompanied by a decrease in precision and in other quality metrics of the recommendations. LightGCN is the model that provides the highest Gini values in the MovieLens and LastFM datasets and median values in the Book dataset. However, with this model, the lowest values of NDCG, HIT, and MRR are obtained in the MovieLens and Book datasets, while medium-low values of these metrics are obtained in the LastFM dataset. The behavior in relation to precision and recall is similar. In contrast, NGCF, which is the model that yields the best precision and recall results in MovieLens, has medium-high Gini values in this dataset. In the other two datasets, NGCF presents medium-low values of the Gini index and very low precision and recall values. The results for MRR, HIT, and NDCG are similar to those for precision and recall. DGCF gives better Gini results than NGCF, but its precision, recall, MRR, HIT, and NDCG values are among the worst. The conclusion that can be drawn from this is that the degree of the negative impact of a high Gini value on accuracy and analogous metrics can differ greatly depending on the algorithm.
Item coverage is another bias metric that affects users individually, since the lower the coverage, the lower the probability that users will receive recommendations of items that they might like. The NGCF models give the highest values of this metric with the LastFM and Book datasets, where their precision and recall are rather low. Their coverage is also quite high with the MovieLens dataset, although there they are surpassed by NeuMF and SGL; with this dataset, the highest precision and recall were achieved. DGCF achieves very good coverage in the Book dataset and medium coverage in the remaining datasets, while, as has already been seen, it is the worst in terms of the quality of the recommendation lists. LightGCN presents the lowest value with the MovieLens dataset and medium-high values with the other two datasets. Therefore, the change in the performance of the models depending on the objective of the evaluation, which can be either the bias or the accuracy of the recommendations, is confirmed here.
Regarding the average popularity metric, the goal is to achieve low values so that unpopular items are also recommended. The SGL models are the ones with the most uniform behavior across the three datasets, providing very low values for this metric. NGCF achieves the lowest value among all models with the Book dataset, while its value is medium with MovieLens and LastFM. DGCF models provide good results with the Book and LastFM datasets but one of the worst values with MovieLens. This confirms once again that the gain in precision amplifies the biases, but not always to the same degree.
In general, we can say that LightGCN is the GNN-based algorithm that exhibits the most irregular behavior across datasets and list sizes. Similarly uneven performance is found in the classical approaches of collaborative filtering and matrix factorization. Within these classical methods, the one that almost always achieves good precision, recall, MRR, HIT, and NDCG values is NeuMF, but its results in terms of bias metrics are very irregular.
Regarding biases at the level of individual users, we can conclude that, although the ranking of the models changes depending on whether the quality of the lists or the biases are evaluated, the GNN-based algorithms are not the worst positioned in relation to biases; rather, some of them reach good Gini index, coverage, and average popularity values.
We next examine the group-level fairness metrics based on sensitive attributes, such as gender and age. The metrics used in this study are differential fairness (DF), value unfairness (VU), absolute unfairness (AU), and non-parity unfairness. Among them, it is important to differentiate between two approaches. The first includes DF and non-parity, whose objective is to find differences in predicted ratings between user groups. The objectives of VU and AU are different, since they focus on measuring and comparing the quality of recommendations for each group. We consider the latter to be the most appropriate in the domain of recommender systems, since the attributes used to create the groups (in our case, gender and age) have a proven influence on user preferences. Therefore, the fact that groups receive different recommendations need not be unfair or discriminatory.
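As an illustration of the difference between the two approaches, the sketch below computes value unfairness, absolute unfairness, and non-parity from predicted and observed ratings, following their usual formulations in the literature; differential fairness is omitted for brevity, and the function name, variable names, and toy data are our own assumptions rather than the paper's implementation.

```python
# Illustrative sketch: value unfairness (VU), absolute unfairness (AU), and
# non-parity, computed from per-item group averages of predicted and observed
# ratings for a protected and an unprotected group.
import numpy as np

def group_fairness(pred, true, item, protected):
    """pred, true: rating arrays; item: item index per rating;
    protected: boolean array marking ratings by the protected group."""
    vu_terms, au_terms = [], []
    for j in np.unique(item):
        for_j = item == j
        g, ng = for_j & protected, for_j & ~protected
        if not g.any() or not ng.any():
            continue                       # item not rated by both groups
        err_g = pred[g].mean() - true[g].mean()     # signed error, protected group
        err_ng = pred[ng].mean() - true[ng].mean()  # signed error, other group
        vu_terms.append(abs(err_g - err_ng))            # VU compares signed errors
        au_terms.append(abs(abs(err_g) - abs(err_ng)))  # AU compares error magnitudes
    value_unfairness = float(np.mean(vu_terms))
    absolute_unfairness = float(np.mean(au_terms))
    # Non-parity: difference between the groups' overall mean predicted ratings.
    non_parity = float(abs(pred[protected].mean() - pred[~protected].mean()))
    return value_unfairness, absolute_unfairness, non_parity

# Toy usage: six ratings of two items by users from two groups.
pred = np.array([4.2, 3.9, 2.5, 4.8, 3.0, 2.7])
true = np.array([4.0, 4.0, 3.0, 5.0, 3.5, 2.5])
item = np.array([0, 0, 1, 0, 1, 1])
protected = np.array([True, True, True, False, False, False])
print(group_fairness(pred, true, item, protected))
```

Note how VU and AU compare the quality of the predictions for each group, whereas non-parity only compares the predictions themselves, regardless of whether they match the users' actual preferences.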
When analyzing the differential fairness results, all of the GNN-based models generally present lower values than the classical methods for all datasets and for both the gender and age attributes. There are only a few exceptions in which a classical method reaches a value similar to or lower than that of a GNN-based method. Therefore, the performance of GNN models in relation to this fairness metric is quite good, with NGCF being the worst performer in this category.
Although the results of the non-parity metric are somewhat more irregular, the GNN-based techniques, except for NGCF, provide low unfairness values for both sensitive attributes, almost always lower than those of the classical algorithms. The results of this metric for the methods in the latter category differ greatly from one dataset to another.
Finally, the results obtained for value unfairness and absolute unfairness are quite similar to each other. Unlike what happens with the previous metrics, the results of these metrics with the GNN models are generally worse than with the classical models, except for occasional cases in which some MF method gives the worst result. This behavior occurs for both the gender and age attributes, although the results are better with the MovieLens dataset, especially for the gender attribute.
Regarding the fairness metrics at the group level, we can conclude that the worst behavior of the GNN-based methods occurs precisely with the metrics most appropriate for recommender systems, which are those based on the quality of the recommendations. In contrast, the results of these methods are better for the other type of metrics, which evaluate the similarity of the ratings across the different groups.
Considering all of the results provided, it seems that GNN-based methods have great potential for providing accurate recommendations. It can be concluded that these methods outperformed the other models used. In addition, among the GNN-based approaches, SGL provided the highest results on the MovieLens dataset. However, some types of biases may be amplified depending on the target dataset and the chosen model. In most cases, higher model performance resulted in bias amplification and unfairness toward disadvantaged groups. This shows the trade-off between accuracy and bias.
Having analyzed the results, we are in a position to answer the research questions that are the subject of this study. Each question below is followed by the related findings.
RQ1: Can the findings reported in the literature in the general context of machine learning be extended to the specific field of recommender systems? Although the literature states that GNN methods are more prone to bias than other classical techniques, the same cannot be said in the area of recommender systems, since some of these methods perform well against bias while maintaining the accuracy of the recommendations.
RQ2: Does the performance of GNN-based recommendation methods against biases depend on dataset characteristics and sensitive attributes? The study showed that the fairness metrics present irregular results for the different datasets. For example, the tested algorithms yield totally different values for the gender-related unfairness metrics in the MovieLens and LastFM datasets, even though the gender imbalance is very similar in both datasets. This reveals that the bias in the results is highly dependent on other characteristics of the data. This irregular behavior occurs with different sensitive attributes.
RQ3: Are all bias evaluation metrics appropriate for assessing user-centered fairness in recommender systems in all application domains? The literature review has allowed us to compile the most commonly used bias metrics, some of which do not assess the quality of the results but rather the similarity of the results themselves for different groups. Because these groups are formed on the basis of sensitive attributes such as gender or age, which were shown to influence preferences, these metrics are not appropriate in the field of recommender systems whose objective is to predict user preferences. In most of the application domains of RS, such as recommendations of movies, music, etc., preferences change according to these attributes. In fact, some methods use them to generate better recommendations.
RQ4: Do less bias-prone methods always provide lower-quality recommendations? Although the decrease in biases generally results in low values of quality metrics such as precision, NDCG, etc., this does not always happen. Some of the GNN-based recommendation methods present good values both for these last metrics and for the bias metrics, presenting in some cases better behavior against bias than classical methods.