1. Introduction
Polarization is a segregation process based on the construction of a mindset around an assumed confrontation between identities [1]. The polarization process does not take into account what the people of both groups may have in common, but rather their differences, creating a dynamic of rejection that can lead to growing hostility and discrimination. Ultimately, this can lead to situations in which intolerance gives rise to hate messages that are produced, in this case, on social media networks, but that can be reflected in the offline world [2]. Within such an environment, certain groups or individuals may become radicalized to violent extremism [3].
The sixteenth Sustainable Development Goal [4] promotes peaceful and inclusive societies, providing access to justice for all and building effective, accountable, and inclusive institutions at all levels. This goal aligns with the broader human rights framework by promoting societies that respect and uphold individual rights, such as the right to privacy, freedom of expression, and access to information. In accordance with this goal, the emergence of polarization in discussions on social media networks, and the responsibility of companies in this problem, is a topic attracting significant interest in society [5], although the complete relationship between digital platforms and polarized attitudes remains unclear [6]. For example, there is some work indicating that, at least for polarization in people's views of political parties, the increased use of social networks and the internet may not necessarily be increasing polarization [7]. In this sense, a systematic literature review has been developed by Iandoli et al. in [8]. The results of their analysis showed that social media contribute to increasing polarization, either by amplifying and escalating social processes that also occur offline, or in specific ways enabled by their design affordances, which also make these platforms prone to manipulation.
Additionally, polarizing content is often the target of hoaxes and fake news [9]. Thus, the ability to detect polarization is essential for the early detection of misinformation. Indeed, misinformation and disinformation are a threat to society. For example, political disinformation can interfere with elections in democratic countries, while medical misinformation can pose a public health threat [10]. For this reason, the European Union is concerned about the risk of disinformation and misinformation, and has decided to tackle it [11].
Online social networks are platforms for social and communicative interaction. Studies have shown that polarization varies greatly across platforms [12], with the strength of the results being a function of how polarization is measured [13]. Previous work has studied the presence of polarization in different concrete examples, trying to analyze the relevant characteristics in these cases. For example, the works [14,15] studied the emergence of so-called echo chambers, where users are found to interact mainly or only with the users that agree with them, and to have very few interactions with other groups. However, online discussions in social networks can also show polarization where there are answers with negative sentiment between different groups, which can be considered the most worrying situation. For example, ref. [16] studied hyper-partisanship and polarization on Twitter during the 2018 Brazilian presidential election. Their findings showed that there was more interaction within each group (pro-/anti-Bolsonaro), but there was also interaction between both groups. Indeed, there are also cases where the interaction between groups can be more relevant than that within groups, as in the case studied in [17], where the analysis of the 2016 U.S. elections on Reddit showed a significant number of interactions between pro-Trump and pro-Hillary supporters. So, the extent to which bias is due to online echo chambers and filter bubbles is also poorly understood, with evidence pointing in opposite directions [6]. A filter bubble (https://www.techopedia.com/definition/28556/filter-bubble, accessed on 12 August 2024) is the intellectual isolation that occurs when websites use artificial intelligence (AI)-driven algorithms to selectively present information based on user behavior, such as clicks, browsing history, search history, and location. This can lead to users being exposed to only one side of an issue, which can make it difficult to form an informed opinion.
A major problem is that divisive content tends to spread widely and quickly on social media. Posts that express moral outrage [18] or bash one's outparty [19], for example, tend to be particularly successful at going viral. This virality may be due, in part, to the prioritization made by social media algorithms, coupled with people's possible inclination to favor sensational content. As a consequence, when some people log on to social media, they are likely to see content that is divisive and presses their emotional buttons. In addition, these trends encourage politicians, media outlets, and potential influencers to publish divisive content, because it is the most likely to generate the engagement they desire.
It is worth noting that people exposed to a comments section when reading an online news article tend to express a more extreme view on the topic than people who read the same article without being exposed to the comments [20]. Thus, online polarization can lead to a more polarized population.
To try to mitigate the factors that may be helping the spread of divisive content, Facebook, for instance, has launched some initiatives [21], even though this kind of content may be the type that attracts the most attention from its users, and thus also the type that produces the greatest economic benefit.
Another approach to the analysis of polarization is proposed in the works of Auletta et al. [22] and Candogan et al. [23], which consider, in an analytical way, the effect of the social media algorithm on the opinions of members. On the one hand, in [22], the authors studied how the strength of the influence of the social media platform and the homophily ratio affect the probability of users reaching a consensus, and how these factors can determine the type of consensus reached. The homophily ratio measures how much individuals weigh the opinions of similar individuals compared to those of the rest of the population. In particular, they showed that when the homophily ratio is large, social media has very little capacity to determine the outcome of the opinion dynamics. However, when the homophily ratio is low, the social media influence can play an important role in the dynamics, either by making it harder to reach a consensus, or by inducing a consensus on extreme opinions. On the other hand, in [23], the authors defined an opinion dynamics model where social network users update their opinions based on their neighbors' opinions and on the content shown to them by the platform. Specifically, they showed that users' opinions approximately converge to some limiting opinion, which is either consensus, where all users agree, or persistent disagreement, where users' opinions differ.
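Such opinion dynamics can be illustrated with a small simulation. The sketch below is not the model of [22,23]: the update rule, the weights a and b, and the fully synchronous scheme are simplifying assumptions made for illustration. It does, however, reproduce the two qualitative outcomes mentioned above, consensus and persistent disagreement.

```python
def opinion_step(x, neighbors, platform, a=0.4, b=0.2):
    """One synchronous update: each user mixes their own opinion (weight
    1 - a - b), the average opinion of their neighbors (weight a), and the
    content the platform shows them (weight b). Opinions lie in [-1, 1]."""
    new = []
    for i, xi in enumerate(x):
        ns = neighbors[i]
        avg = sum(x[j] for j in ns) / len(ns) if ns else xi
        new.append((1 - a - b) * xi + a * avg + b * platform[i])
    return new

# Fully connected network, neutral platform content: opinions contract
# towards a consensus.
x = [1.0, -1.0, 0.5, -0.5]
nbrs = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
for _ in range(200):
    x = opinion_step(x, nbrs, [0.0] * 4)

# Two homophilous clusters with the platform reinforcing each side:
# persistent disagreement.
y = [0.2, 0.1, -0.1, -0.2]
clusters = [[1], [0], [3], [2]]
for _ in range(200):
    y = opinion_step(y, clusters, [1.0, 1.0, -1.0, -1.0])
```

With the first network, the spread max(x) − min(x) shrinks towards 0 (consensus); with the second, the two clusters settle close to +1 and −1, respectively.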
Because each online social network company can have its own private interests regarding how and when to control anomalous and extremist behaviors, one fundamental aspect is to define more transparent ways to monitor such possibly undesirable behaviors, so that we can inform solutions for creating healthier social media environments. From our point of view, the goal to be achieved in this direction is to provide mechanisms that allow us to detect situations where one can deduce that polarization is taking place within an online debate, and to quantify its severity by means of a criterion that can be measured.
With this goal in mind, and given that Reddit turned out to be a decisive political debate platform in different elections in the United States, such as those of 2016, in previous work [24] we introduced the problem of finding the most polarized bipartition of a weighted and labeled graph that represents a Reddit debate, where nodes represent users' opinions and edges represent agreement or disagreement between users.
To find the most polarized bipartition of a Reddit debate, our model considers the premise that, in a debate, interactions can occur in two ways: within groups and between groups. Then, an important aspect of our polarization measure is that it allows us to quantify the relevance of both kinds of interactions within the debate, and the polarization degree of the bipartition, which is based on how homogeneous each partition is, and how negative the interactions are between the two partitions.
Our measure does not try to establish a correlation between the polarization of a particular online debate and the true opinions of its users. That is, the measure takes into account only the specific comments expressed in a particular debate.
Also, in [24], we showed that finding this target bipartition is an optimization problem that can be seen as a generalization of the maxcut problem, and we introduced a basic local search algorithm to find approximate solutions. Later on, in [25], we also studied the complexity of solving real instances from Reddit discussions with both complete and approximate algorithms. These results showed that real Reddit instances could be solved with little computational cost as the size of the instances increased, and that all the instances, obtained from different subreddits, tended to show polarization values away from neutral discussions, although without reaching the case of extreme polarization.
Our aim in this paper is to further investigate the complexity of this problem by introducing an instance generation model, where a single parameter controls the polarization of the instances in such a way that it correlates with the average complexity of solving those instances. The average complexity results we obtain are consistent with our hypothesis: the higher the polarization of the instance, the easier it is to find the corresponding polarized bipartition. Although one can consider alternative ways to measure polarization in online social networks, we believe that our generation model could be used to easily check other measures. The results obtained with our generation model, together with the previous results on real instances from Reddit discussions, seem to indicate that, in practice, it may be feasible to compute the polarization of online discussions, at least with a polarization measure like the one we use here. Therefore, in view of the experimental results, checking polarization with this measure seems to be difficult only in unrealistic cases. Clearly, it is advantageous to offer algorithms for monitoring polarization in online social networks.
The structure of the rest of the paper is as follows. In Section 2, we present both the model of weighted and labeled graphs that represent an online debate, and the measure to quantify the polarization in such debate graphs, studied and developed in [24]. In Section 3, we prove that the computation of such a measure is NP-hard. In Section 4, we describe the algorithms that we use to solve the bipartite polarization problem. In Section 5, we introduce our random generation model for user debate graphs, based on a main parameter that controls the polarization of the instances. In Section 6, we characterize a case of the bipartite polarization problem that can be solved in linear time. Finally, in Section 7, we perform an empirical evaluation to study how the complexity of solving the typical instances obtained with our random model changes as we modify this parameter, and how this, at the same time, affects the polarization of the instances.
2. Problem Definition
Polarization corresponds to the social process whereby a group of persons is divided into two opposing subgroups with conflicting and contrasting positions, goals, and viewpoints, with few individuals remaining neutral or holding an intermediate position [26]. The model we consider in this work is inspired by the model defined in [27] to identify supporting or opposing opinions in online debates, based on finding a maximum cut in a graph that models the interactions between users. Moreover, as we are interested in quantifying polarization, we consider a model based on a weighted graph with labeled edges, where node weights represent the side of the user in the debate, and edge labels represent the overall sentiment between two users. Then, given a bipartition of this graph, the polarization degree of the bipartition is based on how homogeneous each partition is, and how negative the interactions are between both partitions. Finally, the measure of debate polarization is based on the maximum polarization obtained over all the possible bipartitions of the graph.
Adapting from [24], an online debate Γ on a root comment r is a non-empty set of comments that originated as successive answers to r. An online debate is modeled as a two-sided debate tree, where nodes are labeled with a binary value that denotes whether the comment is in agreement (1) or in disagreement (−1) with the root comment. Notice that in [24] we introduced this model specifically for Reddit debates. However, it is clearly a model suitable for other, similar social networks.
Definition 1 (Two-Sided Debate Tree). Let Γ be an online debate on a root comment r. A Two-Sided Debate Tree (SDebT) for Γ is a tuple ⟨C, E, W, S⟩, defined as follows:
For every comment in Γ, there is a node in C.
Node r is the root node of the tree.
If a comment c answers another comment c′, there is a directed edge (c, c′) in E.
W is a labeling function for answers (edges) that assigns to every edge (c, c′) a value W(c, c′) in [−2, 2], denoting the sentiment of the answer c with respect to c′, from highly negative (−2) to highly positive (2).
S is a labeling function for comments (nodes) that assigns to every node a value in {−1, 1}, denoting whether the comment is in agreement (1) or in disagreement (−1) with the root comment r, and it is defined as follows:
- S(r) = 1;
- For all other nodes c in C, S(c) = 1 if, for some node c′ with (c, c′) in E, either S(c′) = 1 and W(c, c′) ≥ 0, or S(c′) = −1 and W(c, c′) < 0; otherwise, S(c) = −1.
Only the nodes and edges obtained by applying this process belong to C and E, respectively.
Next, we present the formalization of a User Debate Graph based on a Two-Sided Debate Tree, where now all the comments of the same user are aggregated into a single node that represents the user’s opinion.
Definition 2 (User Debate Graph). Let Γ be an online debate on a root comment r with users' identifiers U, and let ⟨C, E, W, S⟩ be an SDebT for Γ. A User Debate Graph (UDebG) for the SDebT is a tuple ⟨V, E′, S_V, S_E⟩, where:
V is the set of nodes, defined as the set of users' opinions: for every user u ∈ U, there is a node that aggregates all the comments of u in the SDebT.
E′ is the set of edges, defined as the set of interactions between different users in the debate: there is an edge (u, u′), with u ≠ u′, if and only if some comment of user u answers some comment of user u′.
S_V is an opinion weighting scheme for V that expresses the side of users in the debate based on the side of their comments. It assigns to every node a value in the real interval [−1, 1], which expresses the side of the user with respect to the root comment, from strict disagreement (−1) to strict agreement (1), going through undecided opinions (0).
S_E is an interaction weighting scheme for E′ that expresses the overall sentiment between users by combining the individual sentiment values assigned to the responses between their comments. It assigns to every edge (u, u′) a value w in [−2, 2] that expresses the overall sentiment of user u regarding the comments of user u′, from highly negative (−2) to highly positive (2).
Only the nodes and edges obtained by applying this process belong to V and E′, respectively.
Given a User Debate Graph, a model to measure the level of polarization in the debate between its users was also introduced in [24]. It is based on two characteristics that a polarization measure should capture. First, a polarized debate should contain a bipartition of the set of users into two sets (L, R), such that the set L contains mainly users in disagreement, the set R contains mainly users in agreement, and both sets are similar in size. The second ingredient is the sentiment between users of L and R. A polarized discussion should contain most of the negative interactions between users of L and users of R, whereas the positive interactions, if any, should be mainly within the users of L and within the users of R.
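As an illustration, the following sketch implements one plausible instantiation of these two ingredients, together with a brute-force maximization over all bipartitions. The concrete formulas (homogeneity of each side times a balance factor for the consistency part, normalized cross-edge negativity for the sentiment part, and their product as the combined value) are simplifying assumptions made for this sketch, not the paper's exact definitions.

```python
from itertools import combinations

def cons(L, R, S):
    # Consistency and balance: L should hold mostly disagreeing users
    # (S < 0), R mostly agreeing ones (S > 0), with similar sizes.
    if not L or not R:
        return 0.0
    hom_l = sum(-S[v] for v in L) / len(L)              # in [-1, 1]
    hom_r = sum(S[v] for v in R) / len(R)               # in [-1, 1]
    balance = min(len(L), len(R)) / max(len(L), len(R))
    return max(0.0, hom_l) * max(0.0, hom_r) * balance

def sent(L, R, W):
    # Normalized negativity of the edges crossing the bipartition.
    cross = [w for (u, v), w in W.items()
             if (u in L and v in R) or (u in R and v in L)]
    if not cross:
        return 0.0
    return sum(max(0.0, -w) / 2.0 for w in cross) / len(cross)

def bipartite_polarization(S, W):
    # Brute-force maximization over all 2^n bipartitions (n small!).
    users = list(S)
    best, best_split = 0.0, (frozenset(), frozenset(users))
    for k in range(len(users) + 1):
        for subset in combinations(users, k):
            L = frozenset(subset)
            R = frozenset(users) - L
            p = cons(L, R, S) * sent(L, R, W)
            if p > best:
                best, best_split = p, (L, R)
    return best, best_split

# A tiny fully polarized debate: two users against (-1), two in favor (1),
# negative sentiment across the groups, positive within them.
S = {'a': -1.0, 'b': -1.0, 'c': 1.0, 'd': 1.0}
W = {('a', 'c'): -2.0, ('b', 'd'): -2.0, ('a', 'b'): 2.0, ('c', 'd'): 2.0}
value, (L, R) = bipartite_polarization(S, W)   # value == 1.0, L == {a, b}
```

On this toy instance, the search recovers the fully polarized split {a, b} versus {c, d}. The exponential loop over bipartitions is precisely what makes the exact problem hard as the number of users grows.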
To capture these two characteristics with a single value, two different measures are combined in a final one, referred to as the bipartite polarization.
Definition 3 (Bipartite polarization). Given a User Debate Graph and a bipartition (L, R) of its set of nodes, we define:
The level of consistency and balance of (L, R), cons(L, R), a real value in [0, 1] that is higher the more homogeneous each of the two sides is and the more similar in size they are.
The sentiment of the interactions between users of different sides, sent(L, R), a real value in [0, 1] that is higher the more negative the interactions between users of L and users of R are.
Notice that, according to [25], each edge label can incorporate a correction factor that is used to modify the final weight used in the sent function. However, to simplify the model notation, in this work we consider that the interaction weighting scheme already reflects this factor.
Then, the bipartite polarization of the graph on a bipartition (L, R) is a value in the real interval [0, 1] that combines cons(L, R) and sent(L, R). Finally, the bipartite polarization of the graph is the maximum of this value among all possible bipartitions (L, R).
5. A Random Generator Model for UDebG Instances
We present a random generator of UDebG instances, where the goal is to control the expected bipartite polarization of the instance by means of a single polarization parameter that lies in the range (0, 1], and where the number of nodes (users) is given by a parameter m. Note that, in the limit case where the parameter equals 0, all the nodes would have an opinion value of 0. This extreme case corresponds to the trivial problem where the bipartite polarization is zero for any bipartition. For this reason, the value 0 is not considered in our random generator.
The generation process consists of the two following steps:
Generation of the set of nodes with their opinion values. Each node is generated with an opinion value obtained with a bimodal distribution that arises as a mixture of two truncated normal distributions: the first defined on the interval [−1, 0] and the second, analogously, on the interval [0, 1], with means and standard deviations that depend on the polarization parameter. So, with this bimodal distribution, the values are concentrated around the two modes, and the standard deviation is inversely related to the parameter, so the higher the value of the parameter, the more tightly the values concentrate around the modes.
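A sketch of this first step, using only the standard library (rejection sampling for the truncated normals). Since the exact formulas for the means and standard deviations are not given in this extract, the parameterisation below (means at −p and +p, standard deviation (1 − p)/2 plus a small floor eps) is an assumption that merely reproduces the qualitative behavior described above.

```python
import random

def truncated_normal(mu, sigma, lo, hi, rng=random):
    """Rejection-sample a normal(mu, sigma) restricted to [lo, hi]."""
    while True:
        x = rng.gauss(mu, sigma)
        if lo <= x <= hi:
            return x

def generate_node_values(m, p, eps=0.05, rng=random):
    """Opinion values for m users: a 50/50 mixture of truncated normals
    on [-1, 0] and [0, 1].  Assumed parameterisation: means -p and +p,
    standard deviation (1 - p) / 2 + eps, so the two modes tighten
    around -1 and 1 as p approaches 1."""
    sigma = (1.0 - p) / 2.0 + eps
    return [truncated_normal(-p, sigma, -1.0, 0.0, rng)
            if rng.random() < 0.5 else
            truncated_normal(p, sigma, 0.0, 1.0, rng)
            for _ in range(m)]
```

For p close to 1, almost all users get opinions near −1 or 1; for p close to 0, the opinions cluster around 0, yielding a nearly neutral discussion.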
Generation of the set of edges with their sentiment values. For each node i, we randomly select a set of k target vertices, with k randomly selected up to a maximum number of out edges, to build outgoing edges from i to those vertices. The sentiment value of each edge (i, j) is generated with a truncated normal distribution on the interval [−2, 2], with a mean and standard deviation that depend on the opinion values of users i and j and on the polarization parameter, as follows: when the users i and j are on the same side (their opinion values are both positive or both negative), the mean of the distribution is positive, and the more similar their opinion values are, and the closer the polarization parameter is to 1, the closer the mean is to 2. By contrast, when the users i and j are on different sides, the more different their opinion values are, and the closer the parameter is to 1, the closer the mean is to −2. Observe that the sign of the mean depends on the signs of the opinions of users i and j.
Regarding the absolute value of the mean, it depends on the opinion values of both users, but more on that of user i. This is because, in principle, the overall sentiment of the answers from user i to user j is not necessarily equal to that of the answers from user j to user i, although it is natural to think that there will be some positive correlation between them. In any case, we think that it makes sense to force the sign of the mean to be the same when generating the sentiment in both directions. As for the absolute value, it is natural that users with stronger opinions (those with opinion values closer to −1 or 1) have stronger sentiments towards the rest of the users.
The overall intention of this way of selecting the distribution parameters is to move from a UDebG instance corresponding to a clearly neutral discussion when the polarization parameter approaches 0, to one corresponding to a highly polarized discussion when it approaches 1. We control this by making the expected values of both the opinion and sentiment weights approach the neutral value 0 as the parameter approaches 0. Observe, however, that even in that case the standard deviation for the generation of sentiment values is not 0. This allows for the generation of some negative and positive values around the mean, to ensure a more realistic generation of users where not all the answers to any user are strictly neutral. Then, as the parameter approaches 1, the opinion values tend to be more tightly concentrated around −1 or 1, and the sentiment values more concentrated around the extreme values −2 and 2, as the standard deviations of both distributions tend to their minimum.
We set that, when the parameter equals 1, the standard deviations for the generation of opinion and sentiment values are not 0, but very low, because we do not believe that, in practice, one encounters real discussions with such an extreme polarization value. So, we have preferred instead to move towards a zone of high polarization without reaching the extreme case. However, it is very easy to expand the range of possible polarizations by simply using other values to move the standard deviations of both distributions closer to 0 as the parameter approaches 1.
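Analogously, the sentiment value of an edge can be sketched as follows. The concrete mean (sign given by the product of the opinion signs, magnitude weighting the source opinion twice as much as the target's and scaled by p) and the standard deviation ((1 − p) plus a floor eps) are assumptions made for illustration; the extract does not give the exact formulas.

```python
import random

def edge_sentiment(s_i, s_j, p, eps=0.1, rng=random):
    """Sample the sentiment W(i, j) in [-2, 2] from a truncated normal.

    Assumed parameterisation: the sign of the mean follows the product of
    the opinion signs; its magnitude weighs the source opinion s_i twice
    as much as s_j and scales with p; the standard deviation (1 - p) + eps
    shrinks as p approaches 1."""
    sign = 1.0 if s_i * s_j >= 0 else -1.0
    mu = 2.0 * p * sign * (2.0 * abs(s_i) + abs(s_j)) / 3.0
    sigma = (1.0 - p) + eps
    while True:                      # rejection sampling on [-2, 2]
        w = rng.gauss(mu, sigma)
        if -2.0 <= w <= 2.0:
            return w
```

Same-side pairs with strong opinions get sentiments near 2, opposite-side pairs near −2, and everything flattens towards 0 as p decreases.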
To select the number of out edges for each node, we use a slowly growing function of m as the maximum number of out edges per node. Once more, we use such a function because we believe it better represents the number of interactions of real debates in social networks. If we compute the maximum out degree and mean out degree of each instance used in [25] (instances obtained from the authors of the paper), we obtain that the mean of the maximum out degrees of all the instances is 4.05, and the mean of the mean out degrees is 0.89, where the instances have a median number of nodes of around 50. Notice that there are some nodes (users) that only answer the root node. As the answers to the root node are not considered in the UDebG, these nodes have no out edges, so it is possible to have a mean out degree below 1 in the UDebG. With those numbers in mind, this function allows us to limit the maximum out degree to a realistic number of edges as m increases and, at the same time, keep a low mean value.
6. A P-Time Solvable Case
In this section, we present a simple case of the bipartite polarization problem that can be trivially solved in linear time. It is an interesting case because the characteristics of its instances are clearly more present in the instances of the random generator model presented in the previous section as the polarization parameter tends to 1. So, the fact that this case can be solved in polynomial time supports the experimental results that we show in the next section, where the higher the value of the parameter, the lower the typical complexity of solving the instances of the random generator.
The instances of this case satisfy the restriction that every negative edge connects users on opposite sides, and every positive edge connects users on the same side. That is, this case represents those instances where the only negative interactions are between users on opposite sides, and the only positive ones are between users on the same side. Observe that this case does not only contain the extreme case of bipartite polarization equal to one, as it also contains instances where the number of users with positive opinion value can be as different as we want from the number of users with negative opinion value.
Then, it turns out that an optimal bipartition for any such instance is simply (L*, R*), where L* contains the nodes with negative (or zero) opinion value and R* contains the nodes with positive opinion value, because any other bipartition (L, R) has a bipartite polarization smaller than or equal to that of (L*, R*). To see why, consider the two following sets that define the differences between (L, R) and (L*, R*): the first contains the nodes with negative (or zero) opinion value that are not in L but in R; similarly, the second contains the nodes with positive opinion value that are not in R but in L.
Given these differences between (L, R) and (L*, R*), it turns out the following. Regarding the consistency and balance value, if (L, R) is different from (L*, R*), then at least one of the two difference sets is not empty, so the consistency and balance value of (L, R) will be smaller than or equal to that of (L*, R*). Observe that it will be equal only when one difference set is empty and every vertex v in the other has opinion value 0.
Regarding the sentiment value of (L, R), it can be equal to or smaller than that of (L*, R*). It will be equal when all the negative edges between L* and R* are also present between L and R, all the positive edges between vertices within L* are also present within L, and all the positive edges between vertices within R* are also present within R. It will be smaller when there are some negative edges within L or within R, so that these negative edges are not present between L and R; or it will also be smaller when there are positive edges between L and R that, in (L*, R*), lie within L* or within R*. So, the sentiment value of (L, R) will be smaller than or equal to that of (L*, R*). Then, as it can be checked in linear time whether an input instance satisfies the restriction of this class, this class can be solved in polynomial time.
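A sketch of this linear-time procedure: check the restriction with one pass over the edges and, if it holds, return the bipartition that puts the users with non-positive opinion value on one side and the rest on the other. The names and the dictionary-based encoding are illustrative choices, not the paper's notation.

```python
def ptime_bipartition(S, W):
    """Linear-time solution for the restricted class: if every negative
    edge crosses sides and every positive edge stays within a side,
    return the optimal bipartition (L*, R*) with L* the users whose
    opinion value is negative or zero; otherwise return None."""
    for (u, v), w in W.items():
        same_side = (S[u] <= 0) == (S[v] <= 0)
        if w < 0 and same_side:      # negative edge inside a side
            return None
        if w > 0 and not same_side:  # positive edge across sides
            return None
    L = frozenset(v for v in S if S[v] <= 0)
    return L, frozenset(S) - L

# A compliant instance: the check passes and the split is immediate.
S = {'a': -0.8, 'b': -0.5, 'c': 0.7, 'd': 0.9}
W = {('a', 'c'): -2.0, ('a', 'b'): 1.5, ('c', 'd'): 2.0}
```

Note that the check visits every edge exactly once, which is what makes the whole procedure linear in the size of the instance.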
Even though none of the solving algorithms presented in Section 4 explicitly checks for this case, it turns out that the random initialization of the bipartition in the local search algorithm creates a bipartition that is very close to the optimal one for this case when the opinion values are close to the extreme values −1 and 1. Therefore, at least for the instances generated by our random generator when the polarization parameter tends to 1, it is highly likely that the initial bipartition generated by the local search algorithm is very close to the optimal one.
7. Experimental Results
In this section, we present an empirical evaluation of the complexity of solving the instances obtained with our random generator for different values of the polarization parameter and an increasing number of nodes (users). Our main interest is to understand how the complexity of the exact algorithm based on integer programming changes as the parameter moves from almost 0 to 1. As a second goal, we also want to compare the quality of the approximate polarization value returned by the local search algorithm. The results reported in [25] indicate that, for real instances coming from Reddit discussions, the median time complexity was always very low and that the approximation of the polarization value given by the local search algorithm was always almost identical to the exact value. The experiments were performed on an Intel® Core™ i5-6400 CPU at 2.70 GHz with 16 GB of RAM.
The results of the median time to solve the instances, as well as their median bipartite polarization, are shown in Figure 1. For each combination of the polarization parameter and m, the median value shown in the figure is obtained from a set of 50 instances produced with our random generator. We have used values of the parameter close to 0 because, as one can observe, the median time to solve the instances increases quickly as m increases for very low values of the parameter, but as we move away from 0 the median time decreases abruptly. At the same time, the median bipartite polarization of the instances increases slowly for low values of the parameter, but then it starts to increase more quickly as the parameter increases. So, these results are consistent with our hypothesis about the relation between polarization and the complexity of solving the instances.
If we compare these results with the ones obtained with real instances from Reddit in [25], which have a median number of nodes of around 50, we observe that they are also consistent, in the sense that in the Reddit discussions used in that paper the median polarization of the instances was around 0.5, and the median time was around 2 s, using the same hardware used here. To make a more complete comparison of the complexity, we have also computed the median number of nodes explored by the exact algorithm based on the SCIP solver (the same one used in that previous paper), given that in our previous paper we also reported the median number of nodes of the SCIP search tree.
Figure 2 shows the median number of nodes for the same instances as Figure 1. As we can observe, the number of nodes of the SCIP search tree follows the same qualitative behavior as the time and, for instances with median polarization around 0.5, the median number of nodes would be around 1 (by interpolating between our parameter values), as happens in the results of our previous paper with the Reddit instances.
Although our choice of the maximum number of out edges for each node in our random generator is consistent with the characteristics of the Reddit instances solved in [25], we wanted to check that the complexity patterns observed in the previous experiments were not a result of too low a value for the number of out edges. This is relevant because the NP-hardness results for maxcut in [28], and the problem we have used to show the NP-hardness of our problem, are valid for graphs with bounded degree 3, whereas the graphs generated for the experiments in Figure 1 and Figure 2 have an average degree of around 1, given the maximum value used for m (40). So, in order to eliminate the possibility of easy instances due to low-degree graphs, we have generated a second set of instances of larger size, with m ranging from 40 to 90, and where the maximum out degree of the instances is now increased, giving the value 6 as the maximum out degree for vertices, with 3.5 being the average degree.
Figure 3 shows the same plots as Figure 1, but obtained with these new instances with higher vertex degrees. Notice that the plots now start at a higher value of the polarization parameter, since the complexity of solving these instances for lower values is expected to be greater than in the previous experiments, given the higher values of m. In the left plot, we can observe that the complexity in CPU time increases as we lower the parameter value, but this increase does not seem exponential, at least in the range considered. The right plot shows the median bipartite polarization of the instances, and we can see that its behavior is very similar to that of the previous experiment. So, these results confirm that, even when increasing the vertex degree, the typical complexity of solving the instances seems to be tightly correlated with the value of the polarization parameter: the exponential time increase seems clearer for low values of the parameter, whereas higher values show a much slower increase in the typical complexity.
Finally, we were also interested in comparing the quality of the approximation provided by the local search approach, presented in Section 4, with that of the exact algorithm. Table 1 shows, for each combination of the number of nodes and the value of the polarization parameter, the minimum, median, and maximum polarization obtained in each set of instances with the exact algorithm (columns labeled SCIP BipPol). Observe that, for higher values of the parameter, we observe more variation in the polarization of the instances (difference between the min and max values). Then, in the next three columns, we show, for the same three instances of the previous columns (the instances with minimum, median, and maximum polarization), the ratio of the approximate polarization computed by the local search solver to the exact polarization of the instance (LS ratio). As we can observe, the quality of the solutions computed by the local search algorithm is perfect for instances with high parameter values (which are precisely the ones that are easily solved by the exact algorithm), and for lower values we observe only a tiny relative error. Overall, we can say that the quality of the solution provided by the local search algorithm is always satisfactory, with the advantage that its computational complexity is always linear with respect to the number of nodes, so there is no hard region for the local search approach.