1. Introduction
With the increasing popularity and development of the Internet, e-commerce and review platforms have emerged. These platforms allow consumers to post online reviews about their experiences and opinions on various product criteria, including quality and functionality. Online reviews offer valuable information to consumers who lack expert knowledge of the products they wish to purchase and help inform their purchase decisions. However, the abundance of online review data and the fake reviews posted by malicious users make it challenging for consumers to determine the authenticity of reviews and make informed purchase decisions. Thus, it is crucial to identify credible reviews from the vast amount of review data available. This study focuses on the credibility of online product reviews.
To date, numerous experts and scholars have conducted extensive research on the credibility of online product reviews, and they applied their findings to evaluate information credibility across different review platforms and social networks [
1,
2]. These studies involve developing credibility models for online reviews to validate their reliability. Researchers examine the factors that influence the credibility of reviews and investigate the distribution patterns of ratings. For instance, Verma et al. proposed a credibility model for online reviews by analyzing the influencing factors of content, communicator, context, and consumer [
3]. They also explored credibility variables associated with these factors and established a causal relationship between the variables and the credibility of online reviews by exploring 22 propositions. Banerjee et al. proposed a theoretical model of reviewer credibility based on dimensions such as positivity, engagement, experience, reputation, competence, and social connections [
4]. They employed robust regression to determine the significance of these factors. Furthermore, Sun et al. developed a reputation rating method based on user rating bias and rating characteristics [
5]. They discovered that the ratings provided by reliable users exhibit a peak distribution, whereas those provided by malicious users are substantially biased. Scholars have also studied the credibility of reviews in terms of review text, ratings, user information, and so on. Xiang et al. analyze the reliability of review data in terms of review text semantic features, sentiment, and ratings [
6]. Meel et al. presented a holistic view of how information is weaponized to fulfill malicious motives and force a biased user perception of a person, event, or firm [
7]. In addition, review text and ratings are also key factors in the study of the credibility of reviews. For instance, Hazarika pointed out that there is an inconsistency between the review text and the rating in product reviews [
8]. Almansour et al. proposed to build a system by fusing review text, ratings, and sources [
9]. Lo et al. studied the credibility of reviews from the consistency of review text and ratings [
10]. However, these studies have rarely introduced the latest text analysis tools to analyze the sentiment of review texts, and traditional statistical and mathematical approaches are not sufficiently precise or efficient for analyzing review texts. In this study, we employ the latest text analysis tools to develop a sentiment analysis-based model to assess the usefulness of comments.
Sentiment analysis (SA) technology is an important means of obtaining emotional tendencies in large-scale comments. SA is the process of gathering and analyzing people’s opinions, thoughts, and impressions regarding various topics, products, subjects, and services [
11]. SA involves analyzing text or speech using computer techniques to determine the sentiment or emotional state within the text [
12,
13]. The latest text analysis model, which can be pretrained on large-scale text data, can be fine-tuned to address sentiment analysis tasks [
14]. Yang et al. proposed a new SA model, SLCABG [
15], and the related experimental results show that the model can effectively improve the performance of text SA. At the same time, SA is also used to analyze large-scale review sets. For instance, Haque et al. used a supervised learning method on a large-scale Amazon dataset to polarize it and achieve satisfactory accuracy [
16]. Guo et al. identified the key dimensions of customer service voiced by hotel visitors using a data mining approach, latent Dirichlet allocation (LDA) [
17], and the related dataset included 266,544 online reviews for 25,670 hotels located in 16 countries. Thus, this research employs the latest text analysis model, BERT, to conduct SA on review texts and quantifies the sentiment of each review text into five levels corresponding to user ratings. This study constructs a sentiment score acquisition method for text comments that combines manually annotated sentiment training libraries with the BERT model.
Currently, there are numerous Internet-based review platforms dedicated to automobile brands, which offer rich, standardized data and thus provide valuable information resources for research. Therefore, this research focuses on automobiles as the research subject and develops a model of review usefulness. Research on ranking decisions for automotive products falls into two primary areas: rating-driven ranking decisions and text-review-driven ranking decisions. In research on rating-based ranking decisions, distributed linguistic term sets remain the core representation tool for transforming rating information; PROMETHEE-II and TODIM have been extended to the linguistic term set environment to propose product ranking approaches [
18,
19]. In the research on ranking decisions based on text reviews, by combining the sentiment ratings and star ratings based on the output of the DUTIR sentiment dictionary, scholars developed a PageRank algorithm product ranking technique based on a directed graph model [
20]. Additionally, scholars have addressed the accuracy problem of sentiment intensity recognition by developing a ranking decision approach based on ideal solutions and introducing two interval type fuzzy sets [
21]. Scholars used sentiment analysis techniques to output five types of sentiment ranking by considering the advantages of probabilistic linguistic term sets in characterizing sentiment tendencies and their distribution forms. They combined these sentiment rankings with TODIM and evidence theory to construct a related product ranking decision approach [
22]. The above studies primarily aggregate group wisdom from large-scale online reviews from a statistical viewpoint, using fuzzy sets, linguistic term sets, and other representational methods. However, they have not fully combined current advanced text analysis techniques to analyze review texts, nor have they considered word-of-mouth credibility, the aggregation weights of heterogeneous individuals, the inconsistency between text reviews and star ratings, or reviewer information disclosure. Thus, the current identification of false reviews is not sufficiently precise. Moreover, existing research does not consider the aggregation of large-scale ratings from group users, which is easily affected by large-scale fake reviews. To address these issues, this study constructs a user credibility model and applies a text analysis model to compute the user credibility weight of each review when aggregating the wisdom of group online reviews. A group user score aggregation method is also built to calculate the comprehensive score of each automobile brand. Based on this analysis, an automobile ranking decision method is developed.
In summary, this study examines how to weaken the influence of fake reviews and extract real and credible reviews for product ranking, and proposes a user credibility model based on the consistency of review sentiment orientations and ratings to solve the problem of difficult automobile ranking decisions. Compared with existing approaches, this approach examines the credibility of reviews in terms of online review text content, performs sentiment analysis on the review text, uses the high accuracy of the text analysis model, quantifies the sentiment intensity of each review text, and further analyzes the user disclosure information to compute user credibility weights. The contributions of this approach can be summarized as follows.
- (1)
A user weight model based on user disclosure information is constructed, which includes authentication information, interaction information, and driving information. Then, the sentiment analysis techniques and expert knowledge are used to measure the degree of consistency between ratings and text comments, and a comprehensive user weight calculation model is developed.
- (2)
The large-scale group ratings aggregation approach based on user region and comment time division is proposed, and a product ranking method is developed.
2. Problem Description and Data Description
Many consumers encounter difficulties in selecting an appropriate automobile because they lack professional experience and knowledge about automobiles. To address this issue, this research proposes a review credibility model based on user disclosure information and consistency to examine the aggregation of automobile reviews, and to compute and rank the overall rating of each automobile brand. The notation defined below denotes the sets and variables for this problem.
m: the number of alternative target automobiles; A_i denotes the i-th target automobile, i = 1, 2, …, m.
C_j: the data analysis demonstrates that there are eight automobile criteria; C_j is the j-th criterion of the automobile, j = 1, 2, …, 8, corresponding to {space, power, control, electricity/fuel consumption, comfort, exterior, interior, value for money}.
U_k^i: the k-th user who commented on the target automobile A_i; K_i denotes the number of users who commented on the target automobile A_i.
T_ikj: the text of the comment made by user U_k^i on criterion C_j of target automobile A_i. In this study, a user can make only one comment on an automobile in the dataset; therefore, the number of comments equals the number of users making comments.
R_ikj: the star rating of user U_k^i for criterion C_j of the target automobile A_i, R_ikj ∈ {1, 2, 3, 4, 5}.
Finally, the automobile's overall rating is determined from the above dataset by computing the mapping function R_i = F(·), where R_i represents the composite rating and F represents the mapping function.
3. A User Credibility Model Based on Consistency and User Disclosure Information
In this section, a user credibility model based on consistency and user disclosure information is developed to compute user credibility weights, as shown in
Figure 1.
The target automobile
is filtered according to constraints such as price, budget, and model, and relevant comments are crawled from the Autohome.com forum using Python crawlers; these comprise comment text, ratings, and user disclosure information. Python and SQL tools are employed to eliminate illegal comment data, and the data are normalized and stored in the format shown in
Table 1.
3.1. User Weights Based on the Consistency of Ratings and Text Reviews
(1) Method construction
Step 1.1 Obtaining text review training set
Automobile reviews are crawled from the Autohome.com forum using Python crawlers, and preprocessing is performed to remove garbled characters, missing information, and other data that do not meet the specifications. At the same time, according to the types of automobiles currently on the market, the reviews are divided into new energy automobile reviews and gasoline automobile reviews, and the review data are screened according to the score distribution to make the score distribution even. Finally, the data matrix is obtained.
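The screening step above, which evens out the score distribution, can be sketched as a simple class-balanced downsample. This is an illustrative sketch, not the authors' implementation; the function name, data layout, and optional per-class cap are assumptions.

```python
import random
from collections import defaultdict

def balance_by_rating(reviews, cap=None, seed=42):
    """Downsample reviews so each star rating is equally represented.

    `reviews` is a list of (text, rating) pairs; `cap` optionally limits
    the number kept per rating class. Illustrative sketch only.
    """
    by_rating = defaultdict(list)
    for text, rating in reviews:
        by_rating[rating].append((text, rating))
    # Keep the same number of reviews from every rating class.
    n = min(len(v) for v in by_rating.values())
    if cap is not None:
        n = min(n, cap)
    rng = random.Random(seed)
    balanced = []
    for rating in sorted(by_rating):
        balanced.extend(rng.sample(by_rating[rating], n))
    return balanced
```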
Step 1.2 Obtaining text review training set with expert sentiment values
The obtained automobile review sample library is uploaded to the purpose-built automobile review labeling system to ensure that the data labeling conforms to the specification. To obtain accurate data, L experts were hired to annotate the comment texts; each review text is annotated by each of the L experts. The experts formulate the rating rules through discussion and then mark the emotional strength of each comment text according to these rules, as illustrated in
Figure 2. Finally, the data matrix
is obtained.
Step 1.3 Aggregating expert sentiment values for the text review training set
The variance of the annotated sentiment scores of each comment is calculated, and a threshold is set to filter the data so as to maintain the stability of the comment data. Then, the average value of the expert sentiment values
for text review
is calculated as the emotional strength of the label.
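The variance filter and averaging of Step 1.3 can be sketched as follows. The threshold of 2 matches the value used in the method execution later in this section, while the data layout and the use of population variance are assumptions.

```python
from statistics import mean, pvariance

def aggregate_expert_labels(labels_per_review, var_threshold=2.0):
    """Filter and aggregate expert sentiment labels (Step 1.3).

    `labels_per_review` maps a review id to the list of sentiment values
    assigned by the L experts. Reviews whose label variance exceeds
    `var_threshold` are discarded; the rest are labeled with the mean.
    """
    aggregated = {}
    for review_id, labels in labels_per_review.items():
        if pvariance(labels) > var_threshold:
            continue  # unstable annotation: the experts disagree too much
        aggregated[review_id] = mean(labels)
    return aggregated
```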
(2) Method execution
Step 1.1 Obtaining text review training set
Crawlers were used to obtain more than 4000 new energy automobile reviews and more than 4000 gasoline automobile reviews from the forum, for a total of more than 8000 automobile reviews. Reviews containing missing information or garbled characters were removed through preprocessing, leaving 7361 automobile reviews.
Step 1.2 Obtaining text review training set with expert sentiment values
Eight experts were hired to discuss and formulate the sentiment rating rules for automobile review texts; they logged in to the annotation system to annotate the review texts obtained above. Finally, the matrix
of the text emotional annotations of the eight experts was obtained.
Step 1.3 Aggregating expert sentiment values for the text review training set
First, the variance of the sentiment labels assigned by the eight experts to each comment was calculated, and labeled data with a variance greater than 2 were removed. Then, the sentiment annotation matrix of the automobile review texts was aggregated. Finally, 6563 automobile review sentiment annotations were obtained to form the sentiment analysis training dataset. Each review covers eight automobile attributes, with one comment text per attribute, yielding 52,504 texts in total. The distribution of sentiment intensity is shown in
Table 2.
Step 2.1 Building the automobile review sentiment analysis model based on BERT
In this study, the deep learning framework PyTorch was used to build the automobile review sentiment analysis model. The main process is shown in
Figure 3.
The process was as follows: first, the automobile review text was preprocessed by removing stop words, stemming, and similar operations. The topics in the comment text were then extracted, and an efficient deep learning model was selected according to its performance on short text. This study used the BERT model to build the sentiment analysis model.
The overall framework of the model is a stack of multiple Transformer encoder layers; the structure of a single layer is illustrated in
Figure 4.
In the figure, the input words are embedded, passed through the Transformer encoding layers, and output as word encodings after the multilayer process. The affective intensity prediction process includes embedding, multi-head self-attention, feedforward, and layer normalization.
At the same time, the model parameters are preliminarily set according to the length and data volume of the automobile review text data. The parameter settings of the model are shown in
Table 3.
Step 2.2 Training text review sentiment values
In this study, the prepared training dataset described above was employed to train the BERT model within the PyTorch framework to obtain a sentiment analysis model covering eight automobile features.
Figure 5 shows the process of sentiment analysis model training.
Table 4 shows the accuracy of the model.
To quantitatively predict the sentiment intensity of all comments, the trained sentiment analysis model was employed. The model predicts an emotional intensity based on the input utterances using the above module, as illustrated in
Figure 6. The [CLS] token marks the beginning of the text; it has no meaning of its own, but its final hidden state aggregates information from the entire sentence. The outputs at all other positions are biased toward their own tokens, so the output at the first position is used. It is then processed by the linear classifier module to predict a label.
The sentiment intensity S of each criterion's review text for each user of an automobile is obtained by examining the review text with the sentiment analysis model. This yields the sentiment intensity prediction matrix for all reviews of the automobile. Its granularity is the same as that of the user star rating, taking values from 1 to 5.
Generally, fake reviews are characterized by inconsistencies between the star rating
and the sentiment intensity
of their review texts. Thus, this section proposes approaches to compute the consistency weights of the two factors. The higher the weight, the higher the degree of consistency and the more credible the review. The computation process is as follows:
The consistency weight matrix is thus obtained for the online reviews of each automobile.
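Since the exact consistency formula is not reproduced above, the sketch below uses a hypothetical normalized absolute difference between the star rating and the predicted sentiment intensity; identical values give weight 1 and maximally inconsistent values give weight 0. This is an assumption for illustration, not the paper's formula.

```python
def consistency_weight(star_rating, sentiment_intensity, scale=(1, 5)):
    """Hypothetical consistency weight between a star rating and the
    predicted sentiment intensity of its review text.

    Both inputs share the same 1-5 granularity; the weight is 1 minus
    the absolute difference normalized by the width of the scale.
    """
    lo, hi = scale
    return 1.0 - abs(star_rating - sentiment_intensity) / (hi - lo)
```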
3.2. User Weights Based on Disclosure Information
User disclosures include whether they are authenticated, their interaction index (number of replies, number of likes, and number of views), as illustrated in
Table 5, and their daily travel rate. User disclosures provide an indirect indication of the credibility of reviews.
Users who post word-of-mouth reviews on the platform can be categorized as certified and non-certified owners. Certified owners are users who have purchased the automobile; the platform's owner certification requires uploading personal information such as the certified automobile model and driving license, which is audited by the platform. In contrast, uncertified owners may be users who have not purchased the automobile in question. Reviews from certified owners are therefore more credible. The weights
are computed, as illustrated in the following formula:
The trip rate weighting is determined by combining the daily trip rate and mileage. Research suggests that many indicators of an automobile require sufficient mileage to test its performance. Thus, the higher the usage of the automobile, the deeper the user’s experience of the automobile’s performance and the more credible the reviews they publish. The usage rate of an automobile can be computed based on its daily driving rate and mileage driven. Based on statistics, the daily driving rate of the automobile is around
, and this study divides the interval accordingly to compute the daily driving rate weights, as shown in
Table 6. Furthermore, statistics on mileage posted by word of mouth in automobile forums indicate that mileage is concentrated at
. The higher the mileage, the lower the number of published word-of-mouth entries, according to which the following intervals are divided, as illustrated in
Table 7.
The usage weights
can be computed by summing up the travel rate weights and the mileage weights as follows, where
represents the automobile usage weight computation parameter:
In this study, word-of-mouth entries posted on automobile forums are viewed, liked, and replied to by other users, and the ratio of the sum of these three counts to the length of time since posting is called the interaction index. A higher interaction index indicates that the review is more widely recognized and considered more credible. Weights are assigned to the interaction index over the intervals computed with the following formula:
Step 4. The weight of
k-th user
for automobile
is given as follows based on the above three indicators:
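The three disclosure weights and their combination can be sketched as follows. All thresholds, caps, and the equal-weight combination are illustrative assumptions; the paper's interval tables (Tables 6 and 7) and exact formulas differ.

```python
def disclosure_weight(is_certified, daily_drive_rate, mileage_km,
                      views, likes, replies, days_posted, beta=0.5):
    """Hypothetical user-disclosure weight (Section 3.2).

    Combines an authentication weight, a usage weight (driving rate and
    mileage, mixed via `beta`), and an interaction-index weight.
    Every threshold below is illustrative only.
    """
    # Authentication weight: certified owners are more credible.
    w_auth = 1.0 if is_certified else 0.5

    # Usage weight: combine daily driving rate and mileage via beta.
    w_rate = min(daily_drive_rate / 0.5, 1.0)      # illustrative cap
    w_mileage = min(mileage_km / 20000.0, 1.0)     # illustrative cap
    w_usage = beta * w_rate + (1.0 - beta) * w_mileage

    # Interaction index: (views + likes + replies) per day since posting.
    interaction = (views + likes + replies) / max(days_posted, 1)
    w_inter = min(interaction / 100.0, 1.0)        # illustrative cap

    # Step 4: combine the three indicators (equal weights assumed here).
    return (w_auth + w_usage + w_inter) / 3.0
```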
3.3. User Comprehensive Weight Calculation
Fusing the consistency weight of online reviews with the user disclosure weight yields a user credibility weight for reviews
. The formula is as follows, where
is the parameter for computing the credibility weight of online reviews:
where
.
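A minimal sketch of the fusion of the two weights, assuming a convex combination with a single parameter; the paper's actual parameterization may differ.

```python
def credibility_weight(w_consistency, w_disclosure, lam=0.5):
    """Fuse the rating/text consistency weight and the user-disclosure
    weight into one user-credibility weight (Section 3.3).

    A convex combination with parameter `lam` is assumed here for
    illustration.
    """
    assert 0.0 <= lam <= 1.0
    return lam * w_consistency + (1.0 - lam) * w_disclosure
```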
4. Large-Scale Ratings Aggregation Based on Group User Division
This section proposes a multi-criteria ratings aggregation method for group users to address the dilution of user reputation weights in large user groups, thereby weakening the impact of false reviews on the overall rating. The approach first divides users into multiple sets based on their purchase location. Each of these sets is then further divided based on purchase time. User ratings are then computed using user reputation weights. Finally, the ratings of all user sets are aggregated to compute the overall rating.
Step 1.1 Group division method based on user geography
As the users of the platform are automobile owners from different regions of the country, their experiences with and needs for an automobile may vary. Therefore, the users who purchased each automobile are divided into eight collections based on geography. The geographical divisions of China are set up as
The seven sets of users by geography are represented as
Step 1.2 Group division method based on the time of user comments
Time is a crucial factor that should not be overlooked; reviews are highly time sensitive, with different reference points at different times. Additionally, the automobiles themselves are updated, and purchase services and prices change constantly. Thus, reviews must also be studied by time period. This study divides each geographically divided collection using years as the research step and the difference between the earliest and latest reviews,
, as the research quotient. The time collection is
. The set of users divided by geography and time can be expressed as
,
,
,
, where
represents the number of users who commented during time period
in region
. The group division structure is shown in
Figure 7.
Each user is assigned a weight based on the set division of users, and the final credibility weight for each comment is
,
, which denotes the credibility weight of the
(
) indicator for the automobile
by the
k-th user in year
under a geographical region
. This is then normalized to give
.
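The division into (region, year) groups and the within-group weight normalization can be sketched as follows; the field names and data layout are illustrative assumptions.

```python
from collections import defaultdict

def divide_and_normalize(reviews):
    """Divide reviews into (region, year) groups and normalize the
    credibility weights within each group (Section 4).

    `reviews` is a list of dicts with keys 'region', 'year', 'weight'.
    Returns the same reviews with an added 'norm_weight' key so that
    weights within each group sum to 1.
    """
    groups = defaultdict(list)
    for r in reviews:
        groups[(r["region"], r["year"])].append(r)
    for members in groups.values():
        total = sum(m["weight"] for m in members)
        for m in members:
            m["norm_weight"] = m["weight"] / total if total else 0.0
    return reviews
```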
To further mitigate the impact of false reviews, the original star rating
of each online review was arithmetically averaged with the predicted sentiment intensity
of its text
to obtain a new rating
for each review on each criteria of the automobile.
where
denotes user
rating of indicator
for automobile
.
After the aforementioned group segmentation, the rating corresponding to user
is
,
. The explanation of the related parameters is shown in
Table 8.
Finally, the rating is multiplied by the credibility weight and summed to obtain the final rating
.
where
, and
represents the combined rating of all users of the
indicator for automobile
.
Finally, the ratings of the eight criteria were aggregated to compute the composite rating
for automobile
.
is the weight of criteria
.
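Putting the aggregation steps together, averaging each star rating with its predicted sentiment intensity, weighting by the normalized credibility weights, and combining the eight criteria with consumer-set weights can be sketched as follows; the data layout is an illustrative assumption.

```python
def composite_rating(reviews, criteria_weights):
    """End-to-end rating aggregation (Section 4).

    `reviews` maps each criterion name to a list of
    (star_rating, sentiment_intensity, norm_weight) triples, where the
    normalized weights for a criterion sum to 1. `criteria_weights`
    maps criterion names to consumer-set weights summing to 1.
    """
    per_criterion = {}
    for crit, entries in reviews.items():
        # New rating: arithmetic mean of star rating and sentiment
        # intensity, then a weighted sum over all users.
        per_criterion[crit] = sum(w * (r + s) / 2.0 for r, s, w in entries)
    # Multi-criteria aggregation with consumer-set criterion weights.
    return sum(criteria_weights[c] * per_criterion[c] for c in per_criterion)
```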
5. Product Ranking Methods
Step 1. Collect the data and structure it to obtain the comment dataset of the automobile.
Step 2. Obtain the user weights.
Step 2.1 Calculate the user weights based on information disclosure.
Step 2.2 Calculate the user weights based on the consistency of ratings and text reviews.
Step 2.3 Obtain the user comprehensive weight.
Step 3. Aggregate large-scale ratings.
Step 3.1 Group division based on user geography and comment time.
Step 3.2 Calculate the overall user rating based on ratings and emotional analysis value of text comments.
Step 3.3 Aggregate group user ratings.
Step 3.4 Aggregate multi-criteria ratings.
Step 4. The ranking results for alternative target automobiles
are obtained based on the final overall ratings
where
.
6. Application of the Method
The popularity of e-commerce has resulted in the development of numerous review platforms. As automobiles are big-ticket goods with a huge market, specialized automobile review platforms such as Autohome have emerged. These platforms offer reviews of almost all automobiles and provide comprehensive information, making them one of the most crucial information sources for consumers. However, many consumers are misled by false reviews because of their lack of automobile-related knowledge, making it challenging for them to make an informed choice. Thus, this section applies the user credibility model based on disclosure information and consistency to assist consumers in making informed purchasing decisions.
As shown in
Table 9, six alternative target automobiles were selected based on consumers’ budgets and models. All criteria of each automobile brand were analyzed, with the computation process detailed below.
A Python crawler was written to crawl the review data of the corresponding target automobiles on Autohome as of 30 December 2022; as each automobile brand was released at a different time, the review counts differ. The crawled data were structured using a Python program.
Table 9 shows the review data obtained after removing illegal review data, such as garbled codes and null values.
The trained sentiment analysis model was employed to predict the sentiment intensity of the preprocessed online review text for each automobile. A new rating, denoted
, was obtained for each of the eight automobile features of each online review, as illustrated in
Table 10.
After obtaining the predicted sentiment intensity
of all the review texts, a consistency analysis was conducted with their corresponding original star ratings to determine a consistency weight
.
Table 11 illustrates the data for the consistency weighting
component of the Dongfeng Nissan-Xuan Yi (
).
Step 2.2 Calculation of weights based on user information disclosure
The characteristic information weights of all review publishers for each automobile were computed. The authentication weight (
), usage rate weight (
), and interaction index weight (
) were computed using the computation approach proposed in the previous section. The parameters
for computing automobile usage rate weight were set, and aggregation was used to obtain the characteristic information weight
for each review publisher, as illustrated in
Table 12 for the partial data of the Dongfeng Nissan-Xuan Yi.
Step 2.3 Combined user weighting calculation
Combining user feature information weights and review consistency weights, and setting
, yields a credibility weight
for each review about all automobile criteria for each automobile brand, as illustrated in
Table 13.
Step 3.1.1 Group division based on user purchase region
The reviews were divided into eight collections according to purchase area; a Python location function was used to perform the regional division, and the number of reviews in each collection was determined, as illustrated in
Table 14.
Step 3.1.2 Group segmentation based on user purchase time periods
In addition to the regional set division, each regional set was again divided into sets by observing the distribution of users’ time to purchase an automobile, as illustrated in
Table 15.
Step 3.1.3 Combined user weighting normalization
Table 16 shows the results of normalizing the combined weights of users after dividing each set.
Step 3.1.4 Calculation of the overall user rating
- (1)
Rating and text emotional intensity combined
The average of the raw ratings of each comment and the sentiment intensity obtained from the text sentiment quantification was computed to obtain the rating
for each comment.
Table 17 presents the average ratings for some of the comments.
- (2)
Overall user rating calculation
The composite rating of each review was computed from its rating and its user reputation weight;
Table 18 illustrates the composite ratings computed for a collection of group users.
Step 3.1.5 Aggregation of group user ratings
- (1)
Group aggregation by user purchase time
Aggregation is based on a collection of time-of-purchase users to compute an overall rating, as shown in
Table 19.
- (2)
Group aggregation by user geography
A composite rating is computed based on the aggregation of the set of users in the area of purchase, as illustrated in
Table 20. Each serial number represents a geographical group.
- (3)
Aggregation of multi-geographical ratings
The multi-criteria composite ratings were obtained by aggregating the ratings of all regional user groups.
Table 21 illustrates the multi-criteria composite ratings for the six automobile brands.
Step 3.2 Aggregation of multi-criteria ratings
A multi-criteria rating aggregation approach was implemented, and consumers set all weights
,
to obtain the overall ratings of the automobile brands.
Table 22 illustrates the calculation results of the comprehensive score when
in Formula (12) takes different values, together with the corresponding rankings of the six automobile brands.
Figure 8 shows the changing trend.
Figure 6 demonstrates the pattern of the composite ratings obtained from the three different methods.
The composite ratings of each automobile brand were obtained and ranked based on the aforementioned composite rating computation, and the findings are presented in
Table 23.
The ranking order of the automobiles changed when a consistency analysis of ratings against review texts was performed and text feature information was fused to compute the overall ratings, which differed from the original ratings. The ranking produced by the original overall ratings of the six automobiles differed from the ranking produced by the overall ratings and sentiment analysis overall ratings computed using the above approach. The change in ratings and brand rankings demonstrates that fake reviews affect the overall rating and ranking of an automobile brand. Users can employ this analysis to choose an automobile brand that suits their needs for each criterion by simply adjusting the weight W of each automobile criterion. For instance, if they prefer an automobile with a comfortable amount of space, the Toucan L is the recommended choice; if they prioritize low fuel consumption, the Dongfeng Nissan-Xuan Yi is a suitable choice.
This study employs a text sentiment analysis approach as the foundation to examine the consistency between ratings and review texts while fusing review text features, which effectively verifies the authenticity of each review. The experimental findings demonstrate that the proposed analysis approach effectively mitigates the impact of false reviews, allowing consumers to obtain comprehensive ratings of automobile brands free of the influence of false reviews, as well as criteria-specific ratings for each brand. Therefore, consumers can personalize their selection of automobile brands based on their needs.
7. Conclusions
Considering that user feature information and content feature information can also reflect the credibility of reviews, this study calculates the weight of the user feature information and content feature information of each review. Considering the inconsistency between online review texts and the corresponding star ratings, this study uses a deep learning model to analyze the sentiment of online review texts, predict the sentiment intensity rating of each text, and compare it with the corresponding star rating given by the user to obtain a consistency weight. By combining objective and subjective factors, the credibility weight of each review can be calculated more accurately. In recalculating the composite ratings, this study split the reviews of an automobile into multiple sets by location and time of purchase and calculated the composite rating within each set, taking full account of the impact of the location and time of purchase on the credibility of the reviews. Compared with similar existing studies, the research process, methods, and results of this paper are more interpretable and enlightening. In particular, in terms of user credibility, the proposed approach closely examines the personal characteristics disclosed by users. However, this study still has limitations; for example, product quality complaint data are not considered, so the information is not comprehensive enough, and missing values in future user reviews are not considered. This research is aimed at automobiles, and further research is needed on how to extend it to other fields. In the future, further research will address the consideration of users' personalized preferences in the ratings aggregation process and the integration of user-provided product quality complaint data into product rankings.