1. Introduction
In the realm of sustainable hospitality and tourism, online reviews shape travelers’ choices, so a comprehensive system that can discern between genuine and fraudulent hotel reviews is required. This study tackles the urgent problem of false “opinion spam” in the hospitality industry, which undermines the trustworthiness of online hotel reviews. Building upon the thorough studies conducted in [1,2], this research aims to construct a dependable detection system that can reliably classify hotel reviews as either genuine or deceptive. The successful application of such a system protects legitimate online hotel review sites and improves the overall traveler experience. The ultimate objective is to support the integrity of the hotel industry by giving clients reliable information so they can plan their trips with confidence.
The veracity of online reviews, especially those about hotels, is crucial in the digital age. Hotel reviews have developed into a significant factor in travelers’ decision-making. However, the prevalence of false opinions, or “opinion spam,” has grown into a worrying problem in the large sea of hotel evaluations. False hotel reviews can cause travelers financial losses and frustration and can damage the reputation of review sites. This capstone project attempts to solve the challenge of differentiating between honest and dishonest hotel reviews using the “Deceptive Opinion Spam” dataset, which includes reviews for 20 Chicago hotels categorized by sentiment; the reviews are further divided into positive and negative categories. Comprehensive investigations of positive and negative deceptive opinion spam were carried out in reference papers [1,2], respectively; these publications are essential knowledge sources for our project since they provide useful information about misleading reviews and possible ways to spot them. The objectives of this research are as follows:
The Development of a Robust Detection System: To develop a highly accurate and dependable system for identifying and classifying false hotel reviews, ensuring their separation from genuine ones. Using the foundational research publications [1,2] as a guide, we will explore the linguistic and psycholinguistic features that are indicative of deception in hotel reviews.
Performance Evaluation: To assess how well different machine learning models and methods perform at detecting false information in hotel reviews.
Sentiment–Deceit Relationship Analysis: To investigate the complex relationship between sentiment and deceit in the setting of hotel reviews, potentially illuminating how emotions influence deceptive behavior.
Theoretical Contributions: To advance our knowledge of the language patterns and cognitive characteristics connected to deceptive reviews, making theoretical contributions to the field of computational linguistics.
Assessment of Real-World Applicability: To assess the viability and efficacy of integrating the developed detection system into real hotel review platforms, thereby boosting their credibility.
This study assumes that the “Deceptive Opinion Spam” dataset, and particularly the reviews for 20 Chicago hotels, is representative of the broader landscape of hotel reviews. The findings and insights derived from this dataset are expected to generalize to a wider context. The accuracy of the detection system relies on the identification of relevant features indicative of deceptive or truthful reviews, and this study acknowledges that variations in feature selection could impact the system’s performance. The dataset contains 800 truthful and 800 deceptive reviews, for a total of 1600, which is a small sample for deep learning models and raises the risk of overfitting.
The main issue that this study attempts to solve is the pervasive problem of fraudulent online reviews. These reviews have the potential to deceive buyers, affect their purchasing decisions, and damage the reputation of review sites. Our dataset comes from the hospitality industry, which is particularly susceptible to fraudulent reviews that can damage hotel brands and cause financial losses. Identifying and classifying deceptive reviews is a serious challenge in this domain. Deep learning models like CNN, LSTM, and RNN need large amounts of data for training, yet false reviews are frequently sparser and more detailed than real ones. This scarcity of data can result in overfitting, in which models perform well on training data but struggle to generalize to unseen data. Notwithstanding these difficulties, the results of our work have important real-world applications. Review sites can use the developed models to identify and flag potentially fraudulent reviews, making the user experience more dependable and trustworthy. This affects not just the hotel sector but also other industries where customer decision-making is heavily influenced by online reviews.
The remainder of this paper is structured as follows: related works are presented in Section 2. The methodology, consisting of data preprocessing, feature engineering, and visualization, the choice of models, model training, and model performance and metrics, is presented in Section 3. The detailed results and analysis of the three deep learning techniques are illustrated in Section 4. Finally, the research is concluded in Section 5.
2. Literature Review
There are a variety of methods for identifying and understanding fraudulent activities related to fake internet reviews, and numerous creative methods have been applied in the realm of deceptive review detection. To identify misleading opinion spam, Ott et al. [2] created a machine learning framework, a precursor to automated analysis. Using deep learning techniques, Jain et al. [3] built on this to further advance the identification of false reviews. Moon et al. [4], who investigated false customer reviews using a survey-based text categorization approach, supplemented this line of research.
Parallel to this, Plotkina et al. [5] investigated the identification of fraudulent reviews, emphasizing findings from both computational and human viewpoints. An efficient representation for fraudulent opinion identification was provided by Cagnina and Rosso [6], who concentrated on both intra- and cross-domain classification. Chang et al. [7] extended these domain-specific approaches by proposing a rumor-based model to identify fake reviews in the hospitality sector, specifically in hotel reviews. Filieri [8] contributed to the discussion by examining variables that affect the validity of online user reviews.
Larger-scale data has made it possible to conduct more thorough research. In an investigation into potentially false TripAdvisor hotel evaluations, Harris [9] revealed the breadth and depth of dishonest internet review techniques. A model for fake review identification was presented by Cao et al. [10], emphasizing multi-feature learning and independent training for classification. By examining phony review comments through the prism of rumor and lie theories, Lin et al. [11] advanced the theoretical understanding of the topic. Pascucci et al. [12] created a tool for detecting fraudulent reviews in the hotel industry using computational stylometry. Authentic and fake user-generated hotel evaluations were identified by Banerjee et al. [13,14], who used language analysis to verify the veracity of online reviews; these researchers made additional contributions to the topic. In the hospitality industry, Rout et al. [15] used machine learning techniques to identify false evaluations, while Martinez-Torres and Toral [16] addressed deceptive reviews using both labeled and unlabeled data. The resilience of word and character n-gram combinations in distinguishing between false and accurate opinions was examined by Siagian and Aritsugi [17]. Other researchers investigated exaggeration in phony versus real web reviews for upscale and inexpensive lodgings. Lastly, a thorough framework for fake review identification that merges coarse- and fine-grained information was presented, providing a reliable detection technique. By establishing connections between these studies, we can obtain a more comprehensive grasp of the dynamic field of fraudulent review identification, showcasing the range of approaches and structures created to tackle this important problem in the era of online reviews. In this work, three well-known deep learning techniques—Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Convolutional Neural Networks (CNNs)—are applied using a robust strategy, drawing on insights from the abovementioned related works. This multi-pronged approach seeks to attain maximum precision in tackling the issues presented by fraudulent opinion spam and phony customer evaluations [18,19,20,21,22,23,24,25,26,27,28,29].
3. Materials and Methods
This section describes the experiments and techniques used to address the research questions concerning the identification of deceptive opinion spam in hotel reviews. The main goal is to apply cutting-edge deep learning methods—Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and Recurrent Neural Networks (RNNs)—based on the findings from the literature review, as illustrated in the workflow block diagram in Figure 1.
3.1. Data Preprocessing, Feature Engineering, and Visualization
A comprehensive investigation of data preparation methods is conducted before the models are trained. This includes managing missing values, encoding labels, and cleaning the text data. Feature engineering is used to extract pertinent information from the reviews, and visualization techniques are applied to gain a thorough understanding of the dataset. First, the customer reviews in the DataFrame are pre-processed by a Python script. It uses a clean_text function to convert text to lowercase and eliminate superfluous whitespace, digits, and punctuation. In addition, both custom and common English stop words are removed. This preprocessing improves the textual data and prepares it for further steps such as sentiment analysis or machine learning model training.
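A minimal sketch of such a cleaning step is shown below. The clean_text name comes from the description above; the CSV file name, the DataFrame column names, and the custom stop-word list are assumptions for illustration only.

```python
import re
import string

import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

CUSTOM_STOPS = {"hotel", "room", "chicago"}  # assumed domain-specific stop words
STOP_WORDS = set(stopwords.words("english")) | CUSTOM_STOPS

def clean_text(text: str) -> str:
    """Lowercase, strip digits/punctuation, drop stop words, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)                                # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)                                        # normalizes whitespace

# assumed layout: a "text" column holding the raw reviews
df = pd.read_csv("deceptive-opinion.csv")
df["clean_text"] = df["text"].apply(clean_text)
```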
For further exploratory data analysis, the distribution of classes (truthful vs. deceptive) is visualized using a count plot, as shown in Figure 2. There are 800 truthful and 800 deceptive reviews. Training machine learning models on this balanced distribution avoids bias towards either class and enables the models to acquire patterns from both classes equally.
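A count plot of this kind can be produced with seaborn; the label column name is an assumption carried over from the sketch above.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(data=df, x="deceptive")   # assumed label column
plt.title("Class distribution: truthful vs. deceptive")
plt.show()
```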
The text data are transformed using TF-IDF vectorization, and the top 20 words by TF-IDF score are visualized in a barplot, as shown in Figure 3. Each bar represents a word, and its height indicates how frequent or important that word is. This visualization highlights the most distinctive or characteristic words in the dataset and helps identify the significance of terms in the text data; it is a standard step when preparing textual data for machine learning tasks.
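A sketch of this vectorization and ranking step follows; the vocabulary size cap and the aggregation of TF-IDF weights by summing over documents are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)     # vocabulary cap is an assumption
tfidf = vectorizer.fit_transform(df["clean_text"])

# rank words by their summed TF-IDF weight across all reviews
scores = np.asarray(tfidf.sum(axis=0)).ravel()
top = np.argsort(scores)[::-1][:20]
words = np.array(vectorizer.get_feature_names_out())[top]

sns.barplot(x=scores[top], y=words)
plt.xlabel("Summed TF-IDF score")
plt.show()
```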
Histograms illustrate the distribution of word lengths and character lengths in the reviews, segmented by class, as shown in Figure 4 and Figure 5, respectively. In Figure 4, the X-axis shows the range of word lengths and the Y-axis shows the frequency or count of reviews falling into each range. The data are divided into four groups based on the combination of class (truthful or deceptive) and sentiment. Each bar in the histogram represents a range of word lengths, and its height indicates the number of reviews from a given group whose word lengths fall into that range. The histogram offers a comprehensive perspective on the distribution of word lengths across the categories, facilitating a refined examination of linguistic trends. In Figure 5, the X-axis represents the range of character lengths in the reviews, while the Y-axis represents the frequency or count of reviews falling into each range. This histogram, like the word-length one, is divided into the truthful and deceptive categories. Each bar represents a range of character lengths, and its height indicates the proportion of reviews in each category that fall within that range. The histogram enables a thorough analysis of the distribution of character lengths across the categories, exposing possible differences in language use.
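These length features and the paired histograms could be computed as follows (column names assumed from the earlier sketches).

```python
import matplotlib.pyplot as plt
import seaborn as sns

df["word_len"] = df["clean_text"].str.split().str.len()   # words per review
df["char_len"] = df["clean_text"].str.len()               # characters per review

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(data=df, x="word_len", hue="deceptive", ax=axes[0])
sns.histplot(data=df, x="char_len", hue="deceptive", ax=axes[1])
plt.show()
```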
A word cloud was generated for truthful and deceptive reviews, providing a visual representation of the most frequent words, as in Figure 6. Each word’s size in the cloud corresponds to how frequently that term occurs in the dataset, with larger words indicating higher frequency. Words may be displayed in different colors for aesthetic reasons, but frequency is represented by size alone. This word cloud can be used to rapidly discover common terms and patterns in reviews classified as truthful and deceptive; visually conspicuous words are those that stand out in terms of frequency.
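One way to generate such clouds per class with the wordcloud package (class labels and column names assumed):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for label in ("truthful", "deceptive"):
    text = " ".join(df.loc[df["deceptive"] == label, "clean_text"])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"{label} reviews")
    plt.show()
```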
Words that are distinctive or carry a lot of weight in a particular context are highlighted using the TF-IDF (Term Frequency–Inverse Document Frequency) approach. A TF-IDF bar plot illustrates how different words in a corpus are related to one another: higher TF-IDF scores indicate that a word is more distinctive within a context relative to the entire corpus. In this case, the TF-IDF bar plot shows which terms are most important in honest and dishonest reviews, as shown in Figure 7. These terms are given more weight because they are uncommon or regularly appear in one category but not the other. By suggesting terms that may be indicative of each class, this visualization helps readers understand the crucial distinctions between truthful and deceptive reviews.
A scatter plot represents the data points graphically in two dimensions. Here, words are represented in a scatter plot through word embeddings, which translate words into numerical vectors. These high-dimensional vectors are reduced to two principal components using principal component analysis (PCA), yielding a two-dimensional representation, as shown in Figure 8. A scatter plot of word embeddings helps us comprehend the relationships between words based on their semantic similarity: words that lie closer together in the plot are likely used in related settings. This graphic can highlight clusters of words with related meanings, giving a better understanding of the linguistic structure of honest and dishonest reviews.
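A sketch of this embedding-and-projection step, using gensim’s Word2Vec and scikit-learn’s PCA; the window, minimum count, and number of plotted words are assumptions, and the vector size is chosen to match the 16-dimensional embeddings the models use later.

```python
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# train Word2Vec on the cleaned reviews
sentences = [review.split() for review in df["clean_text"]]
w2v = Word2Vec(sentences, vector_size=16, window=5, min_count=5)

words = w2v.wv.index_to_key[:200]                 # 200 most frequent words (assumed)
coords = PCA(n_components=2).fit_transform(w2v.wv[words])

plt.scatter(coords[:, 0], coords[:, 1], s=10)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=7)
plt.show()
```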
The spaCy module is used to categorize words based on their part-of-speech tags, creating informative word count distributions, as depicted in Figure 9; different parts of speech such as adjectives, verbs, and nouns are emphasized. Sentiment polarity is calculated using TextBlob, and histograms depict the sentiment distribution of truthful and deceptive reviews, as illustrated in Figure 10, giving a general idea of the reviews’ emotional content in each category. Further, Table 1 showcases bigrams and trigrams for both truthful and deceptive reviews, which facilitates comprehension of the typical expressions and word combinations within each category. A horizontal bar chart compares the frequency of top words in truthful and deceptive reviews using count vectorization, as shown in Figure 11, which makes it easier to find terms that are unique to each category. These analyses collectively provide insights into the dataset’s characteristics and contribute to its preparation for subsequent modeling or further investigation.
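These three analyses could be reproduced roughly as follows; the spaCy model name, the n-gram cap, and the column names are assumptions.

```python
from collections import Counter

import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm

# part-of-speech counts across the corpus (cf. Figure 9)
pos_counts = Counter(tok.pos_ for doc in nlp.pipe(df["clean_text"]) for tok in doc)
print(pos_counts.most_common())

# sentiment polarity in [-1, 1] per review (cf. Figure 10)
df["polarity"] = df["clean_text"].apply(lambda t: TextBlob(t).sentiment.polarity)
sns.histplot(data=df, x="polarity", hue="deceptive")
plt.show()

# most frequent bigrams and trigrams (cf. Table 1)
ngrams = CountVectorizer(ngram_range=(2, 3), max_features=20)
counts = ngrams.fit_transform(df["clean_text"]).sum(axis=0).tolist()[0]
print(sorted(zip(ngrams.get_feature_names_out(), counts), key=lambda p: -p[1]))
```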
3.2. Choice of Model
Three distinct deep learning models have been selected for the dataset: a Simple Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), and a Long Short-Term Memory network (LSTM). With the ability to identify temporal correlations and spatial patterns in textual content, these models are highly suitable for sequential data and text classification applications. A detailed explanation of the three models, the CNN, LSTM, and RNN, follows:
CNNs are a kind of deep learning model mainly utilized for text- and image-processing tasks because of their capacity to identify hierarchical features and local patterns. Our CNN architecture employs an embedding layer that turns words into fixed-size dense vectors (16 dimensions in this case), followed by a convolutional layer and a global max-pooling layer. The embedding layer, which is either trained from scratch or initialized with pre-learned embeddings, helps capture word relationships. The convolutional layer applies filters to the word embeddings through a sliding-window method, identifying local trends and characteristics; our CNN employs a Conv1D layer with a kernel size of 5 and 128 filters. The global max-pooling layer reduces dimensionality while preserving the most important features by taking the maximum value from each filter. The dense layer is a fully connected layer with a single output node and a sigmoid activation function for binary classification. A total of 3,780,689 parameters make up the CNN architecture; of these, 10,497 are trainable and the remainder are not, reflecting the use of frozen pre-trained embeddings. This approach successfully identifies patterns and features in the text data.
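A minimal Keras sketch consistent with this description; vocab_size, max_len, and the Word2Vec-derived embedding_matrix are assumed to come from the tokenization step shown in Section 3.3.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

cnn = Sequential([
    # frozen pre-trained embeddings, 16 dimensions per word
    Embedding(vocab_size, 16, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    Conv1D(filters=128, kernel_size=5, activation="relu"),  # sliding-window features
    GlobalMaxPooling1D(),                                   # keep the max per filter
    Dense(1, activation="sigmoid"),                         # binary classification head
])
```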
LSTMs are Recurrent Neural Networks (RNNs) that can handle sequential data with long-range dependencies. Our LSTM design consists of two LSTM layers with dropout for regularization, followed by a dense layer for classification. Words are converted into dense vectors by the embedding layer, as explained for the CNN model. A dropout layer is added after the embedding layer and drops units at random during training to avoid overfitting. The first LSTM layer contains 50 units with its return_sequences parameter set to True so that the following LSTM layer can process full sequences; the 50-unit second LSTM layer provides the final output. The dense layer is a fully connected layer with a single output node and a sigmoid activation function. There are 3,803,843 parameters in the LSTM design overall, of which 33,651 are trainable. The LSTM’s sequential processing capacity, which helps capture dependencies within the text data, makes it appropriate for sentiment analysis and other text-based tasks.
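A matching Keras sketch; the paper specifies the two 50-unit LSTM layers, while the dropout rate is an assumption.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

lstm = Sequential([
    Embedding(vocab_size, 16, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    Dropout(0.5),                        # rate assumed; the paper only says "dropout"
    LSTM(50, return_sequences=True),     # first layer passes full sequences onward
    LSTM(50),                            # second layer emits the final state
    Dense(1, activation="sigmoid"),      # binary classification head
])
```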
Simple RNNs are also designed to handle sequential data but, because of the vanishing gradient problem, they may not handle long-range relationships as well as LSTMs. The RNN architecture combines a SimpleRNN layer with a dense layer for binary classification. The embedding layer converts words into dense vectors in the same manner as the earlier models. The SimpleRNN layer has 100 units and processes sequential data recurrently, making it suited to time series or text sequences. The dense layer, as in the other models, is fully connected, has a single output node, and uses a sigmoid activation function. There are 3,781,993 parameters in the RNN design overall, of which 11,801 are trainable. RNNs are effective for shorter sequences but less reliable than LSTMs for longer ones.
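The corresponding sketch, under the same assumptions as the two models above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

rnn = Sequential([
    Embedding(vocab_size, 16, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    SimpleRNN(100),                      # recurrent processing of the sequence
    Dense(1, activation="sigmoid"),      # binary classification head
])
```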
3.3. Training the Model, Performance of the Model, and Metrics
Three distinct deep learning models were trained on the dataset: a Simple Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), and a Long Short-Term Memory network (LSTM). The sequential LSTM model was constructed with an embedding layer, dropout for regularization, and dense layers. It was compiled using the Adam optimizer with binary cross-entropy loss and trained for 100 epochs on the training data. The sequential CNN model was built with an embedding layer, a 1D convolutional layer, and a global max-pooling layer; it was compiled with the same Adam optimizer and binary cross-entropy loss and trained for 100 epochs. The sequential RNN model was constructed with an embedding layer and a SimpleRNN layer and, like the other models, was trained for 100 epochs using the Adam optimizer and binary cross-entropy loss.
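The tokenization, embedding-matrix construction, splitting, and training loop could look as follows; the sequence-length cap, split ratios, label encoding, and verbosity are assumptions, and cnn, lstm, and rnn refer to the model sketches above. In practice this step precedes the model definitions, since it supplies vocab_size, max_len, and embedding_matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 200                                             # assumed sequence-length cap
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["clean_text"])
vocab_size = len(tokenizer.word_index) + 1

X = pad_sequences(tokenizer.texts_to_sequences(df["clean_text"]), maxlen=max_len)
y = (df["deceptive"] == "deceptive").astype(int).values   # assumed label encoding

# embedding matrix built from the Word2Vec vectors trained earlier
embedding_matrix = np.zeros((vocab_size, 16))
for word, i in tokenizer.word_index.items():
    if word in w2v.wv:
        embedding_matrix[i] = w2v.wv[word]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

for model in (cnn, lstm, rnn):
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(X_train, y_train, epochs=100,
                        validation_split=0.1, verbose=0)
```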
The models’ performance was assessed on the test set following training. Each model’s accuracy on the test data quantifies its ability to generalize to new cases. The following evaluation metrics were employed to gauge model performance. Accuracy is the percentage of correctly classified instances out of all instances in the test set. The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate, offering valuable information about the model’s class discrimination capabilities. The area under the curve (AUC), calculated as the area under the ROC curve, provides an overall measure of the model’s performance; higher AUC values indicate better discriminating power.
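With scikit-learn, the ROC curve and AUC for a trained model can be computed along these lines (shown for the CNN sketch; the test arrays come from the split above).

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_prob = cnn.predict(X_test).ravel()          # predicted probabilities on the test set
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.plot(fpr, tpr, label=f"CNN (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], "k--", label="chance (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```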
3.4. Classification Report
A classification report provides a more thorough understanding of the models’ performance for each class (truth and deception) by presenting metrics such as precision, recall, and F1-score.
Taken as a whole, these metrics offer a thorough evaluation of how well the trained models categorize reviews as honest or dishonest. The classification report details the precision, recall, and F1-score for each class, while the ROC curve and AUC values indicate discriminatory power; together they enable a more comprehensive assessment of the models’ performance.
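Such a report can be generated with scikit-learn, assuming the 0.5 decision threshold and the class encoding used throughout these sketches.

```python
from sklearn.metrics import classification_report

# threshold predicted probabilities at 0.5 to obtain class labels
y_pred = (cnn.predict(X_test).ravel() > 0.5).astype(int)
print(classification_report(y_test, y_pred,
                            target_names=["truthful", "deceptive"]))
```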
3.5. Overall Project, Improvements, Applications, and Results
The dataset is tokenized, preprocessed, and converted into sequences to be used as input to the models. Word embeddings are used, and Word2Vec is employed to construct an embedding matrix. A dropout layer for regularization is incorporated into the sequential LSTM model. The models are trained and assessed, and their performance is displayed using accuracy plots and a Receiver Operating Characteristic (ROC) curve; the LSTM model attains a particular level of accuracy on the test set. The sequential CNN model is constructed from three layers: an embedding layer, a 1D convolutional layer, and a global max-pooling layer. This model undergoes training, evaluation, and performance visualization, and attains a particular level of accuracy on the test set.
A SimpleRNN layer and an embedding layer are used to build a sequential RNN model. The model undergoes training, evaluation, and performance visualization, and achieves a specific accuracy on the test set. ROC curves are plotted for every model to create a visual comparison of their performance. Classification reports provide information on each model’s precision, recall, and F1-score. Each model’s training history is displayed on a single plot, making it possible to compare their training and validation accuracies over epochs.
The selected models can be used to categorize customer reviews as honest or dishonest, which helps identify opinion spam. Further enhancements could include experimenting with other architectures, using ensemble approaches, and fine-tuning hyperparameters to optimize model performance. This comprehensive method, using LSTM, CNN, and RNN models, offers a deep analysis of the dataset and reveals the efficacy of each model for the particular goal of identifying honest and dishonest reviews.
4. Results
Sentiment analysis is an essential part of natural language processing that seeks to ascertain a text’s emotional tone. In this study, using a dataset with 800 instances each of truthful and deceptive reviews, we investigated the performance of three distinct deep learning models: a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), and a Recurrent Neural Network (RNN). Their evaluation yielded different conclusions about each algorithm’s ability to extract sentiment from text. The best-performing model was the CNN, which obtained accuracies of 98%, 77%, and 80% on the training, testing, and validation sets, respectively, as illustrated in Figure 12. Its capacity to identify local patterns within consecutive data was advantageous in comprehending the intricate context of the reviews.
With accuracies of 60%, 61%, and 60% on the training, testing, and validation sets, respectively, the LSTM model came in second, demonstrating balanced precision and recall for both the truthful and deceptive classes. The RNN model trailed behind, attaining 87%, 57%, and 58% accuracy on the training, testing, and validation sets, respectively, as indicated in Figure 13, suggesting it had difficulties in accurately capturing sequential relationships. A thorough understanding of each model’s performance can be obtained from Tables 1–6. The CNN model showed high precision, recall, and F1-score values for both classes, indicating balanced performance across the training, testing, and validation sets. The precision, recall, and F1-score values of the LSTM model showed a similar pattern, averaging 60% for both classes. For both classes, the RNN model’s precision, recall, and F1-score values were approximately 57%, indicating its limits.
A Receiver Operating Characteristic (ROC) curve is a graphical representation of a binary classification model’s performance across different threshold settings, and the area under the curve (AUC) is a scalar value that measures the model’s overall performance based on its ROC curve. The CNN model’s AUC of 0.82 is reasonably good, indicating strong discriminatory power and successful class distinction. The LSTM model’s AUC of 0.65 is moderate in comparison, suggesting some discriminating ability but room for improvement. With an AUC of 0.60, the RNN model shows weak discriminatory power, closer to random guessing (0.5), indicating a limited capacity to discern between classes. In this comparison, the CNN model is the most effective, followed by the LSTM, while the RNN model shows the worst discriminatory power. Overall, AUC values closer to 1 indicate better performance, as represented in Figure 13.
5. Conclusions and Future Scope
In conclusion, deep learning models—Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and Recurrent Neural Networks (RNNs)—were explored for sentiment analysis, providing subtle insights into their efficacy. When it comes to extracting sentiment from textual data, the CNN model surpasses the LSTM and RNN models in terms of accuracy, precision, recall, and F1-score. This study shows that different deep learning models perform at different levels in this task. The CNN model outperforms both the RNN and LSTM with accuracies of 98%, 77%, and 80% on the training, testing, and validation sets, respectively, demonstrating greater accuracy and discriminatory power. At around 60%, the LSTM shows moderate accuracy but competitive precision and recall for both the truthful and deceptive classes. The RNN model performs poorly, particularly at capturing sequential relationships, as evidenced by its testing and validation accuracies of only 57% and 58% despite an 87% training accuracy, a gap suggestive of overfitting.
The models’ performance is further contextualized by their area under the curve (AUC) scores. The strong performance of the CNN model is supported by the Receiver Operating Characteristic (ROC) curve and AUC analyses, with an AUC of 0.82, a value that indicates good class distinction and strong discriminatory power. With an AUC of 0.65, the LSTM model is considered moderate, indicating some discriminating power but room for improvement. With an AUC of 0.60, the RNN model has poor discriminatory power; it is less able to distinguish between classes and is closer to random guessing (0.5). The comparative analysis highlights the effectiveness of the CNN model, followed by the LSTM model, with the RNN model showing the worst discriminatory power. To reduce overfitting and enhance model generalization, future research should concentrate on overcoming data restrictions. Studying deep learning architectures beyond RNNs, LSTMs, and CNNs might reveal more about sentiment analysis. Improvements in model performance could come from adjusting model parameters, adding larger datasets, and experimenting with pre-trained embeddings. To take advantage of the strengths of several algorithms, research could also explore hybrid models or ensemble approaches. The effectiveness of sentiment analysis algorithms must be continually assessed and adapted to changing datasets and language subtleties.