Article

When BERT Started Traveling: TourBERT—A Natural Language Processing Model for the Travel Industry

1 Institute of Business Informatics, Johannes Kepler University Linz, 4040 Linz, Austria
2 Department of Innovation and Management in Tourism, Salzburg University of Applied Sciences, 5412 Salzburg, Austria
3 Department of Tourism and Service Management, Modul University Vienna, 1190 Wien, Austria
* Author to whom correspondence should be addressed.
Digital 2022, 2(4), 546-559; https://doi.org/10.3390/digital2040030
Submission received: 9 September 2022 / Revised: 25 October 2022 / Accepted: 28 October 2022 / Published: 11 November 2022
(This article belongs to the Collection Digital Systems for Tourism)

Abstract

In recent years, Natural Language Processing (NLP) has become increasingly important for extracting new insights from unstructured text data, and pre-trained language models can now perform state-of-the-art tasks such as topic modeling, text classification, or sentiment analysis. Currently, BERT is the most widespread and widely used model, but it has been shown that BERT can be further optimized for domain-specific contexts. While a number of BERT models that improve downstream-task performance in other domains already exist, an optimized BERT model for tourism has been lacking. This study thus aimed to develop and evaluate TourBERT, a pre-trained BERT model for the tourism industry. It was trained from scratch and outperforms BERT-Base in all tourism-specific evaluations. This study therefore makes an essential contribution to the growing importance of NLP in tourism by providing an open-source BERT model adapted to the requirements and particularities of the tourism domain.

1. Introduction

Tourism products and services rely heavily on detailed descriptions [1] as they cannot be tested in advance. In addition, tourism services are co-created with the customer and are relatively expensive compared to everyday products. As a result, product and service descriptions tend to be extensive and text-heavy. Alongside detailed descriptions from the supply side, user-generated content (UGC) continues to gain relevance [2]. Whether on review platforms, such as TripAdvisor, or social media channels, such as Twitter, Facebook, or Instagram, individuals are constantly sharing their travel experiences and, in turn, influencing other users [3]. This content is of particular importance for tourism providers as they seem to be losing their power to UGC [4]. Therefore, to better understand consumer behavior and adapt marketing initiatives accordingly, the automated analysis of texts using NLP methods is becoming increasingly important for both academia and the tourism industry [5]. At the same time, more powerful language models are emerging, enabling more advanced text analyses to be conducted.
BERT, developed by Google, is considered one of the most powerful and widely used language models. On the one hand, this pre-trained language model has been trained on a huge generic corpus and can be used universally. On the other hand, however, it has its weaknesses when it comes to domain-specific applications. Therefore, this paper aims to develop and evaluate a domain-specific BERT model for tourism. The proposed TourBERT model was pre-trained from scratch using 3.6 million tourist reviews and 46,000 descriptions of tourist services, attractions, and sights from more than 20 different countries around the world. This study makes a unique contribution to the extant body of natural language models and tourism research, as the evaluation of TourBERT has proven its superiority to BERT in all tasks concerning tourism-relevant content. TourBERT can be regarded as a state-of-the-art language model for the tourism industry and for academic text analytics alike, since the pre-trained model can be fine-tuned to perform numerous tasks such as text representation, text classification and clustering, topic modeling, sentiment analysis, or question answering.

2. Literature Review

With an increase in computational power and more effective and efficient algorithms, abundant research has been conducted in recent years, both within academia and the tourism industry, on how to best process textual data. According to Wennker [6], 80% of all data produced is text-based, which underscores Poon’s [7] statement that "information is the lifeblood of tourism." Especially since the rise of UGC, a vast amount of unstructured text has become available, the analysis of which can provide important insights into tourists and their wants, needs, and experiences that are highly relevant for tourism marketing [5].
Nevertheless, the analysis of text data is challenging and requires the conversion of text into numerical values, which then serve as input for powerful machine learning algorithms. Over the past years, a wide variety of language models have been developed, ranging from the pure analysis of word frequencies to complex transformer models that are able to process multilingual data and take content as well as context into account. Especially through the concept of transfer learning, which is based on the use of pre-trained models, huge progress in NLP has been achieved. However, since such language models are trained on huge corpora, the training process is extremely time-consuming and computationally intensive. The training corpus therefore determines the field of application and the domain in which the model will perform well [8].
Since its launch in 2018, Google’s Bidirectional Encoder Representations from Transformers (BERT) has been one of the most significant natural language models [9]. BERT-Large, which is based on a transformer architecture, is considered one of the most powerful language models, with 24 layers, 16 attention heads, and 340 million parameters in total [10]. It is a model pre-trained from scratch and can be fine-tuned to perform numerous downstream tasks such as text classification, question answering, sentiment analysis, extractive summarization, named entity recognition, or sentence similarity [8]. BERT-Base was pre-trained in a self-supervised way on a large English corpus consisting of raw texts from the BookCorpus dataset, which includes over 11,000 books, in addition to the entire English Wikipedia. The nature of this training corpus implies that BERT was trained on a generic, unspecified domain [11]. Yet, for domain-specific applications and downstream tasks, it has been shown that pre-training BERT on a large domain-specific corpus can be beneficial as it allows the model to better capture linguistic peculiarities [12]. For example, several BERT variants have been pre-trained for the financial (FinBERT) [13], medical (Clinical BERT) [14], biological (BioBERT) [15], and computer science sectors (SciBERT) [16]. For tourism-related content, however, a domain-specific adaptation of BERT has not been available so far, which is why this paper introduces TourBERT. TourBERT is presented and evaluated in more detail in the following sections.

3. Methodology and Results

The following sections describe the methodological procedure for the development of the TourBERT language model. The pre-training of TourBERT will be presented first, followed by its model evaluations. For the sake of clarity, the results of the five different evaluations are reported immediately after the description of each evaluation process.

3.1. Pre-Training TourBERT

TourBERT uses BERT-Base-Uncased as its underlying architecture and was trained from scratch, unlike BioBERT or FinBERT, which were both further pre-trained from the BERT-Base initial checkpoint. The training corpus was pre-processed by lowercasing the data and splitting it into sentences, resulting in 22,601,333 sentences in total. Thereafter, two TourBERT models were trained, one with a SentencePiece and one with a WordPiece tokenizer. The motivation for using SentencePiece in addition to the WordPiece tokenizer conventionally used with BERT was to keep open the option of extending TourBERT into a multilingual model in the future, since SentencePiece can account for grammatical peculiarities of complex languages such as Chinese. To obtain a custom vocabulary, SentencePiece (32,000 tokens) and WordPiece (30,522 tokens) tokenizers were trained, the latter being equal in size to the BERT-Base vocabulary. Pre-training of both models was run for 1M steps on a single Google Colab Pro TPU instance and took about three days in total.
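To illustrate the tokenizer-training step, the following minimal sketch trains a WordPiece tokenizer with a BERT-Base-sized vocabulary using the Hugging Face tokenizers library; the corpus file name is a placeholder, and the snippet is an assumption-based example rather than the authors' original pipeline.

```python
# Minimal sketch of the tokenizer-training step (assumption: the pre-processed,
# lowercased, sentence-split corpus is stored one sentence per line in
# "corpus_sentences.txt", a hypothetical file name).
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus_sentences.txt"],
    vocab_size=30_522,  # equal to the BERT-Base vocabulary size used for the WordPiece variant
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".", "tourbert")  # writes tourbert-vocab.txt for use in pre-training
```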

3.2. TourBERT Model Evaluation

The evaluation of TourBERT was performed using both quantitative and qualitative measures. Two sentiment classification tasks were used for the supervised evaluation, while topic modeling, a synonyms search, and a within-vocabulary word similarity analysis were applied as part of the unsupervised evaluation. It is important to note that only the SentencePiece variant of TourBERT was used for the unsupervised evaluations, since both tokenizer variants showed comparable performance in the supervised tasks, as will be shown below.

3.2.1. Supervised Evaluation: Sentiment Classification

For classification purposes, BERT’s architecture must be extended with a classifier layer to enable predictions. This can be achieved in numerous ways; one of the most widely used approaches is to attach a softmax layer on top of the BERT model. A more advanced classifier design involves a Long Short-Term Memory (LSTM) layer, which is useful for representing long sequences that exceed BERT’s maximum input length. In the case of TourBERT, the outputs were passed through a single feed-forward layer, a simple classifier commonly used for benchmarking different transformer models against each other. Keeping in mind that such an architecture would not yield state-of-the-art results, the aim was simply to demonstrate that TourBERT can surpass BERT-Base, not to achieve the best possible results on a particular dataset.
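As an illustration of this setup, the sketch below (an assumption-based example using PyTorch and the Hugging Face transformers library, not the authors' original code) attaches a single feed-forward layer to the pooled BERT output; the model name and class count are placeholders.

```python
# Minimal sketch: a single feed-forward classification layer on top of BERT's
# pooled output, used here purely for benchmarking purposes.
import torch
from torch import nn
from transformers import BertModel

class BertSentimentClassifier(nn.Module):
    def __init__(self, model_name: str, num_classes: int):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # map the 768-dimensional pooled vector to class logits
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(outputs.pooler_output)

# e.g. three classes for the Tripadvisor dataset, two for the Europe hotels dataset
model = BertSentimentClassifier("bert-base-uncased", num_classes=3)
```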
The sentiment classification task was performed on two publicly available datasets of hotel reviews. The first dataset contains 69,308 hotel reviews from Tripadvisor [17] and includes three sentiment classes: {-1: “negative”, 0: “neutral”, 1: “positive”}. The second dataset contains 515,000 reviews of European hotels [18]. Here, only reviews with either negative or positive labels were used, which turned the problem into a binary classification with the following two classes: {-1: “negative”, 1: “positive”}. The dataset contains attributes such as hotel name, number of reviews, and geographical position, as well as negative and positive reviews from each reviewer. If a user had left only positive reviews, the value for the negative reviews was left blank, and vice versa. To prepare this dataset for binary classification, only reviews from users who left either exclusively negative or exclusively positive reviews were included. Using this approach, 35,000 positive and 35,000 negative reviews were sampled, resulting in 70,000 samples in total.
Both datasets were first pre-processed and then split into training, validation, and testing sets according to an 80%/10%/10% proportion. The pre-processing procedure included lowercasing and the removal of punctuation and non-ASCII characters from the text. Evaluation results for both tasks are shown in Table 1 and Table 2 below, while Figure 1 presents the ROC curve and AUC score for the TourBERT and BERT-Base models in the second task.
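The following sketch shows one way to implement the described pre-processing and 80%/10%/10% split; it is an assumption-based example, and the file and column names ("hotel_reviews.csv", "review", "label") are placeholders rather than the datasets' actual field names.

```python
# Minimal sketch of the pre-processing (lowercasing, removing punctuation and
# non-ASCII characters) and a stratified 80/10/10 train/validation/test split.
import re
import pandas as pd
from sklearn.model_selection import train_test_split

def clean(text: str) -> str:
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^\x00-\x7F]+", " ", text)   # drop non-ASCII characters
    text = re.sub(r"[^\w\s]", " ", text)         # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("hotel_reviews.csv")            # hypothetical file name
df["review"] = df["review"].astype(str).map(clean)

train, rest = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["label"], random_state=42)
```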

3.2.2. Unsupervised Evaluation: Visualization of Photo Annotations

The first unsupervised evaluation task was the visualization of photo annotations via the TensorBoard Projector. For this task, a dataset of 48 photos depicting different tourism activities, such as sports activities, sightseeing, and shopping, amongst others, was used. Next, 622 people were asked to manually label these photos by assigning two bi-gram tags to each individual photo. These annotations were then visualized using the TensorBoard Projector API, which allows the original photos to be displayed on a 2D or 3D plot within their respective cluster centers. Finally, UMAP was applied as the down-projection method, and the evaluation consisted of inspecting and comparing the quality of the group separation on the plot. The visualization results for BERT-Base and TourBERT are presented in Figure 2 and Figure 3, respectively.
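A minimal sketch of such a visualization, assuming PyTorch's TensorBoard bindings and using random stand-in data in place of the actual annotation vectors and photos, is shown below; the directory and tag names are hypothetical.

```python
# Minimal sketch: writing annotation embeddings plus thumbnail images to
# TensorBoard so that the Projector can down-project them (e.g. with UMAP)
# and display each photo at its embedding position.
import torch
from torch.utils.tensorboard import SummaryWriter

embeddings = torch.randn(48, 768)        # stand-in for the 48 annotation vectors (TourBERT or BERT-Base)
labels = [f"photo_{i}" for i in range(48)]
thumbnails = torch.rand(48, 3, 64, 64)   # stand-in for the resized photos

writer = SummaryWriter(log_dir="runs/photo_annotations")
writer.add_embedding(embeddings, metadata=labels, label_img=thumbnails, tag="tourbert")
writer.close()
# Inspect with: tensorboard --logdir runs/photo_annotations
```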
The purpose of such a visualization is to evaluate the separation of the clusters that naturally form from the down-projection method. Overall, one can observe that the TourBERT vectors lead to better group separation and that the pictures within each group contain similar content. In contrast, when observing the results produced with the BERT-Base vectors, the content of the pictures appears to be heavily mixed, without any visible cluster separation.

3.2.3. Unsupervised Evaluation: Topic Modeling

A subsequent unsupervised evaluation was undertaken by applying a topic modeling approach. For this, 5000 Instagram posts with the hashtag #wanderlust were extracted from public accounts and crawled using the Python Scrapy library. Instagram is primarily a photo-based platform, and the textual description of a post is often either limited to hashtags and emojis, unrelated to the photo, or missing entirely. Therefore, the images were annotated using the Google Cloud Vision API, and a TourBERT vector was generated for each photo annotation. The photo annotations were then analyzed based on their similarity using a K-means clustering approach. The number of clusters was chosen using the silhouette score, which resulted in 25 clusters. To enable cluster center visualization on a 2D plot, PCA was used as the down-projection method to transform the 768-dimensional BERT embeddings into a two-dimensional map.
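The sketch below outlines this clustering pipeline under stated assumptions: the annotation vectors are taken to be a saved NumPy array, the file name is hypothetical, and the search range for the number of clusters is illustrative rather than the authors' exact setting.

```python
# Minimal sketch: choose the number of clusters via the silhouette score
# (the paper reports an optimum of 25), fit K-means, and down-project the
# 768-dimensional cluster centers to 2D with PCA for plotting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

vectors = np.load("annotation_vectors.npy")      # hypothetical file of TourBERT annotation vectors

scores = {}
for k in range(5, 41):                           # illustrative search range
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(vectors)
    scores[k] = silhouette_score(vectors, labels)
best_k = max(scores, key=scores.get)

kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(vectors)
centers_2d = PCA(n_components=2).fit_transform(kmeans.cluster_centers_)
```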
Figure 4 below shows the cluster centers on a 2D plot, where the size of a cluster center is proportional to the cluster’s population size. A visualization as such allows the quality of the topic separation to be evaluated.
From Figure 4, one can notice that the cluster centers produced with the down-projected TourBERT vectors reveal better separation than those produced with BERT-Base ones.
Another aspect of the topic modeling analysis was the estimation of word similarity within the same cluster. Topic words for both BERT-Base and TourBERT can be seen in Table 3 and Table 4.
Although the hashtag #wanderlust may lead one to think of photos that, to some extent or another, contain natural landscapes, the topic model produced with TourBERT vectors was able to identify distinct topics like “underwater world” (topic 1), “beach activities” (topic 2), “food and drink” (topic 7), “vehicle” (topic 11), or “animals” (topic 24). An attempt to find similarly grouped clusters for the BERT-Base model did not result in such success since nearly every topic includes landscape descriptions. While several distinct topics were indeed found by the model, the majority of them contain mixed concepts, each one including terms describing nature or landscapes.
For better visibility and to gain a better understanding of the quality and distinction of the topics, another visualization for each of the two topic models was produced, as can be seen in Figure 5 and Figure 6. Each figure contains a table, with the first column presenting words for a given topic (see Table 3 and Table 4) and all subsequent columns depicting the top 10 most similar samples, i.e., photos for that topic.
When inspecting the results from both models, it becomes apparent that the clusters created with TourBERT are much more homogeneous within themselves and quite heterogeneous across clusters. On the other hand, those generated by BERT-Base occasionally include photos that are relatively dissimilar to each other despite belonging to the same topic, such as in topic 3.

3.2.4. Unsupervised Evaluation: User Study

To further investigate the quality of each topic produced by the abovementioned models and prove the assumptions made thus far, a user study was conducted on the same set of images and annotations to statistically evaluate the results. First, a set of the 10 most similar photos for each of the 25 clusters produced by BERT-Base and TourBERT was created. Thereafter, users were asked to evaluate the similarity of the photos within each of the 50 clusters using a seven-point Likert scale, with possible answers ranging from “very similar” to “very different” (see Figure 7). Similar to measuring the intercoder reliability in qualitative studies, this evaluation approach allowed for an intersubjective perception of the quality of the clusters. Throughout this process, the image clusters were shown to the participants in a rotating manner, i.e., alternating randomly.
To investigate this study’s results, a paired-samples t-test was performed with SPSS, the results of which are presented in Table 5 below. The coding ranged from 1 (very similar) to 7 (very different), with mean values of 3.78 and 2.52 for BERT-Base and TourBERT, respectively, a difference that is highly significant (two-sided p < 0.001). Effect size was measured with Cohen’s d, yielding a point estimate of 2.418 (standardizer = 0.518), which corresponds to a large effect.
From the results above, it can be concluded that the similarity between the annotated images was perceived significantly better with TourBERT than with BERT-Base.
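As an illustration of the reported analysis, the sketch below mirrors the paired t-test and paired Cohen's d in Python; the rating vectors are random stand-in data generated with the reported means and standard deviations (n = 82), not the actual study data, so the outputs will not reproduce Table 5 exactly (in particular, the stand-in ratings are uncorrelated across conditions).

```python
# Minimal sketch of a paired-samples t-test and Cohen's d for paired data
# (standardizer = standard deviation of the pairwise differences).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ratings_bert = rng.normal(3.78, 0.72, size=82)      # stand-in per-participant mean ratings (BERT-Base clusters)
ratings_tourbert = rng.normal(2.52, 0.62, size=82)  # stand-in per-participant mean ratings (TourBERT clusters)

t_stat, p_value = stats.ttest_rel(ratings_bert, ratings_tourbert)

diff = ratings_bert - ratings_tourbert
cohens_d = diff.mean() / diff.std(ddof=1)
print(t_stat, p_value, cohens_d)
```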

3.2.5. Unsupervised Evaluation: Synonyms Search

Assuming that BERT-Base, having been trained on a generic corpus, would produce more generic results than TourBERT, which was trained on a tourism-specific corpus, it was hypothesized that a similarity search for tourism-related terms would yield better results with TourBERT than with BERT-Base. Therefore, with the help of a tourism-domain expert, words that carry multiple semantic meanings in general language as well as in tourism-specific contexts were selected. For example, the word “transfer” has multiple meanings and is usually associated with “transformation”, “transplantation”, and so on; from a tourist’s perspective, however, associations such as “taxi”, “pick up”, or “hotel transfer” come to mind. The output of the top eight most similar words for each term can be seen in Table 6 and Table 7 for BERT-Base and TourBERT, respectively.
From a technical perspective, the native implementation of BERT does not allow for querying the most similar words since, unlike Word2Vec or FastText models, BERT does not contain static vectors but, rather, produces them dynamically. As a result, it can output two completely different vectors for the same word depending on the context in which it is mentioned. Since the intention was still to compare words as standalone, context-independent units, an algorithm was constructed that enables any BERT-like model to query its vocabulary for the most similar words. The algorithm works as follows: in the first step, pairwise similarities between all the words in BERT’s vocabulary were computed, resulting in a 30,522 × 30,522 matrix. Then, using the KDTree algorithm from Python’s scikit-learn library, a search index was built on that matrix, which allows for fast querying.
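The sketch below outlines this procedure under stated assumptions: the static, context-independent token vectors are taken from the model's input embedding matrix (the exact vector-extraction step is not specified here), and the full 30,522 × 30,522 similarity matrix is built only for illustration, as it is very memory- and compute-intensive in practice.

```python
# Minimal sketch: pairwise similarities over the whole vocabulary plus a
# KDTree search index for querying the most similar words.
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KDTree
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# one static vector per vocabulary entry (30,522 x 768)
token_vectors = model.get_input_embeddings().weight.detach().numpy()

# pairwise similarity matrix (30,522 x 30,522) and a KDTree index built on it
similarity_matrix = cosine_similarity(token_vectors)
index = KDTree(similarity_matrix)

def most_similar(word: str, k: int = 8):
    row = similarity_matrix[tokenizer.vocab[word]].reshape(1, -1)
    _, neighbors = index.query(row, k=k + 1)     # the first hit is the query word itself
    return [tokenizer.convert_ids_to_tokens(int(i)) for i in neighbors[0][1:]]

print(most_similar("destination"))
```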
When comparing synonyms produced by BERT-Base and TourBERT, one can see that TourBERT captures the tourism-specific meaning of a given word almost perfectly. On the contrary, BERT-Base captures a more generic meaning of the same word. For example, TourBERT associates the word “ticket” with “entrance” and “wristband”, whereas BERT-Base considers the same word in the scope of public transport, presenting words like “trains”, “bus”, and “metro”. To provide another example, the word “destination” is associated via the BERT-Base model with words such as “dying”, “choice”, “lame”, and “address”, whereas TourBERT outputs “spot”, “attraction”, “place”, and other words that are closely related to “destination” in a tourism context.

4. Conclusions

In tourism research as well as in the tourism industry, the automatic analysis of texts is becoming increasingly important. Language models are needed to perform a variety of downstream tasks such as topic modeling, text classification, entity recognition, sentiment analysis, or information extraction. However, it has been shown that the quality of the domain-specific use of pre-trained models depends significantly on the training corpus itself. While optimized language models have already been developed for business and scientific domains, such as the financial [13], medical [14], or biological [15] sectors, this has not yet been the case for tourism. Therefore, the aim of this study was to optimize the most important and widely used language model to date, BERT, for tourism-specific applications. The applicability and performance of TourBERT in tourism contexts were demonstrated by means of five different evaluation tasks. TourBERT outperformed BERT-Base in all domain-specific tasks and thus represents a suitable language model for academia and the tourism industry. This study further contributes to the discussion of the importance of domain-specific language models from a theoretical perspective, while, from a methodological point of view, it provides detailed insights into the development and training of TourBERT. As a result, this study can also be seen as a guide on how to train and evaluate BERT models for other domains. The practical contribution lies in making TourBERT available to the open-source community: the model is hosted on the Hugging Face Model Hub and accessible via https://huggingface.co/veroman/TourBERT (accessed on 23 May 2022). TourBERT is thus freely accessible and ready to use for tourism-specific NLP tasks. Although an attempt was made to ensure that the training corpus was as multi-layered as possible and that the intercultural dimension, a very important aspect for tourism, was taken into account, an even larger training corpus would most likely lead to further performance gains. In particular, the inclusion of scientific texts would be useful in order to better analyze texts such as scientific books and papers in the context of tourism.

Author Contributions

Conceptualization, V.A. and R.E.; methodology, V.A. and R.E.; evaluation, V.A. and R.E.; writing, V.A. and R.E. All authors have contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This project was carried out without funding.

Data Availability Statement

We publicly release the TourBERT model which is available on Hugging Face Model Hub and is accessible through https://huggingface.co/veroman/TourBERT (accessed on 23 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Doolin, B.; Burgess, L.; Cooper, J. Evaluating the use of the Web for tourism marketing: A case study from New Zealand. Tour. Manag. 2002, 23, 557–561. [Google Scholar] [CrossRef]
  2. Yu, J.; Egger, R. Tourist Experiences at Overcrowded Attractions: A Text Analytics Approach. In Information and Communication Technologies in Tourism 2021; Springer: Cham, Switzerland, 2021; pp. 231–243. [Google Scholar]
  3. Daxböck, J.; Dulbecco, M.L.; Kursite, S.; Nilsen, T.K.; Rus, A.D.; Yu, J.; Egger, R. The Implicit and Explicit Motivations of Tourist Behaviour in Sharing Travel Photographs on Instagram: A Path and Cluster Analysis. In Information and Communication Technologies in Tourism 2021; Springer: Cham, Switzerland, 2021; pp. 244–255. [Google Scholar]
  4. Saraiva, J.P.D.P.M. Web 2.0 in restaurants: Insights regarding TripAdvisor’s use in Lisbon. Doctoral Dissertation, Universidade Católica Portuguesa, Lisboa, Portugal, 2013. [Google Scholar]
  5. Egger, R.; Gokce, E. Natural Language Processing: An Introduction. In Applied Data Science in Tourism. Interdisciplinary Approaches, Methodologies and Applications; Egger, R., Ed.; Springer: Berlin/Heidelberg, Germany, 2022; pp. 307–334. [Google Scholar]
  6. Wennker, P. Künstliche Intelligenz in der Praxis. In Anwendung in Unternehmen und Branchen: KI wettbewerbs- und zukunftsorientiert Einsetzen; Springer Gabler: Wiesbaden, Germany, 2020; Available online: https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=6326361 (accessed on 23 May 2022).
  7. Poon, A. Tourism, Technology and Competitive Strategies; CAB International: Wallingford, UK, 1993. [Google Scholar]
  8. Egger, R. Text Representations and Word Embeddings. Vectorizing Textual Data. In Applied Data Science in Tourism. Interdisciplinary Approaches, Methodologies and Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 335–361. [Google Scholar]
  9. Tenney, I.; Dipanjan, D.; Pavlick, E. BERT rediscovers the classical NLP pipeline. arXiv 2019, arXiv:1905.05950. [Google Scholar]
  10. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
  11. Edwards, A.; Camacho-Collados, J.; De Ribaupierre, H.; Preece, A. Go simple and pre-train on domain-specific corpora: On the role of training data for text classification. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 5522–5529. [Google Scholar]
  12. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964. [Google Scholar]
  13. Araci, D. Finbert: Financial sentiment analysis with pre-trained language models. arXiv 2019, arXiv:1908.10063. [Google Scholar]
  14. Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. arXiv 2019, arXiv:1904.03323. [Google Scholar]
  15. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Beltagy, I.; Lo, K.; Cohan, A. Scibert: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
  17. Garain, A. Hotel Reviews from around the World with Sentiment Values and Review Ratings in Different Categories for Natural Language Processing. IEEE Dataport. Available online: https://ieee-dataport.org/documents/hotel-reviews-around-world-sentiment-values-and-review-ratings-different-categories (accessed on 22 April 2020).
  18. Liu, J. 515K Hotel Reviews Data in Europe. 2019. Available online: https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe (accessed on 2 June 2021).
Figure 1. Area under ROC-Curve (AUC) scores for BERT-Base (a), TourBERT SentencePiece (b), and TourBERT WordPiece (c).
Figure 2. TensorBoard Projector for BERT-Base (contains two views as a result of symmetric axes rotation).
Figure 3. TensorBoard Projector for TourBERT (contains two views as a result of symmetric axes rotation).
Figure 4. Topic modeling results for BERT-Base (left) and TourBERT (right).
Figure 5. The first six topics, with their respective cluster words and top 10 most similar images, produced by the K-means model with TourBERT vectors.
Figure 6. The first six topics, with their respective cluster words and top 10 most similar images, produced by the K-means model with BERT-Base vectors.
Figure 7. Two examples of image clusters shown to the participants.
Table 1. Evaluation results for TourBERT and BERT-Base models for datasets from Tripadvisor.
Model | Validation Loss | Validation Accuracy | Test Accuracy | Test Precision | Test Recall | Test F1
BERT-Base | 0.4250 | 0.8190 | 0.81 | 0.66 | 0.4 | 0.42
TourBERT (WordPiece) | 0.3146 | 0.8708 | 0.86 | 0.7 | 0.65 | 0.68
TourBERT (SentencePiece) | 0.3166 | 0.8712 | 0.87 | 0.7 | 0.65 | 0.68
Table 2. Evaluation results for TourBERT and BERT-Base models for datasets from Europe hotels.
Model | Validation Loss | Validation Accuracy | Test Accuracy | Test AUC
BERT-Base | 0.2296 | 0.9218 | 0.9279 | 0.97
TourBERT (WordPiece) | 0.1371 | 0.9569 | 0.9633 | 0.99
TourBERT (SentencePiece) | 0.1329 | 0.9586 | 0.9626 | 0.99
Table 3. Topic words for 25 topics produced with BERT-Base vectors.
Topic | Words
0 | fashion, sleeve, shoulder, flash, flash photography, photography, street, street fashion, smile, hair, neck, eyewear, eyebrow, happy, sky
1 | shades, tints, tints shades, plant, black, sky, shirt, bicycle, photography, white, font, sleeve, wood, building, automotive
2 | sky, nature, water, landscape, plant, natural, cloud, people, tree, people nature, natural landscape, water sky, happy, cloud sky, azure
3 | automotive, vehicle, sky, tire, font, plant, landscape, design, wood, art, building, rectangle, cloud, water, lighting
4 | plant, natural, water, landscape, natural landscape, sky, ecoregion, cloud, tree, mountain, nature, cloud sky, highland, community, plant community
5 | water, landforms, sky, coastal, coastal oceanic, oceanic, oceanic landforms, landscape, cloud, natural, beach, water sky, natural landscape, azure, plant
6 | people, nature, sky, smile, people nature, sunglasses, flash, flash photography, photography, water, sleeve, care, vision, vision care, eyewear
7 | landscape, sky, plant, cloud, natural, natural landscape, water, tree, building, nature, cloud sky, mountain, vehicle, people, blue
8 | water, sky, cloud, landscape, plant, natural, natural landscape, resources, water resources, building, tree, mountain, cloud sky, water sky, nature
9 | landscape, plant, natural, sky, water, natural landscape, nature, cloud, tree, grass, people, people nature, cloud sky, sky plant, wood
10 | fashion, happy, sky, people, nature, photography, flash, flash photography, eyewear, smile, people nature, care, vision, vision care, plant
11 | plant, sky, water, natural, landscape, ecoregion, tree, natural landscape, cloud, photography, fashion, flash, flash photography, smile, happy
12 | plant, natural, landscape, water, natural landscape, sky, tree, dog, nature, grass, cloud, terrestrial, wood, people, landforms
13 | building, sky, plant, window, vehicle, facade, tree, wood, design, house, automotive, tire, cloud, road, city
14 | vehicle, automotive, sky, building, plant, tire, font, design, art, window, cloud, tree, wood, rectangle, lighting
15 | plant, shades, tints, tints shades, sky, wood, black, fashion, bicycle, photography, rectangle, people, white, building, font
16 | plant, water, natural, sky, landscape, natural landscape, cloud, ecoregion, mountain, tree, cloud sky, community, plant community, resources, water resources
17 | landscape, plant, water, sky, natural, natural landscape, shades, tints, tints shades, tree, cloud, landforms, wood, coastal, coastal oceanic
18 | fashion, sleeve, flash, flash photography, photography, street, street fashion, lip, shoulder, eyelash, eyebrow, smile, hairstyle, sky, neck
19 | water, sky, equipment, cloud, equipment supplies, supplies, boating, boating equipment, boats, boats boating, landforms, boat, watercraft, coastal, coastal oceanic
20 | water, landscape, natural, plant, sky, cloud, natural landscape, mountain, tree, nature, cloud sky, azure, highland, resources, water resources
21 | plant, water, sky, nature, landscape, natural, cloud, tree, people, natural landscape, people nature, grass, cloud sky, mountain, building
22 | sky, plant, cloud, water, landscape, building, natural, tree, natural landscape, mountain, cloud sky, window, nature, travel, road
23 | plant, natural, sky, landscape, water, natural landscape, tree, cloud, nature, terrestrial, terrestrial plant, flower, grass, petal, wood
24 | food, sky, cuisine, ingredient, recipe, tableware, dish, food tableware, ingredient recipe, water, tableware ingredient, staple, staple food, plate, produce
Table 4. Topic words for 25 topics produced with TourBERT vectors.
Topic | Words
0 | plant, sky, tree, building, road, landscape, wood, cloud, road surface, surface, grass, window, sky plant, leisure, water
1 | diving, underwater, water, fluid, marine, equipment, biology, marine biology, organism, fish, water underwater, liquid, diving equipment, underwater diving, blue
2 | beach, people, water, sky, people beach, cloud, nature, people nature, water sky, azure, happy, travel, beach people, coastal, coastal oceanic
3 | landscape, mountain, natural, sky, cloud, natural landscape, plant, slope, tree, cloud sky, highland, snow, sky mountain, terrain, sky plant
4 | font, art, arts, event, rectangle, brand, design, pattern, graphics, photography, happy, painting, magenta, logo, visual
5 | building, sky, window, facade, tower, design, urban, city, cloud, urban design, plant, sky building, road, house, building window
6 | water, sky, afterglow, cloud, dusk, atmosphere, landscape, natural, natural landscape, sky atmosphere, cloud sky, sunlight, sunset, water sky, tree
7 | tableware, drinkware, table, bottle, cup, dishware, food, glass, wood, plant, furniture, device, stemware, kitchen, wine
8 | people, nature, sky, people nature, flash, flash photography, photography, happy, water, smile, plant, cloud, leg, gesture, tree
9 | water, sky, equipment, boat, watercraft, cloud, vehicle, lake, supplies, boating, boating equipment, boats, boats boating, equipment supplies, water sky
10 | care, vision, vision care, sunglasses, sleeve, eyewear, goggles, glasses, sky, dress, fashion, smile, shirt, flash, flash photography
11 | automotive, vehicle, tire, bicycle, wheel, motor, motor vehicle, automotive tire, vehicle automotive, sky, lighting, automotive lighting, car, plant, tire wheel
12 | plant, landscape, natural, natural landscape, sky, tree, nature, grass, community, plant community, cloud, people, people nature, water, sky plant
13 | sky, water, cloud, landscape, natural, atmosphere, cloud sky, blue, natural landscape, azure, plant, nature, tree, horizon, sunlight
14 | water, natural, landscape, sky, natural landscape, cloud, plant, nature, mountain, resources, water resources, ecoregion, tree, cloud sky, water sky
15 | temple, sky, building, architecture, plant, facade, city, cloud, art, travel, tree, leisure, sculpture, world, monument
16 | nature, plant, people nature, people, sky, happy, tree, landscape, cloud, natural, water, grass, natural landscape, travel, leisure
17 | wood, design, building, rectangle, interior, interior design, window, shades, tints, tints shades, property, font, furniture, flooring, plant
18 | food, cuisine, ingredient, tableware, recipe, dish, food tableware, ingredient recipe, produce, staple, staple food, cuisine dish, tableware ingredient, plate, cake
19 | fashion, street, street fashion, sleeve, eyewear, flash, flash photography, photography, shirt, happy, waist, smile, dress, design, shoe
20 | lip, eyebrow, eyelash, smile, hair, chin, shoulder, skin, nose, forehead, hairstyle, neck, eye, lip chin, facial
21 | plant, flower, tree, terrestrial, twig, landscape, terrestrial plant, natural, petal, natural landscape, branch, grass, wood, sky, flowering
22 | water, natural, plant, landscape, landforms, natural landscape, fluvial, fluvial landforms, landforms streams, streams, resources, water resources, sky, watercourse, water water
23 | water, landscape, landforms, natural, sky, coastal, coastal oceanic, oceanic, oceanic landforms, cloud, natural landscape, water sky, azure, resources, water resources
24 | dog, plant, animal, carnivore, breed, dog breed, fawn, sky, terrestrial, working, working animal, companion, companion dog, collar, grass
Table 5. Results of the paired t-test for samples mean comparison for TourBERT and BERT-Base models.
Paired Samples Statistics
Pair 1 | Mean | N | Std. Deviation | Std. Error Mean
BERT | 3.7759 | 82 | 0.71655 | 0.07913
TourBERT | 2.5239 | 82 | 0.61724 | 0.06816

Paired Samples Test
Pair 1 | Mean | SD | Std. Error Mean | t | df | Sig. (2-tailed)
BERT - TourBERT | 1.252 | 0.51773 | 0.0571 | 21.898 | 81 | 0.000

Paired Samples Effect Sizes
Pair 1 (BERT - TourBERT) | Standardizer | Point Estimate | 95% CI Lower | 95% CI Upper
Cohen's d | 0.51773 | 2.418 | 1.986 | 2.846
Hedges' correction | 0.52015 | 2.407 | 1.977 | 2.833
Table 6. Synonyms search with BERT-Base.
Authenticity | Experience | Entrance | Attraction | Ticket | Destination | Guide | Transfer | Sightseeing | Service
legitimacy | teach | shelter | attractions | tickets | dying | companion | recovery | trees | vessel
sincerity | heal | entrances | restaurant | fare | choice | entry | exchange | fireworks | authority
competence | communicate | archway | hotel | fares | lame | visit | imaging | shops | headquarters
authorship | consume | gate | exhibit | card | address | database | restoring | pacing | facility
flexibility | learn | roof | pavilion | trains | exit | forum | sale | comedy | workshop
integrity | eat | causeway | nightclub | bus | partner | workshop | comparison | prostitutes | circulation
conscience | consider | tenants | mall | metro | correction | access | recovering | sidewalk | companion
characterization | experiences | exit | ballroom | freight | priorities | google | screening | nights | operation
Table 7. Synonyms search with TourBERT.
Authenticity | Experience | Entrance | Attraction | Ticket | Destination | Guide | Transfer | Sightseeing | Service
uniqueness | experince | entry | destination | tickets | spot | ##guide | transfers | exploring | sevice
ambience | expereince | enterance | feature | entry | attraction | guides | transport | sights | services
originality | experiance | admittance | landmark | entrance | place | tourguide | pickup | attractions | staff
intimacy | adventure | admission | place | wristband | point | guid | transportation | exploration | personnel
charm | experiences | ticket | institution | admission | itinerary | driver | journey | nightlife | hospitality
accuracy | enjoyment | fee | museum | fee | hotspot | interpreter | limousine | hiking | personel
flare | opportunity | carpark | spot | pass | venture | guiding | shuttle | outings | frontdesk
warmth | expere | payment | site | tix | hangout | narrator | pickups | excursions | housekeeping
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Arefeva, V.; Egger, R. When BERT Started Traveling: TourBERT—A Natural Language Processing Model for the Travel Industry. Digital 2022, 2, 546-559. https://doi.org/10.3390/digital2040030

