A Text Segmentation Approach for Automated Annotation of Online Customer Reviews, Based on Topic Modeling
Abstract
1. Introduction and Literature Survey
1.1. Background
- TopicDiff-LDA, an original text segmentation (TS) method, has been developed and experimentally evaluated on a benchmark dataset. Its performance was better () than that of other popular unsupervised TS algorithms tested on the same data.
- A new TS-enhanced topic modeling approach to automated multiple-label text annotation has been proposed and evaluated in classifier model development experiments. Overall, it performed better than a state-of-the-art semi-supervised annotation method based on the Multilabel Topic Model (MLTM).
1.2. Related Work
2. Segmentation-enhanced Topic Modeling for Automated Text Annotation
2.1. Overview
2.2. Text Segmentation with TopicDiff-LDA
Algorithm 1: TopicDiff-LDA
Input: documents, minimum segment size (), LDA topic probability distribution
Output: document segments ()
Objective function: , subject to

Initialize
= Search() // call Search function
store the best solution:

function Search() // returns
    initialize
    (Segmentation()) // call Segmentation function
    for () do

        Segmentation()) // call Segmentation function
        if () then


        update // determined by the specific optimization algorithm deployed
    return ()

function Segmentation() // returns segmented documents
    initialize array
    for each do
        = Sentence_tokenize() // split document into sentences
        = TS() // call TS function
    return ()

function TS(, ) // returns segmented text
    = Length // number of sentences
    for () do

        if then
            // define
            // define
            = LDA_infer() // compute the topic probability distribution for
            = LDA_infer() // compute the topic probability distribution for
            = // Manhattan distance
            if then
                append to
                = TS()
                break
        else
            append to
            break
    return ()
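The recursive TS step of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the symbols of the listing are not recoverable here, so the sketch assumes that a boundary is placed at the first sentence position where the Manhattan distance between the topic distributions of the left and right blocks exceeds a threshold, and that both blocks must hold at least `min_size` sentences. The names `segment`, `manhattan`, `toy_infer`, `min_size`, and `threshold` are hypothetical, and `toy_infer` is a keyword-counting stand-in for real LDA inference. The outer Search loop that tunes the parameters against the objective function is not covered.

```python
from typing import Callable, List, Sequence

TopicDist = Sequence[float]


def manhattan(p: TopicDist, q: TopicDist) -> float:
    """Manhattan (L1) distance between two topic probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def segment(sentences: List[str],
            infer: Callable[[List[str]], TopicDist],
            min_size: int,
            threshold: float) -> List[List[str]]:
    """Split at the first boundary whose two sides differ by more than
    `threshold`, then recurse on the right-hand remainder (assumed scheme)."""
    n = len(sentences)
    # Candidate boundaries keep at least `min_size` sentences on each side.
    for i in range(min_size, n - min_size + 1):
        left, right = sentences[:i], sentences[i:]
        if manhattan(infer(left), infer(right)) > threshold:
            return [left] + segment(right, infer, min_size, threshold)
    # No boundary found: the remaining text stays one segment.
    return [sentences]


def toy_infer(block: List[str]) -> TopicDist:
    """Hypothetical stand-in for LDA inference: two keyword-count 'topics'."""
    text = " ".join(block).lower()
    beach = text.count("beach") + text.count("sea")
    temple = text.count("temple") + text.count("statue")
    total = beach + temple
    if total == 0:
        return (0.5, 0.5)
    return (beach / total, temple / total)


doc = [
    "The beach is wide.",
    "The sea is calm.",
    "The temple is old.",
    "A statue guards the temple.",
]
segments = segment(doc, toy_infer, min_size=2, threshold=1.0)
```

With a trained topic model, `toy_infer` would be replaced by a function that returns the inferred topic probability vector for a block of sentences; the rest of the sketch is independent of how that vector is obtained.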
3. Text Segmentation Experiments
3.1. Data
3.2. Experiments
4. Case Study
4.1. Training Data and The Knowledge Model Used
- TOPIC 1. Historical Sites: museum, building, build, house, palace, dutch, mosque, time, collection, old.
- TOPIC 2. Protected Area: forest, park, animal, national_park, species, bird, include, plant, conservation, type.
- TOPIC 3. Natural Place: cave, river, location, district, road, hill, reach, tree, tourism, meter.
- TOPIC 4. Temple: statue, build, stone, side, meter, wall, king, find, roof, base.
- TOPIC 5. Mountain: mountain, mount, crater, sea_level, hill, high, peak, regency, view, scenery.
- TOPIC 6. Beach: sea, fish, wave, boat, coast, small, beauty, white_sand, reach, sand.
- TOPIC 7. General Information: park, travel, tour, want, facility, get, provide, activity, good, offer.
- TOPIC 8. Things to Buy: market, batik, food, tourism, product, traditional, plantation, fruit, sell, find.
- TOPIC 9. Cultural Heritage: traditional, dance, name, come, hold, become, call, ceremony, day, culture.
“Watu Dodol Tourism Object in Banyuwangi is located in Kalipuro district, Banyuwangi regency. The location is on Bypass Banyuwangi to Situbondo. The distance from Banyuwangi to Watudodol is 14 km, and from Ketapang port is only 5 km. Watudodol beach usually is full of local tourists for weekends or holidays. The visitors can enjoy the panoramic ocean or stroll to the hill located across the road. From the top of the hill, a beautiful panorama of the Bali strait can be seen. Culinary activities are another interesting thing to do here. Souvenirs made of shells and also stones are on sale in small shops. Arriving at Watudodol from the north route, the Gandrung statue welcomes visitors. This statue is the icon of Banyuwangi; Gandrung is a traditional dance from this city. Located close to Gandrung Statue, there is a big rock that looks like dodol (food made of fruits); probably because of this, the area is called Watudodol. Watu is a Javanese word for rock or stones. There was a mystical story about this rock. The Japanese occupied this area during World War 2, and the Japanese considered this rock distracting their activities. They tried to remove the rock by ordering men to cut the stones, but it did not work. The Japanese then decided to pull it with a boat, and still, it did not work; instead, the boat was drawn. Balinese and also truck drivers are said to put offerings on the rock until today.”
4.2. Segmentation
4.3. Reliability Assessment
4.4. Classifier Model Selection
4.5. Automatic Classification
5. Discussion
5.1. Segmentation Experiments
5.2. Labeling
5.3. Automatic Classification
“The Base G Beach (a former American WW2 base) is located about 10 km west of the city of Jayapura, Papua. The beach is beautiful and from here you can look at the Pacific Ocean which is the gateway for ships sailing by from the west. The Base G beach is quiet and still very natural and clean. The water is clear and the beach is made of white sand. The water is so clear you can see clearly through the underwater scenery. Besides enjoying the scenery you can also go swimming, fishing, diving or rent a boat and sail around a bit. Local residents have built some benches and cabins to chill and hang out if you get tired of the sun. There are also several types of trees providing shade. While in Jayapura be sure to visit Base G Beach as it never hurts to spend a day and enjoy this beautiful beach.”
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Sharma, D.; Kumar, B.; Chand, S. A survey on journey of topic modeling techniques from SVD to deep learning. Int. J. Mod. Educ. Comput. Sci. 2017, 9, 50.
2. Chauhan, U.; Shah, A. Topic Modeling Using Latent Dirichlet allocation: A Survey. ACM Comput. Surv. 2021, 54, 1–35.
3. Cao, L.; Fei-Fei, L. Spatially Coherent Latent Topic Model for Concurrent Segmentation and Classification of Objects and Scenes. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–21 October 2007; IEEE: Rio de Janeiro, Brazil, 2007; pp. 1–8.
4. Kaviani, R.; Ahmadi, P.I.; Gholampour, I. Automatic Accident Detection Using Topic Models. In Proceedings of the 23rd Iranian Conference on Electrical Engineering (ICEE 2015), Tehran, Iran, 10–14 May 2015; IEEE: Tehran, Iran, 2015; Volume 10, pp. 444–449.
5. Kim, S.; Narayanan, S.; Sundaram, S. Acoustic Topic Model for Audio Information Retrieval. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 18–21 October 2009; IEEE: New Paltz, NY, USA, 2009; pp. 37–40.
6. Emonet, R.; Varadarajan, J.; Odobez, J.M. Temporal Analysis of Motif Mixtures Using Dirichlet Processes. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 140–156.
7. Gallinucci, E.; Golfarelli, M.; Rizzi, S. Advanced topic modeling for social business intelligence. Inf. Syst. 2015, 53, 87–106.
8. Qiang, J.; Qian, Z.; Li, Y.; Yuan, Y.; Wu, X. Short Text Topic Modeling Techniques, Applications, and Performance: A Survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 1427–1445.
9. Wan, C.; Peng, Y.; Xiao, K.; Liu, X.; Jiang, T.; Liu, D. An Association-Constrained LDA Model for Joint Extraction of Product Aspects and Opinions. Inf. Sci. 2020, 519, 243–259.
10. Kovacs, M.; Kryssanov, V.V. A Semi-automatic Approach for Requirement Discovery in the E-commerce Industry. Int. J. Knowl. Eng. 2018, 4, 68–71.
11. Niu, Y.; Zhang, H.; Li, J. A Nested Chinese Restaurant Topic Model for Short Texts with Document Embeddings. Appl. Sci. 2021, 11, 8708.
12. Chen, Y.S.; Chen, L.H.; Takama, Y. Proposal of LDA-Based Sentiment Visualization of Hotel Reviews. In Proceedings of the 15th IEEE International Conference on Data Mining Workshop, ICDMW 2015, Atlantic City, NJ, USA, 14–17 November 2015; IEEE: Atlantic City, NJ, USA, 2015; pp. 687–693.
13. Li, J.; Xu, L.; Tang, L.; Wang, S.; Li, L. Big Data in Tourism Research: A Literature Review. Tour. Manag. 2018, 68, 301–323.
14. Li, Q.; Li, S.; Zhang, S.; Hu, J.; Hu, J. A Review of Text Corpus-Based Tourism Big Data Mining. Appl. Sci. 2019, 9, 3300.
15. Liao, X.; Zhao, Z. Unsupervised Approaches for Textual Semantic Annotation, A Survey. ACM Comput. Surv. 2019, 52, 1–45.
16. Nassar, L.; Karray, F. Overview of the crowdsourcing process. Knowl. Inf. Syst. 2019, 60, 1–24.
17. Canito, A.; Marreiros, G.; Corchado, J.M. Automatic Document Annotation with Data Mining Algorithms. Adv. Intell. Syst. Comput. 2019, 930, 68–76.
18. Olaode, A.; Naghdy, G. Review of the application of machine learning to the automatic semantic annotation of images. IET Image Process. 2019, 13, 1232–1245.
19. Asghari, M.; Sierra-Sosa, D.; Elmaghraby, A.S. A topic modeling framework for spatio-temporal information management. Inf. Process. Manag. 2020, 57, 102340.
20. Vavliakis, K.N.; Symeonidis, A.L.; Mitkas, P.A. Event Identification in Web Social Media through Named Entity Recognition and Topic Modeling. Data Knowl. Eng. 2013, 88, 1–24.
21. Tuarob, S.; Pouchard, L.C.; Mitra, P.; Giles, C.L. A generalized topic modeling approach for automatic document annotation. Int. J. Digit. Libr. 2015, 16, 111–128.
22. Amoualian, H.; Lu, W.; Gaussier, E.; Balikas, G.; Amini, M.-R.; Clausel, M. Topical Coherence in LDA-Based Models through Induced Segmentation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1799–1809.
23. Hananto, V.R.; Kim, S.; Kovacs, M.; Serdült, U.; Kryssanov, V. A Machine Learning Approach to Analyze Fashion Styles from Large Collections of Online Customer Reviews. In Proceedings of the 6th International Conference on Business and Industrial Research (ICBIR 2021), Bangkok, Thailand, 20–21 May 2021; IEEE: Bangkok, Thailand, 2021; pp. 153–158.
24. Tagarelli, A.; Karypis, G. A Segment-Based Approach to Clustering Multi-Topic Documents. Knowl. Inf. Syst. 2013, 34, 563–595.
25. Manchanda, S.; Karypis, G. Text Segmentation on Multilabel Documents: A Distant-Supervised Approach. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 1170–1175.
26. Pak, I.; Teh, P.L. Text Segmentation Techniques: A Critical Review. In Innovative Computing, Optimization and Its Applications: Modelling and Simulations; Zelinka, I., Vasant, P., Duy, V.H., Dao, T.T., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 167–181.
27. Zhang, M.; Zhou, Z. A Review on Multi-Label Learning Algorithms. IEEE Trans. Knowl. Data Eng. 2014, 26, 1819–1837.
28. Rubin, T.N.; Chambers, A.; Smyth, P.; Steyvers, M. Statistical topic models for multi-label document classification. Mach. Learn. 2012, 88, 157–208.
29. Soleimani, H.; Miller, D.J. Semisupervised, Multilabel, Multi-Instance Learning for Structured Data. Neural Comput. 2017, 29, 1053–1102.
30. Zha, D.; Li, C. Multi-label dataless text classification with topic modeling. Knowl. Inf. Syst. 2019, 61, 137–160.
31. Santos, J.S.; Bernardini, F.; Paes, A. Measuring the degree of divergence when labeling tweets in the electoral scenario. In Proceedings of the Anais do X Brazilian Workshop on Social Network Analysis and Mining, Virtual Event, 18–23 July 2021; SBC: Porto Alegre, RS, Brasil; pp. 127–138.
32. Wang, W.; Guo, B.; Shen, Y.; Yang, H.; Chen, Y.; Suo, X. Robust supervised topic models under label noise. Mach. Learn. 2021, 110, 907–931.
33. Takanobu, R.; Huang, M.; Zhao, Z.; Li, F.L.; Chen, H.; Zhu, X.; Nie, L. A Weakly Supervised Method for Topic Segmentation and Labeling in Goal-oriented Dialogues via Reinforcement Learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence IJCAI, Stockholm, Sweden, 13–19 July 2018; pp. 4403–4410.
34. Shafqat, W. A Hybrid Approach for Topic Discovery and Recommendations Based on Topic Modeling and Deep Learning. Ph.D. Thesis, Jeju National University, Jeju City, Korea, 2020. Available online: http://oak.jejunu.ac.kr/handle/2020.oak/23245 (accessed on 19 November 2021).
35. Meng, Y.; Huang, J.; Wang, G.; Wang, Z.; Zhang, C.; Zhang, Y.; Han, J. Discriminative Topic Mining via Category-Name Guided Text Embedding. In Proceedings of the Web Conference 2020 (WWW ’20), Taipei, Taiwan, 20–24 April 2020; ACM: New York, NY, USA, 2020; pp. 2121–2132.
36. Hearst, M.A. TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages. Comput. Linguist. 1997, 23, 33–64.
37. Lu, Q.; Keenan, W.; Conrad, J.G.; Al-Kofahi, K. Legal Document Clustering with Built-in Topic Segmentation. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, Scotland, 24–28 October 2011; ACM: New York, NY, USA, 2011; pp. 383–392.
38. Li, W.; Matsukawa, T.; Saigo, H.; Suzuki, E. Context-Aware Latent Dirichlet Allocation for Topic Segmentation. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Singapore, 2020; Volume 12084, pp. 475–486.
39. Koshorek, O.; Cohen, A.; Mor, N.; Rotman, M.; Berant, J. Text Segmentation as a Supervised Learning Task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: New Orleans, LA, USA, 2018; Volume 2, pp. 469–473.
40. Neysiani, B.S.; Morteza Babamir, S. New Methodology for Contextual Features Usage in Duplicate Bug Reports Detection: Dimension Expansion Based on Manhattan Distance Similarity of Topics. In Proceedings of the 2019 5th International Conference on Web Research, ICWR 2019, Tehran, Iran, 24–25 April 2019; IEEE: Tehran, Iran, 2019; pp. 178–183.
41. Röder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM ’15), Shanghai, China, 2–6 February 2015; ACM: New York, NY, USA, 2015; pp. 399–408.
42. Syed, S.; Spruit, M. Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, 19–21 October 2017; IEEE: Tokyo, Japan, 2017; pp. 165–174.
43. Choi, F.Y.Y. Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, Seattle, WA, USA, 29 April–4 May 2000; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2000; pp. 26–33.
44. Griffiths, T.L.; Steyvers, M. Finding Scientific Topics. Proc. Natl. Acad. Sci. USA 2004, 101, 5228–5235.
45. Beeferman, D.; Berger, A.; Lafferty, J. Statistical Models for Text Segmentation. Mach. Learn. 1999, 34, 177–210.
46. Utiyama, M.; Isahara, H. A Statistical Model for Domain-Independent Text Segmentation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (ACL ’01), Toulouse, France, 6–11 July 2001; Association for Computational Linguistics (ACL): Toulouse, France, 2001; pp. 499–506.
47. Misra, H.; Yvon, F.; Cappé, O.; Jose, J. Text Segmentation: A Topic Modeling Perspective. Inf. Process. Manag. 2011, 47, 528–544.
48. Du, L.; Buntine, W.; Johnson, M. Topic segmentation with a structured topic model. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 190–200.
49. Glavaš, G.; Nanni, F.; Ponzetto, S.P. Unsupervised Text Segmentation Using Semantic Relatedness Graphs. In Proceedings of the *SEM 2016: The Fifth Joint Conference on Lexical and Computational Semantics, Berlin, Germany, 11–12 August 2016; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 125–130.
50. Li, J.; Sun, A.; Joty, S. SEGBOT: A generic neural text segmentation model with pointer network. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18), Stockholm, Sweden, 13–19 July 2018; AAAI Press: Stockholm, Sweden, 2018; pp. 4166–4172.
51. Hananto, V.R.; Serdült, U.; Kryssanov, V.V. A Tourism Knowledge Model through Topic Modeling from Online Reviews. In Proceedings of the 2021 7th International Conference on Computing and Data Engineering (ICCDE 2021), Phuket, Thailand, 15–17 January 2021; ACM: New York, NY, USA, 2021; pp. 87–93.
52. Rosenberg, A.; Binkowski, E. Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In Proceedings of the HLT-NAACL 2004: Short Papers (HLT-NAACL-Short ’04), Boston, MA, USA, 2–7 May 2004; ACM: Boston, MA, USA, 2004; pp. 77–80.
53. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog. 1997, 30, 1145–1159.
54. Alsmadi, I.; Gan, K.H. Review of short-text classification. Int. J. Web Inf. Syst. 2019, 15, 155–182.
55. Szymanski, P.; Kajdanowicz, T. Scikit-multilearn: A scikit-based Python environment for performing multi-label classification. J. Mach. Learn. Res. 2019, 20, 209–230.
56. Pevzner, L.; Hearst, M.A. A critique and improvement of an evaluation metric for text segmentation. Comput. Ling. 2002, 28, 19–36.
57. Mariani, M.M.; Borghi, M.; Gretzel, U. Online reviews: Differences by submission device. Tour. Manag. 2019, 70, 295–298.
58. Artstein, R.; Poesio, M. Inter-Coder Agreement for Computational Linguistics. Comp. Linguist. 2008, 34, 555–596.
59. Bobicev, V.; Sokolova, M. Inter-Annotator Agreement in Sentiment Analysis: Machine Learning Perspective. In Proceedings of the Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, 4–6 September 2017; INCOMA Ltd.: Varna, Bulgaria, 2017; pp. 97–102.
60. Melzi, S.; Abdaoui, A.; Azé, J.; Bringay, S.; Poncelet, P.; Galtier, F. Patient’s rationale: Patient Knowledge retrieval from health forums. In Proceedings of the eTelemed 2014: Sixth Conference on eHealth, Telemedicine and Social Medicine, Barcelona, Spain, 23–27 March 2014; pp. 140–145.
61. Bekkar, M.; Djemaa, H.K.; Alitouche, T.A. Evaluation measures for models assessment over imbalanced data sets. J. Inf. Eng. Appl. 2013, 3, 27–38.
62. Jiang, C.; Liu, Y.; Ding, Y.; Liang, K.; Duan, R. Capturing Helpful Reviews From Social Media for Product Quality Improvement: A Multi-Class Classification Approach. Int. J. Prod. Res. 2017, 55, 3528–3541.
63. Parvin, T.; Hoque, M.M. An Ensemble Technique to Classify Multi-Class Textual Emotion. Procedia Com. Sci. 2021, 193, 72–81.
64. Polpinij, J.; Luaphol, B. Comparing of Multi-class Text Classification Methods for Automatic Ratings of Consumer Reviews. In Multi-Disciplinary Trends in Artificial Intelligence, Proceedings of the MIWAI 2021, Virtual Event, 2–3 July 2021; Chomphuwiset, P., Kim, J., Pawara, P., Eds.; Springer: Cham, Switzerland, 2021; Volume 12832, pp. 164–175.
65. Jähnichen, P.; Wenzel, F.; Kloft, M.; Mandt, S. Scalable generalized dynamic topic models. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Lanzarote, Spain, 9–11 April 2018; PMLR: Brookline, MA, USA, 2018; Volume 84, pp. 1427–1435.
66. Tomasi, F.; Chandar, P.; Levy-Fix, G.; Lalmas-Roelleke, M.; Dai, Z. Stochastic Variational Inference for Dynamic Correlated Topic Models. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), Virtual Event, 3–6 August 2020; PMLR: Brookline, MA, USA, 2020; Volume 124, pp. 859–868.
Number of documents | 700 |
Number of unique tokens | 5210 |
Average document length | 83 (1910 tokens) |
Number of segments per document | 10 |
Segment length (number of documents) | 3–11 (400), 3–5 (100), 6–8 (100), 9–11 (100) |
Algorithm | Reference | |
---|---|---|
C99 | [43] | 0.105 |
U00 | [46] | 0.078 |
M09 | [47] | 0.027 * |
TSM | [48] | 0.009 ** |
GraphSeg | [49] | 0.066 |
SegBot | [50] | 0.003 *** |
TopicDiff-LDA | | 0.029
Number of documents | 2685 (filtered from 2807 crawled) |
Number of unique tokens | 31,781 |
Average document length, sentences | 12 (278 tokens) |
Topic | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Prob. | 0.002 | 0.001 | 0.252 | 0.129 | 0.001 | 0.318 | 0.013 | 0.150 | 0.130 |
Segment | Text | Label (Topic No.) |
---|---|---|
1 | “Watu Dodol Tourism Object in Banyuwangi is located in Kalipuro district, Banyuwangi regency. The location is on Bypass Banyuwangi to Situbondo. The distance from Banyuwangi to Watudodol is 14 km, and from Ketapang port is only 5 km. Watudodol beach usually is full of local tourists for weekends or holidays. The visitors can enjoy the panoramic ocean or stroll to the hill located across the road. From the top of the hill, a beautiful panorama of the Bali strait can be seen.” | Natural Place (3) |
2 | “Culinary activities are another interesting thing to do here. Souvenirs made of shells and also stones are on sale in small shops. Arriving at Watudodol from the north route, the Gandrung statue welcomes visitors. This statue is the icon of Banyuwangi; Gandrung is a traditional dance from this city. Located close to Gandrung Statue, there is a big rock that looks like dodol (food made of fruits); probably because of this, the area is called Watudodol. Watu is a Javanese word for rock or stones.” | Things to Buy (8) |
3 | “There was a mystical story about this rock. The Japanese occupied this area during World War 2, and the Japanese considered this rock distracting their activities. They tried to remove the rock by ordering men to cut the stones, but it did not work. The Japanese then decided to pull it with a boat, and still, it did not work; instead, the boat was drawn. Balinese and also truck drivers are said to put offerings on the rock until today.” | Historical Sites (1) |
Segment | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 | Topic 9 |
---|---|---|---|---|---|---|---|---|---|
1 | 0.006 | 0.033 | 0.539 | 0.033 | 0.005 | 0.327 | 0.009 | 0.036 | 0.007 |
2 | 0.006 | 0.004 | 0.146 | 0.107 | 0.004 | 0.241 | 0.008 | 0.344 | 0.136 |
3 | 0.334 | 0.004 | 0.209 | 0.004 | 0.005 | 0.008 | 0.282 | 0.007 | 0.144 |
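The per-segment labels listed for the Watu Dodol review are consistent with taking the most probable topic of each segment. The short Python sketch below reproduces that reading from the tabulated probabilities. It assumes, rather than confirms, that labels are assigned by a simple argmax and that the single-row topic table describes the same review before segmentation; `top_topic` and the variable names are hypothetical.

```python
# Topic names as listed in Section 4.1 (topic number -> label).
TOPIC_NAMES = {
    1: "Historical Sites", 2: "Protected Area", 3: "Natural Place",
    4: "Temple", 5: "Mountain", 6: "Beach", 7: "General Information",
    8: "Things to Buy", 9: "Cultural Heritage",
}

# Per-segment topic probabilities copied from the table (topics 1..9).
segment_probs = {
    1: [0.006, 0.033, 0.539, 0.033, 0.005, 0.327, 0.009, 0.036, 0.007],
    2: [0.006, 0.004, 0.146, 0.107, 0.004, 0.241, 0.008, 0.344, 0.136],
    3: [0.334, 0.004, 0.209, 0.004, 0.005, 0.008, 0.282, 0.007, 0.144],
}

# Topic probabilities of the unsegmented review (assumed to be the same text).
document_probs = [0.002, 0.001, 0.252, 0.129, 0.001, 0.318, 0.013, 0.150, 0.130]


def top_topic(probs):
    """Return the 1-based number of the most probable topic (argmax)."""
    return max(range(len(probs)), key=probs.__getitem__) + 1


# One label per segment, then the union as the multi-label annotation.
segment_labels = {seg: top_topic(p) for seg, p in segment_probs.items()}
document_labels = sorted(set(segment_labels.values()))
whole_document_label = top_topic(document_probs)

for seg, topic in segment_labels.items():
    print(f"Segment {seg}: {TOPIC_NAMES[topic]} ({topic})")
print("Whole document:", TOPIC_NAMES[whole_document_label])
```

Under these assumptions the segments yield three distinct labels (topics 3, 8, and 1), whereas the unsegmented text would receive only its single top topic (topic 6), which illustrates why segmentation can help with multiple-label annotation.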
Class | Average Review Size, Sentences | No. of Distinct Labels Assigned (Avg Per Review) | |
---|---|---|---|
1 | 10 | 24 | 26 (1.35) |
2 | 11 | 13 | 15 (1.53) |
3 | 13 | 55 | 65 (1.48) |
4 | 9 | 18 | 21 (1.29) |
5 | 11 | 19 | 21 (1.48) |
6 | 14 | 67 | 73 (1.42) |
7 | 13 | 43 | 57 (1.46) |
8 | 17 | 13 | 15 (1.40) |
9 | 13 | 17 | 25 (1.68) |
Total labels: | | 269 | 318 (1.18) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hananto, V.R.; Serdült, U.; Kryssanov, V. A Text Segmentation Approach for Automated Annotation of Online Customer Reviews, Based on Topic Modeling. Appl. Sci. 2022, 12, 3412. https://doi.org/10.3390/app12073412