Text-Mining-Based Non-Face-to-Face Counseling Data Classification and Management System

Park, Woncheol; Oh, Seungmin; Park, Seonghyun

doi:10.3390/app142210747


by Woncheol Park, Seungmin Oh * and Seonghyun Park *

Department of Computer Engineering, Kongju National University, Cheonan 31080, Republic of Korea

* Authors to whom correspondence should be addressed.

Appl. Sci. 2024, 14(22), 10747; https://doi.org/10.3390/app142210747

Submission received: 20 October 2024 / Revised: 12 November 2024 / Accepted: 18 November 2024 / Published: 20 November 2024


Abstract:

This study proposes a system for analyzing non-face-to-face counseling data using text-mining techniques to assess psychological states and automatically classify them into predefined categories. The system addresses the challenge of understanding internal issues that may be difficult to express in traditional face-to-face counseling. To solve this problem, a counseling management system based on text mining was developed. In the experiment, we combined TF-IDF and Word Embedding techniques to process and classify client counseling data into five major categories: school, friends, personality, appearance, and family. The classification performance achieved high accuracy and F1-Score, demonstrating the system’s effectiveness in understanding and categorizing clients’ emotions and psychological states. This system offers a structured approach to analyzing counseling data, providing counselors with a foundation for recommending personalized counseling treatments. The findings of this study suggest that in-depth analysis and classification of counseling data can enhance the quality of counseling, even in non-face-to-face environments, offering more efficient and tailored solutions.

Keywords:

non-face-to-face counseling; counseling data; text mining; TF-IDF (Term Frequency–Inverse Document Frequency); Word Embedding

1. Introduction

Counseling has become an essential area in modern society. Particularly after the pandemic, the development of non-face-to-face communication methods and digital platforms has led to a rapid increase in the demand for online counseling. Many individuals experiencing psychological problems or emotional difficulties prefer non-face-to-face counseling, which offers greater anonymity and comfort than traditional face-to-face methods. However, people cannot be expected to describe their emotions and situations with the precision of a machine. While some individuals may clearly articulate their emotions, many struggle to convey their thoughts and feelings properly. This difficulty often leads to a lack of self-awareness regarding their emotional state. Non-face-to-face counseling also presents limitations for counselors. The inability to observe non-verbal cues, such as facial expressions or tone of voice, makes it challenging to fully comprehend the client’s emotional and psychological state. Consequently, this can hinder the counselor’s ability to understand the client’s inner self and accurately assess their feelings. A lack of counselor expertise is another issue often highlighted. Clients may receive limited information when consulting with counselors who lack extensive knowledge or experience in counseling [1]. Volunteer counselors, in particular, may struggle to provide effective counseling, as many have undergone only short-term training. This lack of professionalism often leads to ineffective counseling outcomes. In fact, only 9.3% of volunteer counselors continue to provide counseling for more than five years. Thus, professional support, such as workshops and supervision, is essential for sustaining volunteer counseling efforts [2,3,4]. On the other hand, long-term volunteers demonstrate greater confidence and satisfaction in their counseling abilities [5].
Non-face-to-face counseling does offer benefits, including accessibility, convenience, and anonymity, that allow clients to share their issues more openly without fear of judgment. This environment often encourages more honest emotional expression. However, anonymity can also impede trust-building between clients and counselors, potentially leading to conflicts or a lack of objectivity and fairness in the counseling process due to mistrust of the counselor [6]. Despite the advantages of anonymity and convenience, non-face-to-face counseling faces challenges regarding the lack of expertise and continuity, limiting its effectiveness in delivering quality counseling services. To address these issues, this paper proposes a novel non-face-to-face counseling system. The proposed system uses text-mining techniques to analyze the psychological state of clients, even when they struggle to articulate their emotions or problems clearly in a non-face-to-face setting. Text mining serves as a tool for analyzing the conversation data between clients and counselors to detect emotions and psychological states. Through this approach, counselors can infer unexpressed emotions and provide more accurate assessments. The process of this study involves several key steps. First, the counseling data are collected and preprocessed. Next, essential information is extracted from the text using TF-IDF (Term Frequency–Inverse Document Frequency) and Word Embedding techniques. These methods help identify meaningful patterns and emotional states within the client’s conversation. The data are then automatically classified into predefined categories, allowing the system to search the database for the most appropriate counseling data. This provides counselors with customized solutions and the most suitable therapeutic recommendations based on the client’s situation. These insights can significantly enhance the quality of counseling and offer more effective and personalized solutions for clients.

2. Related Works (Text Mining)

As information technology advances and the volume of data continues to grow, it has become increasingly essential to develop methods that can process large amounts of unstructured data quickly and accurately. The need for the rapid extraction of meaningful information from these data is critical as the speed of data flow increases [7,8,9]. Traditional data analysis methods typically rely on users directly analyzing and interpreting data, which introduces the risk of subjective bias and is not well suited for handling massive datasets efficiently. However, text-mining technology can overcome these limitations by classifying large datasets into various topics, extracting key information, and allowing for deeper insights into each topic [10,11]. Text mining is a technique that extracts valuable information from unstructured or text-based data, leveraging NLP (Natural Language Processing) to uncover correlations, classify data, and summarize key findings. This technology has broad applications across numerous fields [12,13,14]. It incorporates methods such as NLP, information retrieval, document classification, clustering, text link analysis, and term extraction [15]. As the volume of web-based data rapidly increases, the importance of text mining has been amplified, particularly as it helps structure and analyze semi-structured text data, visualize results, and uncover new insights. As a result, text mining has gained widespread attention in research, highlighting its increasing significance [16,17]. With improvements in accuracy, the scope of text-mining applications has expanded across various domains. It is used to classify large volumes of documents, extract representative keywords, perform large-scale document searches, index and recommend content, and explore specific topics [18,19,20]. 
For example, studies using LDA (Latent Dirichlet Allocation) topic modeling include a case search and classification system designed to assist the general public in searching case law [21], a methodology for visualizing digital twin environments [22], and an analysis of online news articles aimed at identifying social issues related to patient safety [23]. Additional research has focused on predicting genes associated with diseases and uncovering new gene–disease relationships [24] and on using text mining to suggest rare diseases based on patient symptoms to aid in diagnosis [25]. Moreover, text mining has been applied to social media data for various purposes, such as analyzing customer needs [26,27,28,29] and assessing emotional changes to predict suicide risk [30,31,32,33]. While research in applying text-mining techniques has expanded across various fields, studies specifically focused on counseling data remain in their early stages, presenting challenges in direct comparison with existing studies. Nonetheless, studies with similar approaches include those analyzing mental health and predicting suicide risk through EHRs (Electronic Health Records) data. For example, a UK NHS study highlighted that relying solely on administrative codes could overlook up to 83% of patient records at risk for suicide, emphasizing the value of text-mining techniques to more precisely identify risk factors [34]. Additionally, research utilizing social media data for suicide risk prediction has demonstrated the effectiveness of text mining in detecting individual risks or tracking trends within groups through real-time data analysis [31]. In counseling-related research, some studies have used text mining to analyze themes and trends in university counseling programs, systematically categorizing major counseling areas through clustering methods to identify primary topics within these programs [35]. 
In the academic counseling domain, topic modeling techniques, such as Latent Dirichlet Allocation (LDA), have been employed to identify frequently occurring keywords and research topics, analyzing their frequencies to reveal key areas of interest [36]. These studies primarily focus on identifying expressions of positive, neutral, and negative sentiments through sentiment analysis to explore user experiences and emotional changes, or on analyzing classifications and frequencies of major topics. In contrast, this study proposes a system that automatically classifies counseling data into five categories (school, friends, personality, appearance, and family), which distinguishes it from previous studies. By combining TF-IDF and Word Embedding, this study optimizes the analysis of emotional elements and semantic similarities across topics within counseling data, presenting an effective method for analyzing multidimensional, emotionally rich text data, as found in counseling records. This study attempts to demonstrate the practical utility of counseling data, a difficult task that matters for advancing the applicability of text mining in the field of psychological counseling. Research focused on understanding clients’ emotions and psychological states through counseling data analysis remains relatively scarce, highlighting a gap in psychological counseling research using text data. Given the increasing importance of non-face-to-face counseling, particularly in response to rising demand for remote support, there is a pressing need for systems capable of analyzing clients’ conditions and offering targeted counseling solutions based on comprehensive counseling data analysis. This study, therefore, holds practical implications for developing systems that leverage counseling data to provide effective, data-driven psychological support.

3. Proposed Scheme

3.1. System Configuration

Figure 1 below is the overall configuration diagram of the system proposed in this paper.

The system proposed in this paper consists of three key components: a client server (user interface), a system server (text mining processing), and storage (for managing and storing counseling data). This system is implemented as a web service. The client server serves as the user interface, accessible via a web browser by clients or counselors. Through this interface, users can input counseling content and view real-time analysis results generated by the system. When a client submits counseling content, the input data are transmitted to the system server for processing. The system server is responsible for performing text mining and analysis, utilizing Python for text data processing. It employs text-mining techniques to analyze the input data and transmits the results back to the client server for display. The storage component manages the storage of clients’ counseling records and analysis results, using PostgreSQL to efficiently store and handle the data. This component retains both the counseling records and the analyzed results, which are used for future counseling sessions and treatments. The development environment of the proposed system is as follows: the languages and web technologies used include Python (version 3.12.6), Ajax, ECMAScript 2023, HTML5, and CSS (Cascading Style Sheets), while Django (version 5.0) was employed as the framework. Django offers excellent compatibility with Python-based NLP processing, allowing for the effective application of the TF-IDF and Word Embedding techniques used in this study. Additionally, Django provides a range of functionalities essential for web application development, making it an ideal framework for efficiently managing data flow between the user interface and the system server. PostgreSQL (version 16.4) was selected as the database, an open-source relational database well suited for processing large volumes of data and offering excellent compatibility with the Django framework.
Counseling records and classification results are stored in PostgreSQL, which facilitates the efficient management and retrieval of counseling data. The continuous accumulation of counseling records within the database plays a key role in improving the performance of the counseling model. Additionally, PostgreSQL offers stability and high-speed data retrieval performance, ensuring consistent efficiency as the system scales. With the expected need to process increasing volumes of counseling data and handle larger datasets in future studies, PostgreSQL’s scalability and robustness make it a highly suitable choice for this system. The server environment for the system was deployed using Amazon Web Services (AWS).

3.2. System Process

Figure 2 below shows the overall process of the system proposed in this paper.

The overall process of the proposed system is composed of six main stages, each focusing on processing and analyzing counseling data to ultimately provide counselors with suitable information for the client.

Data Collection: In the first stage, the counseling data entered by the client are transmitted to the system server via the client server.
Data Purification: In the second stage, the counseling data undergo a purification process to eliminate unnecessary elements. During this stage, the data are tokenized into individual words, and stopword removal is applied to delete unimportant words. This leaves only the core text data required for analysis. The tokenized data are then processed using the Word Embedding technique, in which a Word2Vec model semantically vectorizes each word.
TF-IDF Weight Calculation: In the third stage, the purified data are used to calculate the weight of each word using TF-IDF. TF-IDF evaluates the significance of words, assigning lower weights to frequently occurring but semantically insignificant words. The TF-IDF weighted values are multiplied by the Word Embedding vectors to emphasize the meanings of important words.
Analysis and Classification: In the fourth stage, the data, now with completed weight calculations, undergo analysis and classification. During this process, similar words are clustered, and the resulting values help identify the primary topic of the counseling data.
Code Assignment: In the fifth stage, important words within each cluster are grouped, and the resulting values are output based on their order in each group. A code value is then derived by comparing the clustered data with a predefined word dictionary. This code is compared with the classification codes stored in the database, and a final code is assigned to the counseling data. The newly assigned code and the corresponding data are stored in the database for future analysis.
Result Provision: In the final stage, data with the same code value as the newly assigned code are retrieved from the database and presented to the counselor. This completes the system’s overall process, allowing the counselor to use the provided information to offer appropriate counseling solutions to the client.
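The storage and retrieval steps in the last two stages can be sketched as follows. This is only an illustration: an in-memory SQLite database stands in for the PostgreSQL storage described above, and the table name `counseling_record`, the column names, and the helper functions are hypothetical rather than taken from the proposed system.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counseling_record ("
    "id INTEGER PRIMARY KEY, category_no INTEGER NOT NULL, content TEXT NOT NULL)"
)


def store_record(conn, category_no, content):
    """Code Assignment stage: save a record together with its assigned code."""
    cur = conn.execute(
        "INSERT INTO counseling_record (category_no, content) VALUES (?, ?)",
        (category_no, content),
    )
    conn.commit()
    return cur.lastrowid


def find_similar(conn, category_no, exclude_id=None):
    """Result Provision stage: fetch earlier records that share the same code."""
    rows = conn.execute(
        "SELECT id, content FROM counseling_record WHERE category_no = ? ORDER BY id",
        (category_no,),
    ).fetchall()
    return [(rid, content) for rid, content in rows if rid != exclude_id]


# Hypothetical records; the code values 1 and 2 are placeholders.
store_record(conn, 2, "Argument with a close friend")
store_record(conn, 1, "Worried about exam results")
new_id = store_record(conn, 2, "Feeling left out by classmates")

# Records presented to the counselor for the newly classified entry.
matches = find_similar(conn, 2, exclude_id=new_id)
```

Excluding the newly stored record from its own result set is a design choice made for this sketch; the text does not say whether the system does so.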

3.3. Dataset

Data Collection and Preprocessing Tasks

The dataset used for data collection and training in the proposed system consists of 761 counseling records collected from clients aged 8 to 19 between 2017 and 2023, organized into text data. To enhance the system’s training accuracy, the data were stored in CSV (Comma-Separated Values) format and used for system learning. The data used in this study were pre-classified into five categories—friends, personality, appearance, family, and others—according to the counseling topics provided by clients. For each category, key characteristics such as the number of data points, proportions within the dataset, predominant topics, and average conversation length are presented in Table 1. Table 1 provides a detailed summary of the main attributes of the counseling dataset used in this study.

Data Count and Proportion: Each category reflects clients’ main areas of interest, with “Friends” representing the largest portion at 24.3% of the total data, followed by “Personality” and “Appearance” with similar proportions, and “Family” and “Others” trailing. This distribution highlights the predominant topics clients discuss during counseling.
Main Topics: Key counseling topics within each category were identified to capture representative characteristics. For instance, in the “Friends” category, frequent topics include friendship, conflict, and communication, while the “Appearance” category centers on themes such as appearance assessment, body shape, facial features, and insecurities. In the “Others” category, topics mainly involve academics and school-related concerns.
Conversation Length: Average, maximum, and minimum conversation lengths were analyzed by category, quantitatively reflecting the complexity of discussions. The “Friends” category has an average conversation length of about 180 words, with some dialogues extending to 400 words, indicating more in-depth discussions around relationships and conflicts. In contrast, the “Family” and “Others” categories have shorter average conversation lengths of 155 and 150 words, respectively, which may suggest more straightforward discussions on family and academic topics.

A data refinement process was subsequently applied to the collected data. In the first purification process, spacing, blank spaces, and typographical errors in the CSV file were corrected. Morphological analysis and tokenization were performed using the Okt (Open Korean Text) morphological analyzer, resulting in the extraction of 2314 nouns. Following this, a second purification process was applied to remove nouns with low analysis value or low relevance. In this step, the stopword list provided by the NLTK (Natural Language Toolkit) was referenced, and additional unnecessary words were added to the stopword list to remove terms lacking meaningful content. As a result, 646 nouns remained after the second purification process. These 646 nouns were weighted by calculating the importance of each word using the TF-IDF technique. Table 2 below presents a portion of the results from the TF-IDF process.
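As a rough illustration of the purification and weighting steps, the sketch below tokenizes a few short English sentences, removes stopwords, and computes TF-IDF weights. It makes several assumptions: whitespace tokenization stands in for Okt noun extraction, the stopword list and sample sentences are invented, and the formula used (term frequency times log(N/df)) is one common TF-IDF variant, since the exact variant used in the study is not stated.

```python
import math
from collections import Counter

# Hypothetical stopword list; the study's actual list is not reproduced here.
STOPWORDS = {"the", "a", "is", "and", "my", "i", "with", "about"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'")
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document.

    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        if total == 0:
            weights.append({})
            continue
        weights.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return weights


# Invented sample records, for illustration only.
records = [
    "I had a fight with my friend",
    "My friend ignored me and I worry about exam",
    "I worry about my exam and my grades",
]
docs = [tokenize(r) for r in records]
weights = tf_idf(docs)
```

In the first record, "fight" appears in only one document while "friend" appears in two, so "fight" receives the higher weight. This is the behavior the text describes: words that occur across many records are down-weighted.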

The words that achieved high TF-IDF scores were primarily related to interpersonal characteristics, such as “friend”, “courage”, “best friend”, “violence”, “favor”, “disregard”, “pride”, and “homosexuality”. To emphasize words with greater importance in the data, the tokenized words obtained from the initial data purification process were semantically vectorized using the Word2Vec model. The previously calculated TF-IDF weights were then applied to calculate the Word Embeddings. In this study, the Word2Vec model was trained with parameters set to vector_size = 100, window = 5, min_count = 1, and workers = 4, using the Skip-gram approach to capture the emotional context in counseling data more effectively. These settings allow the model to better capture semantic relationships between significant words within the dataset. A combination of TF-IDF and Word Embedding techniques was employed to enable a nuanced analysis tailored to the unique characteristics of counseling data. TF-IDF reduces the weight of frequently occurring words while emphasizing relatively meaningful ones, whereas Word Embedding learns and vectorizes semantic similarities. By integrating these techniques, we achieved a fine-grained analysis that incorporates both the importance of keywords and the semantic relationships among them. For example, when “friend” and “conflict” appear together, Word Embedding with TF-IDF weighting enhances the importance of “conflict”, effectively highlighting the core topic of the session. In cases where the relationship between words like “trust” and “friend” is significant, this approach not only increases the weight of “trust” but also preserves its contextual connection to “friend”, maintaining the emotional context. This combined method plays a crucial role in enhancing semantic accuracy in counseling data analysis. The analysis of counseling data revealed that applying Word Embedding with TF-IDF weighting achieved higher accuracy compared to using TF-IDF alone. 
When TF-IDF was used independently, the clustering and topic modeling evaluation score was 0.7009. However, when TF-IDF weights were applied to the Word Embeddings, the score increased significantly to 0.9852. This indicates that applying TF-IDF weights to Word Embeddings is more suitable for capturing the semantic similarity of words related to emotional or personal expressions.
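One way to read "applying TF-IDF weights to the Word Embeddings" is as a TF-IDF-weighted average of word vectors. The sketch below follows that reading. The two-dimensional vectors and weights are toy values, the function name is hypothetical, and dividing by the sum of weights is an assumption, because the text only states that the weights are multiplied by the vectors. In practice the vectors would come from a Word2Vec model trained with the parameters listed above.

```python
def weighted_document_vector(tokens, embeddings, tfidf):
    """Average the word vectors of `tokens`, weighting each by its TF-IDF score.

    `embeddings` must be a non-empty {word: vector} mapping.
    Tokens without an embedding are skipped.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue
        w = tfidf.get(token, 0.0)
        vec = embeddings[token]
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be weighted
    return [x / weight_sum for x in total]


# Toy values: "conflict" has the higher TF-IDF weight, so it dominates.
embeddings = {"friend": [1.0, 0.0], "conflict": [0.0, 1.0]}
tfidf = {"friend": 0.1, "conflict": 0.3}
vec = weighted_document_vector(["friend", "conflict"], embeddings, tfidf)
```

With these values the resulting vector lies roughly three quarters of the way toward "conflict", which mirrors the "friend" and "conflict" example given earlier in this section.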

The resulting Word Embeddings were clustered using the KMeans algorithm, with the number of clusters set to six. The top 10 words from each cluster were output, and their TF-IDF importance was represented textually to highlight the relative significance of each word. Table 3 presents the words by cluster and their corresponding TF-IDF importance.
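The clustering step can be illustrated with a small pure-Python version of the k-means (Lloyd's) algorithm. This is a teaching sketch, not the implementation used in the study: it uses two clusters and invented two-dimensional vectors instead of six clusters over 100-dimensional embeddings, and it initializes centroids from the first k points so the result is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Invented word vectors, for illustration only.
word_vectors = {
    "friend": [0.9, 0.1],
    "exam": [0.1, 0.9],
    "conflict": [0.8, 0.2],
    "grades": [0.2, 0.8],
}
words = list(word_vectors)
labels, _ = kmeans([word_vectors[w] for w in words], k=2)

# Group words by cluster label, as a first step toward listing top words.
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
```

Sorting each group by TF-IDF weight and keeping the first ten entries would give a per-cluster word list of the kind Table 3 reports.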

As demonstrated in the table above, the analysis and classification of the input data for the system’s training identified five keywords—school, friends, personality, external objects, and family—as important for each cluster. Data with high correlations to each of these top five words were added to a synonym list, and the keywords were assigned code values to create a word dictionary, which was then stored in the database. Subsequent input data underwent the same preprocessing and morphological analysis as the initial data. The extracted words were then compared against the existing word dictionary for each cluster, and the document’s cluster was determined based on the cluster with the highest occurrence of words. The code value was assigned according to the predefined dictionary. Table 4 below shows the final word dictionary stored in the database.

The Category No. represents the index number of the category that serves as the basis for classification in the database. After the analysis process of the input data is completed, the Category No. acts as the code for retrieving existing consultation data that correspond to the analyzed consultation data, providing a foundation for the search. The Category Name refers to the name of the category, while the Category Keyword is a synonym field. When the analysis results are derived by cluster, the data are classified and assigned a category number based on the corresponding cluster and the synonym field stored in the existing database (Table 3).
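The dictionary lookup described above, in which a document is assigned to the category with the most matching words, might look like the following sketch. The category numbers and keyword sets are placeholders chosen for illustration and are not the contents of Table 4. Returning `None` when nothing matches and breaking ties in favor of the lowest category number are both assumptions.

```python
from collections import Counter

# Hypothetical word dictionary: category number -> (name, synonym keywords).
WORD_DICTIONARY = {
    1: ("school", {"school", "exam", "study", "grades"}),
    2: ("friends", {"friend", "conflict", "bullying"}),
    3: ("personality", {"shy", "temper", "confidence"}),
    4: ("appearance", {"weight", "face", "height"}),
    5: ("family", {"mother", "father", "sibling"}),
}


def assign_category(tokens):
    """Return the category number whose keywords match the most tokens.

    Returns None when no keyword matches. On a tie, the category that
    appears first in WORD_DICTIONARY wins.
    """
    hits = Counter()
    for number, (_, keywords) in WORD_DICTIONARY.items():
        hits[number] = sum(1 for t in tokens if t in keywords)
    number, count = max(hits.items(), key=lambda item: item[1])
    return number if count > 0 else None


# Two "friends" keywords outweigh one "school" keyword.
category_no = assign_category(["friend", "conflict", "exam"])
category_name = WORD_DICTIONARY[category_no][0]
```

The returned number plays the role of the Category No.: it is stored with the record and later used as the search key for retrieving earlier records with the same code.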

4. Results

To evaluate the accuracy of the proposed system, the following test was conducted. A total of 250 existing counseling diary entries were input into the system. The input data consisted of conversational text, with 50 counseling diaries provided for each category (school, friends, personality, appearance, and family), resulting in 250 cases in total. Table 5 below presents the results of the confusion matrix derived for each category based on the input counseling data processed by the proposed system.

Of the 50 entries provided for each category, 44 were correctly classified in the school category, 40 in friends, 40 in personality, 49 in appearance, and 42 in family. The misclassified entries included 6 in the school category, 10 in friends, 10 in personality, 7 in appearance, and 8 in family (Table 5). Table 6 below summarizes the evaluation results of the classification performance by measuring the accuracy, precision, recall, and F1-Score for each category.

As shown in the table above, the appearance category exhibited the highest performance, with an accuracy of 0.98, while the friend category showed the lowest performance, with a precision of 0.75. Overall, the F1-Score ranged from 0.77 to 0.87 across the categories. The school category demonstrated an accuracy of 0.88, a precision of 0.89, a recall of 0.88, and an F1-Score of 0.87, indicating that school-related counseling data had a clear pattern and relatively high performance. The friend category showed somewhat lower performance, with an accuracy of 0.80, a precision of 0.75, a recall of 0.80, and an F1-Score of 0.77. The personality category achieved an accuracy of 0.80, a precision of 0.82, a recall of 0.80, and an F1-Score of 0.81, indicating that the classification process was more challenging due to the diversity and complexity of personality-related texts. The appearance category showed the highest accuracy at 0.98, with a precision of 0.86, a recall of 0.87, and an F1-Score of 0.86, as appearance-related counseling data had a clear topic and minimal overlap with other categories. Lastly, the family category recorded an accuracy of 0.84, a precision of 0.91, a recall of 0.84, and an F1-Score of 0.87, benefiting from a clear topic focus in the texts, leading to higher precision and F1-Score (Table 5).
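For reference, precision, recall, and F1-Score can be derived from a confusion matrix using their standard definitions, as in the sketch below. The two-class matrix is invented for illustration and is not the data in Table 5. Rows are assumed to hold true categories and columns predicted categories.

```python
def per_class_metrics(matrix, labels):
    """Compute precision, recall, and F1 per class.

    matrix[i][j] = number of items whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # true class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp   # predicted i, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        results[label] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Invented example: 10 true "school" and 10 true "friends" entries.
labels = ["school", "friends"]
matrix = [
    [8, 2],  # true school: 8 predicted school, 2 predicted friends
    [1, 9],  # true friends: 1 predicted school, 9 predicted friends
]
metrics = per_class_metrics(matrix, labels)
```

Under these definitions, recall for a category is the share of its true entries that were labeled correctly, while precision is the share of entries assigned to that category that truly belong to it.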

5. Discussion

In this study, we developed a system that automatically classifies counseling data into five categories (school, friends, personality, appearance, and family) using text-mining techniques, and we evaluated its performance. The experimental results indicated satisfactory overall system performance, with the highest accuracy observed in the appearance and school categories. However, relatively lower performance was observed in the friends and personality categories, suggesting the need for additional refinement.

The performance of the school category was notably high, with an accuracy of 0.88, a precision of 0.89, a recall of 0.88, and an F1-Score of 0.87. With both precision and recall exceeding 0.88, it is evident that the system achieved a balanced rate of correct predictions for school-related data and a high level of accuracy for the predicted category. The F1-Score of 0.87, along with consistent precision and recall, indicates that school-related counseling data have distinct patterns that were effectively captured, leading to minimal confusion with other categories. The clear distinction of school-related terms, such as “study”, “exam”, and “school life”, allowed the system to perform stable classification for this category.

In contrast, the performance in the friends category was lower, with an accuracy of 0.80, a precision of 0.75, a recall of 0.80, and an F1-Score of 0.77. Although the accuracy of 0.80 was relatively strong, the precision of 0.75 suggests that the system misclassified a significant portion of data as belonging to the friends category. This may be due to the overlap of friend-related issues with other categories, such as family relationships or personality, which caused classification ambiguity. Despite the recall of 0.80, indicating that 80% of actual friend-related data were correctly predicted, the overall F1-Score of 0.77 reflects the need for improvement in terms of accuracy.

The counseling data in the “Friends” category frequently overlap thematically with the “Family” or “Personality” categories. For example, cases where individuals rely on friends due to family conflicts or encounter friendship difficulties due to personality issues illustrate this overlap. Such thematic overlap introduces ambiguity when categorizing certain counseling data as “Friends” versus “Family” or “Personality”.

This overlap may lead to misclassification, where data intended for the “Friends” category are instead categorized under “Family” or “Personality”. This ambiguity could contribute to the relatively lower classification performance (e.g., accuracy and precision) observed in the “Friends” category.

The personality category demonstrated similar performance to the friends category, with an accuracy of 0.80, a precision of 0.82, a recall of 0.80, and an F1-Score of 0.81. Personality-related data often reflect clients’ personal experiences or situations, making feature extraction more complex. The polysemy and context dependence of personality-related text likely contributed to the system’s performance challenges. Nevertheless, the F1-Score of 0.81 and minimal discrepancy between precision and recall indicate stable and reliable classification results overall.

For the appearance category, the system exhibited outstanding performance, with an accuracy of 0.98, a precision of 0.86, a recall of 0.87, and an F1-Score of 0.86. The accuracy of 0.98 highlights the system’s ability to accurately classify nearly all appearance-related counseling data. The precision and recall were also excellent, at 0.86 and 0.87, respectively. Given that words related to appearance tend to focus on specific external features and are less likely to overlap with other categories, the high performance in this category is indicative of the system’s reliability.

The family category achieved an accuracy of 0.84, a precision of 0.91, a recall of 0.84, and an F1-Score of 0.87. The high precision of 0.91 indicates that the system accurately classified family-related counseling data. With a recall of 0.84, the system correctly predicted 84% of the actual family-related data. The F1-Score of 0.87 suggests that the system achieved a good balance between precision and recall for this category. Family-related topics are typically well-defined in counseling data, contributing to the system’s strong performance.

In summary, the proposed system achieved the highest performance in the appearance and school categories, while relatively lower performance was observed in the friends category. The discrepancy between precision and recall in the friends category suggests the need for additional data training and improvement, particularly considering the data imbalance or characteristic differences. For all categories except friends, the F1-Score exceeded 0.80, confirming the overall strong classification and analysis performance of the system.

This study aimed to develop a system for the real-time automatic classification of non-face-to-face counseling data, enabling counselors to analyze and respond promptly. While prior research has mainly focused on analyzing themes in university counseling, identifying trends in academic counseling, and examining emotional shifts or keyword categorization, this study emphasizes real-time data analysis and the integration of emotional context to support quick decision-making by counselors.

Distinguishing itself, this study combines the TF-IDF and Word Embedding techniques to automatically classify counseling data into five categories: school, friends, personality, appearance, and family. This combination enhances the system’s ability to reflect emotional characteristics and semantic relationships among topics, offering new possibilities for analyzing emotional and psychological aspects not extensively explored in prior research. By applying an analysis method optimized for multidimensional counseling data with emotional context, this study demonstrates the feasibility of precise classification.

Additionally, while existing studies often utilize techniques like Latent Dirichlet Allocation (LDA) or TF-IDF for general document classification or trend analysis, this study optimizes these techniques specifically for counseling data, achieving high accuracy in categorizing the counseling dataset. This approach offers novel potential for capturing relationships between emotional characteristics and themes in non-face-to-face counseling data.

Future research could focus on further improving the system’s accuracy and reliability through comparisons with advanced NLP models, such as BERT, and enhancing its practical utility by incorporating feedback from real counseling sessions. This study, therefore, serves as foundational research in counseling data analysis, demonstrating that text mining can effectively support counselors in providing prompt and accurate assistance.

Although direct comparisons with previous studies are limited, this work contributes to the field by exploring the potential for classifying counseling data through adapted text-mining techniques. It is expected to be a vital cornerstone for future advancements in non-face-to-face counseling support systems.

6. Conclusions

In this study, we proposed a system that automatically classifies counseling data into predefined categories using text-mining techniques and evaluated its performance. By optimizing and applying existing text-mining methods to counseling data, we achieved more refined classification results, particularly through the combination of TF-IDF and Word Embedding. The classification performance evaluation showed high accuracy and F1-Scores across the predefined categories (school, friends, personality, appearance, family), indicating that the system effectively captures the multidimensional characteristics of counseling data. Notably, the system’s strong performance in emotionally and subjectively nuanced categories like “appearance” and “personality” underscores the practical utility of this approach.

The system proposed in this study is designed to automatically analyze non-face-to-face counseling data, allowing for the rapid classification of key emotional elements and psychological states that may require counselor intervention. In real counseling settings, this system has the potential to provide practical support to counselors, contributing to counseling efficiency and the provision of customized responses. Specifically, in terms of efficient counseling time utilization, the system automatically categorizes the primary topics and emotional states within counseling data, enabling counselors to quickly identify essential information. This allows counselors to focus on enhancing the quality and effectiveness of counseling while reducing the time spent on data analysis. The system also holds promise for improving consistency and quality in counseling responses. Based on the categorized data, counselors can develop structured response strategies for issues such as interpersonal relationships, family problems, and self-esteem concerns, providing reliable analysis results that support more effective counseling, especially for novice counselors. Furthermore, by analyzing counseling data to recommend optimized counseling directions for each individual, the proposed system demonstrates the potential for offering tailored solutions. For example, if “friendship” and “self-esteem” are identified as primary counseling topics, the system can automatically suggest tailored counseling strategies to deliver appropriate guidance to the client. As the volume of counseling records grows with the increase in non-face-to-face counseling, this system is well suited for processing and analyzing large-scale counseling data. This enables the efficient management and analysis of multiple client records, facilitating data-driven decision-making in counseling environments and ultimately enhancing the overall quality of counseling. 

A limitation of this study is that the dataset was restricted to a specific age group (8–19 years), which may limit the generalizability of the results. Future research will incorporate data from various age groups and demographic backgrounds to improve the model’s generalizability. To address the misclassification observed in the friends category, future work will explore multi-label or hierarchical classification methods that more accurately capture the overlapping characteristics among categories. Although this study trained the system on a somewhat limited dataset of 761 records, future efforts will draw on a larger volume of counseling data and apply diverse techniques to reduce misclassification and further enhance classification performance.

In conclusion, this study demonstrates the potential of applying text-mining techniques to counseling data and marks an important initial step toward automated text analysis in counseling. In addition to more diverse age groups and larger datasets, in-depth exploration of various text-mining techniques and classification algorithms will be crucial for further performance enhancement. By developing a system that can automatically classify counseling data in real time, this approach can directly benefit counselors by saving time and improving counseling quality, ultimately offering significant advantages to both counselors and clients.

Author Contributions

Methodology, W.P. and S.P.; investigation, S.P.; software, W.P. and S.P.; writing—original draft preparation, W.P. and S.P.; writing—review and editing, W.P. and S.O.; supervision, S.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1C1C1010692).

Data Availability Statement

The data presented in this study are available on request from the corresponding author because the dataset includes private data.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. System configuration diagram.

Figure 2. Process of the proposed system.

Table 1. Summary of key characteristics of the counseling dataset.

Category	Friend	Personality	Appearance	Family	Other	Total
Data Count	185	177	176	116	107	761
Data Proportion	24.3%	23.3%	23.1%	15.2%	14.1%	100%
Key Topics	Friendship, Conflict, Communication, …	Self-esteem, Confidence, Stress, Self-doubt, …	Body, Face, Complex, Voice, …	Nagging, Conversation, Relationship, Brothers and Sisters, …	Exam, Career, Grades, …	-
Avg. Conversation Length (words)	180	170	160	155	150	165
Max Conversation Length (words)	400	390	370	360	350	400
Min Conversation Length (words)	60	55	53	52	50	50

Table 2. Partial results of the TF-IDF process.

	Word	TF-IDF		Word	TF-IDF		Word	TF-IDF
1	courage	0.6051	16	body	0.3306	31	annoyance	0.2662
2	running buddy	0.5304	17	failure	0.3239	32	complex	0.2773
3	violence	0.7657	18	rumor	0.3237	33	growth	0.3237
4	teacher	0.6949	19	misunderstanding	0.3767	34	laugh	0.2698
5	friend	1.0000	20	gossip	0.3986	35	lie	0.2769
6	trust	0.8146	21	grade	0.3864	36	likeable	0.6779
7	unfair	0.6136	22	study	0.3696	37	pin money	0.3842
8	homosexuality	0.5879	23	worry	0.2920	38	friendship	0.8489
9	bust	0.3210	24	face	0.4709	39	loneliness	0.9697
10	give up	0.3209	25	school	0.4051	40	club	0.5349
11	pride	0.5773	26	voice	0.3364	41	hatred	0.5668
12	self-doubt	0.5712	27	mom	0.3847	42	argument	0.3498
13	self-esteem	0.4013	28	protagonist	0.3956	43	fight	0.3427
14	appearance	0.3596	29	letter	0.3848	44	bullying	0.2959
15	looks	0.3573	30	work out	0.3457	45	ignore	0.7299
·····
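To make the scores in Table 2 easier to interpret, the sketch below shows one way such values can be computed. It is a minimal illustration under stated assumptions: the article does not specify the TF-IDF variant or any normalization, so the IDF formula, the summing of per-document scores, and the scaling by the maximum (suggested only by the top score of 1.0000 in Table 2) are all assumptions. The function name `tfidf_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {term: score} for tokenized documents, scaled so the top term is 1.0.

    Assumed variant: tf = count / document length, idf = ln(N / df) + 1,
    with per-document scores summed over the corpus.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = Counter()
    for doc in docs:
        term_counts = Counter(doc)
        for term, count in term_counts.items():
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            scores[term] += (count / len(doc)) * idf

    if not scores:
        return {}
    top = max(scores.values())
    return {term: score / top for term, score in scores.items()}


# Toy tokenized documents, invented for illustration only.
docs = [
    ["friend", "trust", "friend"],
    ["friend", "gossip"],
    ["mom", "nagging", "mom"],
]
for term, score in sorted(tfidf_scores(docs).items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.4f}")
```

Scaling by the maximum keeps the ranking unchanged while making scores comparable at a glance.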

Table 3. Words output for each cluster.

	Cluster 1	Cluster 2	Cluster 3	Cluster 4	Cluster 5
Word 1	school	friend	personality	appearance	family
Word 2	club	interest	annoyance	armpit	mom
Word 3	study	running buddy	hatred	banter	older sister
Word 4	teacher	misunderstanding	experience	bust	dad
Word 5	pal	gossip	effort	height	nagging
Word 6	subjects	friendship	disobey	fashion	relationship
Word 7	violence	secret	affirmation	body	lie
Word 8	tattle	trust	talkative	gait	conversation
Word 9	cigarette	rumor	introvert	face	yearning
Word 10	grade	homosexuality	shy	complex	answer

Table 4. Dictionary of category keywords.

Category No.	Category Name	Category Keywords
01	School	club, pal, violence, ···
02	Friend	running buddy, gossip, friendship, ···
03	Personality	affirmation, annoyance, shy, ···
04	Appearance	body, complex, armpit, ···
05	Family	nagging, conversation, relationship, ···
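A category dictionary like Table 4 can be used for a simple lookup-based classification. The sketch below is a deliberately simplified stand-in, not the classifier described in this article: it counts keyword occurrences and picks the category with the most hits. Only the keywords printed in Table 4 are included (the elided "···" entries are not filled in), and the function name `classify` is hypothetical.

```python
# Keywords exactly as listed in Table 4; elided entries are left out.
CATEGORY_KEYWORDS = {
    "School": ["club", "pal", "violence"],
    "Friend": ["running buddy", "gossip", "friendship"],
    "Personality": ["affirmation", "annoyance", "shy"],
    "Appearance": ["body", "complex", "armpit"],
    "Family": ["nagging", "conversation", "relationship"],
}


def classify(text):
    """Return the category with the most keyword hits, or None if there are none.

    Matching is by plain substring, so short keywords such as "pal" can
    also match inside longer words; a real system would tokenize first.
    Ties go to the category listed first in the dictionary.
    """
    lowered = text.lower()
    hits = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(classify("My running buddy spread gossip about me"))
print(classify("Mom keeps nagging me"))
```

Text that matches no keyword returns `None`, which a caller could route to manual review.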

Table 5. Confusion matrix across the five categories (rows: actual category; columns: predicted category).

	School	Friend	Personality	Appearance	Family
School	44	3	1	2	0
Friend	4	40	2	3	1
Personality	2	4	40	1	3
Appearance	1	4	2	49	0
Family	0	2	4	2	42

Table 6. Results of classification performance evaluation by measuring Accuracy, Precision, Recall, and F1-Score by category.

Category	Accuracy	Precision	Recall	F1-Score
School	0.88	0.86	0.88	0.87
Friend	0.80	0.75	0.80	0.77
Personality	0.80	0.82	0.80	0.81
Appearance	0.98	0.86	0.87	0.86
Family	0.84	0.91	0.84	0.87
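The precision, recall, and F1-Score values in Table 6 can be approximately reproduced from the confusion matrix in Table 5. The sketch below assumes that rows are actual categories and columns are predicted categories, an orientation that is consistent with the reported values; the function name `per_class_metrics` is hypothetical. The Accuracy column is not recomputed because the article does not define how per-category accuracy was derived.

```python
LABELS = ["School", "Friend", "Personality", "Appearance", "Family"]

# Confusion matrix from Table 5.
# Assumption: rows are actual categories, columns are predicted categories.
MATRIX = [
    [44, 3, 1, 2, 0],
    [4, 40, 2, 3, 1],
    [2, 4, 40, 1, 3],
    [1, 4, 2, 49, 0],
    [0, 2, 4, 2, 42],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual_total = sum(matrix[i])                    # row sum
        predicted_total = sum(row[i] for row in matrix)  # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


for label, (precision, recall, f1) in per_class_metrics(MATRIX, LABELS).items():
    print(f"{label}\t{precision:.3f}\t{recall:.3f}\t{f1:.3f}")
```

Small differences in the last digit relative to Table 6 can arise from rounding, for example when F1 is computed from already rounded precision and recall.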

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
