Article
Peer-Review Record

UGC Knowledge Features and Their Influences on the Stock Market: An Empirical Study Based on Topic Modeling

Information 2022, 13(10), 454; https://doi.org/10.3390/info13100454
by Ning Li 1,*, Kefu Chen 2 and Huixin He 3
Submission received: 16 July 2022 / Revised: 4 September 2022 / Accepted: 20 September 2022 / Published: 27 September 2022

Round 1

Reviewer 1 Report

This paper presents a framework for knowledge discovery from user-generated content in the domain of investor social networks. LDAvis is used as a visualization tool to help determine the parameters for the LDA-based topic modeling. A simple rule-based method is used to extract feature knowledge (risk scores) at the lexical level, and a regression model is proposed to correlate the feature knowledge with the stock market in the short term (5-10 days).

The paper does a fine job describing the framework. There are certain parts that need improvement:

- Writing: 1) punctuation issues, see Sec. 4.1 and other places (no space after punctuation); 2) confusing expressions. Examples: a. “non-letter characters, and stops.” b. “After removing the stop words and taking the word root, the Corpora.” c. “Dictionary is used to generate the Dictionary.” d. Section 4.2: “According to the content described in the previous chapter” <- previous chapter? e. Page 5: “Lev put forward three evaluation methods of the LDA model” <- what is Lev? f. “noun words” -> “nouns”; 3) lack of references. Examples: WRDS database, SPACY; 4) Figs. 2 and 3 are not legible.

- It feels like the paper is taken out of another long paper. The description of the use of WordNet is missing.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

There are some valuable experiments and results, but it is very difficult to decipher what they are and why they were done. A couple of tasks are carried out without a description of why they were done and how they contribute to an overall objective. The statement describing the objective, "The purpose of this paper is to examine the value of UGC in investment decision-making from the perspective of knowledge discovery", is given in the abstract, but I can't link your methodology to the statement.

Author Response

1. Response to comment 1): (statement describing the objective in the abstract cannot be linked to methodology)

Response: Thank you for your patient reading. The statement in the abstract, “The purpose of this paper is to examine the value of UGC in investment decision-making from the perspective of knowledge discovery”, corresponds to two parts of the manuscript: the theoretical background in subsection 2.3 and the empirical test results in subsection 5.1. Our oversight was that “knowledge discovery” was not explained in the original text. An explanation has been added in this round of revisions:

Revisions in Section 2.3:

Knowledge Discovery in Databases (KDD) has become an important subject in modern management science. KDD is a complex process of identifying effective, potentially useful and understandable patterns from data sets. The general KDD system framework includes four steps: data preparation, data cleaning, data mining and data verification......

...Research on knowledge discovery in IBSN is of great significance to strategic decisions and investment decisions. In listed companies, to avoid possible litigation risks in the IPO process, the directors of the company will choose to disclose the risks and reorganize information in narrative materials. IBSN users' tacit knowledge includes keen judgment on market orientation and company changes. Therefore, a knowledge discovery study of the UGC of IBSN platforms is more helpful for discovering this tacit knowledge...

Revisions in section 5.1:

5.1 Market feedback on risk attributes in UGC knowledge features

This section examines the market feedback to UGC knowledge. Based on the results of Table 2 and Table 3, we transform the language features of unstructured UGC data into structured data and merge them with the companies’ financial data to form panel data. The basic idea of the regression model is to compare the pattern of fluctuation in stock returns before and after the release date of UGC. This pattern helps to discover the strategic risks of listed companies and provides a reference for investors' decision-making.
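The before/after comparison described above can be illustrated with a minimal sketch. This is not the authors' regression model, which the response does not specify; it only compares the mean and dispersion of excess returns in a window on each side of a release day. The function name `window_stats`, the 5-day window (chosen from the 5-10 day horizon mentioned by Reviewer 1), and the return series are all assumptions made up for illustration.

```python
from statistics import mean, pstdev

def window_stats(excess_returns, event_index, window=5):
    """Mean and population std of excess returns before and after the event day."""
    before = excess_returns[max(0, event_index - window):event_index]
    after = excess_returns[event_index + 1:event_index + 1 + window]
    if not before or not after:
        raise ValueError("not enough observations around the event")
    return {
        "before": (mean(before), pstdev(before)),
        "after": (mean(after), pstdev(after)),
    }

# Made-up daily excess returns; index 5 is the hypothetical UGC release day.
returns = [0.01, -0.01, 0.01, -0.01, 0.0, 0.02, -0.03, 0.03, -0.03, 0.03, 0.0]
stats = window_stats(returns, event_index=5, window=5)
print(stats)
```

In a panel setting, statistics of this kind would be computed per company and release date and then related to the UGC risk features, which is the role the regression plays in the manuscript.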

2. Response to comment 2): (Introduction must be improved to provide sufficient background)

Response: Thank you for your evaluation of this part. On the basis of the original introduction, we have removed superfluous material and added the necessary explanation. The revised introduction is marked in red in the manuscript.

Revisions in Introduction:

The development of the Internet has changed the way people learn. The way news is produced has also become more inclusive. User-Generated Content (UGC), produced in massive amounts, is the main mode of content production on a growing number of social media platforms. The application of UGC in industrial practice involves many fields such as economics, brand marketing, government management, and so on. In strategic investment, investors look to public company news and user-generated opinions to make decisions. Therefore, discovering knowledge from social media user-generated content has become important for making effective strategic investments.

In the study of social learning and decision-making behavior, how to sufficiently acquire knowledge and how to read the market's feedback on such knowledge are of great significance to the screening of nodes in technological exploration. With the development of social media, investor social networks have rapidly become a channel for the dissemination and exchange of investment knowledge. More targeted professional service platforms have emerged. Among them, the emergence of the Investor Based Social Network (IBSN) not only enables independent investors to express their own investment opinions but also enables financial researchers and analysts to collect a large number of investors' ideas. IBSN users include senior investors, financial experts, consultants, and other professionals, as well as company employees or common shareholders. These users' tacit knowledge includes keen judgment on market orientation and company changes. Therefore, the original analysis reports, reposts, and replies published by these users have important research value.

However, UGC in finance is different from UGC in other areas. Firstly, UGC from the financial industry includes not only public word of mouth about the company's brand and products, but also decisions shared by experienced investors. At the level of strategic management, the requirements of UGC knowledge feature extraction need to be improved. In addition to extracting effective features, more research should be done to evaluate whether the content contains high-risk or low-risk information. This requires additional identification of the risk attributes contained in the information content. Secondly, in terms of communication effect, some UGC content in investors' social networks has a strong communication effect, especially for listed companies with higher risks. Investors are more inclined to pay attention to such content. Therefore, this paper puts forward the following research questions:

Question 1: In investor social networks, what kinds of strategic reference can UGC provide for listed companies?

Question 2: How to evaluate the market feedback on UGC knowledge features?

3. Response to comment 3): (the research design must be improved)

Response: Thank you for your comments on the research design. In the original manuscript, Chapter 3 is the general introduction of the research framework, Chapter 4 is the detailed explanation of the data, and Chapter 5 is the regression model and results. This led to a very scattered research design. In the revised version, we have reorganized the paragraphs. The new Chapter 3 Research Methods introduces the research framework and explains the methods and processing steps. The new Chapter 4 Empirical Research focuses on the question of how UGC can provide strategic reference for listed companies and how to measure the knowledge features from a strategic perspective. The new Chapter 5 Stock Market Impact focuses on the question of evaluating the market response to UGC knowledge features. Many revisions have been made in the manuscript. For details, please see the red-marked parts of Chapters 3, 4 and 5.

Revisions in Chapter 3:

One of the research purposes of this article is to verify that UGC contains certain wisdom, which can get feedback from financial markets. Different from the operation process of the classic KDD, this study framework pays more attention to the feedback effect of the fluctuation of the excess return of the listed company's stock price on the fluctuation of the strategic risk evaluated in UGC content in data verification. As shown in Figure 1, this framework includes three processes: UGC data acquisition and natural language processing, LDA topic recognition and visualization, language analysis and empirical verification.

3.1 UGC data collection and natural language processing.

In this process, we will extract key information and store it locally for future research. See Section 4.1 The Data Source and Data Preprocessing for a detailed explanation of data acquisition and data processing. Conventional processing of natural language includes six steps:

Step 1: Exclude the following samples: articles, reviews, and posts from companies not listed on the US stock market; Samples with missing data on relevant variables; A sample of non-English expressions.

Step 2: To reduce the ambiguity of unstructured data, the sample was spell-checked at the preprocessing stage to eliminate content with spelling errors. A stopword list is also added; it contains numerical content, expression characters, and other characters. Anything in the sample that matches the stopword list is eliminated.

Step 3: Some words are reduced, and stop words that carry no deeper meaning are omitted.

Step 4: The quantification of lexical semantic relatedness has many applications in NLP, and many different measures have been proposed. All of them use WordNet as their central resource. A knowledge base in the form of WordNet's lexical relations is used to automatically locate training examples in a general text corpus. In this article, the title and abstract fields in the original analytical articles of UGC are extracted, and WordNet is used as the semantic-oriented English Dictionary.

Step 5: Dictionary operation: Through Python natural language processing, “gensim” is called to convert the documents into vector form for the LDA model. The document set in “gensim” is expressed in the form of corpora, which is essentially a two-dimensional matrix format. In practice, the number of distinct words is very large (tens of thousands or even 100,000), while the number of words in a single document is limited, so using a traditional dense matrix would waste a great deal of memory. After a document is partitioned into words, a dictionary is generated using “dictionary = corpora.Dictionary(texts)”. The “save” function can then be used to persist the dictionary.

Step 6: Use the pickle tool to train the corpus and save the generated lexicon in the document. spaCy is an industrial-level Python natural language processing tool. spaCy makes extensive use of Cython to improve the performance of related modules, so it has practical application value in the industry. This article uses the word tokenization functions in spaCy, including sentence breaking, stem extraction and part-of-speech tagging. The main goal is to restore English word forms so that they can be better used in machine learning.
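The stopword filtering in Step 2 and the sparse dictionary representation in Step 5 can be sketched in plain Python. This is a standard-library stand-in for the idea behind a token-to-id dictionary and a sparse bag-of-words corpus, not the authors' code and not the gensim API itself. The stop list, the toy documents, and the helper names `tokenize`, `build_dictionary` and `doc2bow` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical stop list and toy documents, for illustration only.
STOPWORDS = {"the", "a", "of", "is", "and", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword tokens."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]

def build_dictionary(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow(tokens, token2id):
    """Sparse bag-of-words: sorted (token_id, count) pairs for known tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [
    "The growth of the portfolio is strong.",
    "Bearish talk and a bounce: growth stalls, growth fades.",
]
texts = [tokenize(d) for d in docs]
token2id = build_dictionary(texts)
corpus = [doc2bow(t, token2id) for t in texts]
print(token2id)
print(corpus)
```

Each document is stored only as the ids and counts of the words it actually contains, which is why a sparse form avoids the memory cost of a dense document-by-vocabulary matrix when the vocabulary runs to tens of thousands of words.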

3.2 LDA topic recognition and visualization

……

3.3 Language Analysis and empirical verification

……

4. Response to comment 4): (the results presented must be improved)

Response: Thank you for your comment. The results in the original manuscript did contain some unclear expressions. We identified two kinds of problems and revised them one by one. First, some concepts in the results were ambiguous and expressed inconsistently. Second, the format of the results presented was unclear and imprecise. We redrew the figures and modified Table 1, Table 2, Table 3 and Table 4. We also clarified the use of terms such as UGC, UGC Article, UGC Article Title, UGC Article Summary, UGC Stock Talk, investment analysis postings, stock forum, knowledge discovery databases, and so on. For these two kinds of problems, we have made modifications in the new manuscript.

Revisions to Section 4.1:

In the following paragraph, we use UGC Article as the measurement variable of user-generated investment analysis postings. The contents of the investment analysis postings mainly include title, company, summary, analysis results and user information. In the following paragraph, we use UGC Article Title as the measurement variable of user-generated investment analysis posting titles, and use UGC Article Summary as the measurement variable of user-generated investment analysis posting summaries. Anyone can post on the SA, but it takes one day to review the user-generated investment analysis postings. On the IBSN platform, registered users can comment on other users’ postings or discuss freely in the Stock Talk Forum, as shown in Appendix Figure A2 (b). Stock talks have the advantage of timeliness, but their disadvantages are high noise and insufficient rigor. Due to the different posting habits of users, some incomplete expressions often appear in talks. However, IBSN usually automatically tags stock talks with related company names and user investment behaviors. In the following paragraph, we use UGC Stock Talk as the measurement variable of user-generated stock talks. Matching tags and stock talks can help to classify the incomplete expressions and to improve the readability and credibility of NLP results.

Financial data comes from the Wharton Research Data Services (WRDS) platform, which integrates Compustat, CRSP, TFN (THOMSON), TAQ and other well-known database products. The WRDS platform is a leading business intelligence, data analytics, and research platform for global institutions. This article downloads the stock index and the relevant financial data of listed companies from the WRDS-CRSP database. The research sample covers 2,996 listed companies, 10,386 UGC Articles, and 125,247 UGC Stock Talks before data cleaning and data standardization.

According to the process of step 1, step 2 and step 3 in Section 3.1 UGC data collection and natural language processing, the result is 22,192 observations.

Revisions to Section 4.3:

Table 1 and Table 2 respectively display the topic names, keywords and company names with high frequency under each topic. Due to the limitation of table length, the keywords listed in Table 1 are used to reflect the top six words with the highest weight distribution of the words, and the keywords listed in Table 2 represent the top ten words with the highest weight distribution of the stock talk words.

According to Table 1, the distribution of 20 UGC Article Title topics is relatively uniform, and the proportion of each topic is relatively close. Due to the rigorous and professional requirements for the UGC investment analysis postings, most topics involve at least one listed company name with high frequency. Company names usually appear in the form of company abbreviations, which makes it possible to identify companies’ nodes in natural language processing. For example, AstraZeneca is the company name with the highest frequency under the fourth topic (Undervalued Company). The contents listed in the last column of the table are the top two company names with the highest frequency in each topic.

According to Table 2, the distribution of the 13 UGC Stock Talk topics is uneven, with the proportions of the topics ranging from high to low. For example, the top three topics with the highest distribution are #5 Stock price movement topic, #6 Technology announcement topic and #12 China-U.S. trade tariffs topic. Different from the UGC Article Title, the frequency of company names in stock talks is lower than that of other keywords, so the names of listed companies are not shown in Table 2.

By comparing Table 1 and Table 2, three aspects can be discovered as follows.

(1) Topic proportion distribution difference. Generally, investment analysis postings have length requirements, and the posting format is strictly required. Such postings look more like analytical articles, and postings that are too long or too short will be rejected. The postings should not be too colloquial and should meet a certain level of professional writing and financial analysis ability. Users must agree to the disclosure standards and the editing services. Platform editors will not revise users’ ideas, but will polish the titles, abstracts and texts. Compared with the investment analysis postings, there are no format requirements for the stock talks. Repetitive words and expressions often appear in such talks.

(2) Topic issues difference. The topics of the UGC Stock Talk reflect the hot issues that ordinary users are concerned about. The stock talk contents are more specific and more direct. In contrast, the topics of the UGC Article Title reflect the strategic issues that the whole stock market is concerned about. The investment analysis posting contents are more macro and more comprehensive.

(3) Keyword stems difference. This article uses the word tokenization functions in spaCy, including stem extraction and part-of-speech tagging. Therefore, keywords in Table 1 and Table 2 are presented in the form of stems. In Table 1, many high-frequency stems, such as "portfolio", "growth", "invest", appear in multiple topics at the same time. These keyword stems show that the style of the article conforms to the professionalism in the field of strategic management and the concise expression of the UGC Article Title. In Table 2, there are fewer repeated keywords in the UGC Stock Talk topics. It is not difficult to find that there are more colloquial stems among the UGC Stock Talk keywords, such as Bullish, bounce, bearish, stats, etc., which reflect the "talk" attribute of the forum.

5. Response to comment 5): (the conclusions must be improved)

Response: Thank you for your comment. In Chapter 6 of the revised manuscript, we added the research contribution, made the research conclusions clear, and stated the limitations of the research. The English expression of this part has also been modified and polished.

Revisions to Chapter 6:

The main contribution of this research paper is to transform the unstructured content of UGC into structured content with risk assessment labels, and to test the effective feedback from the market by empirical study. The main conclusions of the article include the following four points:

(1) Topics commented on in UGC mostly reflect the hot issues that concern ordinary users.

(2) The UGC investment analysis postings (or the UGC Article Title) are more professional, and the UGC stock talks in the forum are more colloquial.

(3) Different UGC types lead to differences in topic recognition, language styles and knowledge characteristics. Therefore, subject identification and language analysis should be conducted separately.

(4) The effect of market feedback on UGC risk assessment knowledge features, which are measured by language analysis, is significantly different. The stronger the expressed risk characteristics, the lower the excess return of stock price. The empirical results show that due to the timeliness of stock talks, the market responds faster to the UGC Stock Talk……

 

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The paper has value for the reader.
