1. Introduction
Quantitative investment in stocks has long been a popular area of research. Quantitative investment techniques are designed to predict the direction of stocks through models, thus helping many retail investors make decisions that can lead to gains [1]. Most previous stock forecasting methods treat the stock problem as a time series modeling problem and use statistics-based methods to solve it [2,3]. In recent years, as the field of artificial intelligence has gained momentum, machine learning-based methods have proven to produce better results on time series problems [4,5,6,7,8,9,10,11]. However, machine learning methods have various limitations. Some research [4,5] uses support vector machines (SVMs) to solve time series problems, but an SVM is difficult to apply to large-scale training samples. Research [6] uses support vector regression (SVR), but for linear data, the performance of SVR is slightly inferior to that of linear regression. Research [7] uses extreme learning machines (ELMs); ELMs avoid local optima, overfitting, and long training times, but they have limited applicability in multilayer structures. Research [8,9] uses relevance vector machines (RVMs), but when the dataset is large, the training time of an RVM becomes too long. Heuristic algorithms (HAs) are an innovative technique in the field of machine learning [10], but they depend heavily on parameter choices and easily become trapped in local optima. For a stock market containing a large amount of complex data, traditional machine learning methods cannot consistently and accurately predict the market's direction; they are therefore not well suited to the stock prediction problem.
Deep reinforcement learning (DRL) is another method for building quantitative investment strategies. Its interactive trial-and-error learning mirrors how real-world organisms learn: the agent continuously interacts with a virtual financial market environment, obtains feedback from the market by trying various trading actions, and adjusts its strategy accordingly. As a result, DRL-based methods have yielded good results in the field of quantitative investment. Research [12] introduced a recursive deep neural network for real-time financial signal representation and trading, with the aim of training computers to beat experienced traders in financial trading. Research [13] proposed a reinforcement learning framework without financial modeling that provides a deep machine learning solution for investment portfolio management tasks. Research [14] applied deep reinforcement learning to stock trading and stock price prediction and demonstrated the reliability and usability of the model through experimental data. Research [15] proposed two methods for stock trading decisions: first, a nested reinforcement learning approach based on three deep reinforcement learning models, and second, a weighted random selection with a confidence strategy.
In addition, more and more stock prediction studies are incorporating sentiment analysis into their models. Research [16,17,18] argues that online comments related to the stock market affect retail investors' judgment, which in turn affects stock trading and the direction of the stock market. Research [19,20,21] demonstrated that adding a sentiment analysis module can improve stock prediction accuracy.
In our previous work, we combined stock data and stock bar comment text and then built a stock investment portfolio management model using the DRL method. Although the experimental results show that large gains are obtained, the model does not adequately process information about the market environment, and the model data are not utilized efficiently, which leads to unstable results. Therefore, this paper addresses the shortcomings of the previous work.
Research [22] proposes a multimodal DRL method that combines a CNN and LSTM as feature extraction modules and uses a DQN as the decision model for trading. That work generates images from stock data and processes them with the CNN and LSTM modules; the results show that the model achieves a significant increase in profit. Research [23,24] builds on this work, using multimodal data to extract stock features more fully.
State representation learning (SRL) is one line of research aimed at making models use training data effectively. The basic idea of SRL is to construct auxiliary training tasks that train a feature extractor. The extractor is placed in front of the actor and critic to pre-process the raw states and actions from the environment, so that the actor and critic receive more efficient, easier-to-process intermediate features as inputs, which speeds up training and improves results.
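To make this concrete, the sketch below shows the core structural idea in a minimal form: each SRL layer concatenates its input with a learned nonlinear projection, so the actor and critic ultimately see the raw state alongside extracted features. This is only an illustration under our own assumptions (layer sizes, `tanh` activation, and function names are ours); the auxiliary supervision that trains the weights is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_block(x, w, b):
    """One SRL layer: concatenate the input with its nonlinear projection,
    so downstream layers (and finally the actor/critic) see both the raw
    state and the learned features."""
    h = np.tanh(x @ w + b)          # learned features
    return np.concatenate([x, h])   # raw input preserved alongside features

state_dim, hidden = 8, 16
w1 = rng.normal(scale=0.1, size=(state_dim, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=(state_dim + hidden, hidden))
b2 = np.zeros(hidden)

state = rng.normal(size=state_dim)
z1 = dense_block(state, w1, b1)     # dimension 8 + 16 = 24
z2 = dense_block(z1, w2, b2)        # dimension 24 + 16 = 40
# z2 replaces the raw state as input to the actor and critic; in training,
# an auxiliary head (e.g., predicting the next state from z2) would supply
# the loss that updates w1, b1, w2, b2.
```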
The innovations of this paper are as follows:
A multimodal stock investment portfolio management model is proposed. Most previous stock investment portfolio management methods considered only raw stock data, or raw stock data supplemented with comment text. The input data for the model proposed in this paper include raw stock data, comment text data, and image data. We collected nearly 10 million stock bar comments, from 1 January 2022 to 3 April 2024, for 118 stocks. The raw stock data and the sentiment analysis index of the comment text are used as state inputs. We use historical stock data to construct three kinds of image data that reflect the long-term trend in the stock market. The image processing module consists of a CNN and an LSTM, which extract the overall dynamic characteristics of the market and the long-term time series change characteristics, respectively.
An SRL module is added to the model. Most previous stock investment portfolio management methods have not extracted information from the raw data before inputting it, and even fewer have used SRL to do so. The SRL module added in this paper helps the agent obtain more complete information about the stock market: the raw data passes through the SRL module before being input to the critic and actor layers.
Reinforcement learning algorithms based on the policy gradient method are used. Among existing RL-based stock investment portfolio management methods, value-based RL methods are more common. In this paper, we choose policy gradient-based RL methods. The action space of a policy gradient-based RL algorithm is continuous and therefore better suited to investment portfolio management tasks than that of a value-function-based RL algorithm.
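A brief illustration of why a continuous action space fits the task: a policy-gradient actor can emit one real-valued score per stock, which is then normalized into portfolio weights. The softmax mapping below is a common choice and an assumption on our part, not necessarily the exact mapping used in this paper; a value-based method would instead have to discretize this weight space.

```python
import math

def to_portfolio_weights(actor_outputs):
    """Map unconstrained actor outputs (one score per stock) to
    non-negative portfolio weights that sum to 1 via a softmax."""
    m = max(actor_outputs)                      # shift for numerical stability
    exps = [math.exp(a - m) for a in actor_outputs]
    total = sum(exps)
    return [e / total for e in exps]

# Three stocks: the highest score receives the largest allocation.
weights = to_portfolio_weights([0.2, -1.0, 0.7])
```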
This paper consists of five chapters. The first chapter is the introduction. In this chapter, we introduce some methods to solve the task of stock investment portfolio management. We also introduce the benefits of DRL in solving stock investment portfolio management tasks. Finally, we present the innovations of this paper. The second chapter is preliminaries. We introduce the basic knowledge of DRL and the basic knowledge of quantitative investing in this chapter. The third chapter is the methodology. We describe our methodology in detail. The fourth chapter is about experiments. We first describe the source of the dataset and the preprocessing of the dataset, and then we conduct comparative experiments. We select the most appropriate SRL method and the most appropriate RL method. Our method is then compared with 11 other methods. On all three datasets, our method obtained the best results. The fifth chapter is the conclusion. In this chapter, we discuss the strengths and limitations of our methodology and discuss the focus of future work.
4. Experiments
In this section, we introduce the dataset and perform comparative experiments. In Section 4.1, we describe the dataset sources. In Section 4.2, we describe the data preprocessing. In Section 4.3, we perform comparative experiments: we first select the most appropriate RL method, then the most appropriate SRL method, and finally compare our method with 11 other methods.
4.1. Dataset
The Chinese stock market is one of the most active and largest financial markets in the world today, with strong representativeness and research value. We chose the Chinese stock market as our dataset for the following three reasons:
- (1)
For many years, most research on the stock market has used the US stock market as the dataset, and there is less research on stock prediction in the Chinese stock market.
- (2)
The recovery of the Chinese economy is accelerating, and the total market capitalization of the Chinese stock market exceeded RMB 10 trillion in 2020. The Chinese stock market is large and representative enough to be a subject of research.
- (3)
The US stock market is dominated by institutional investors, while the Chinese stock market is dominated by retail investors. From the inception of the Chinese stock market to 2023, the number of retail investors has exceeded 220 million. Because retail investors are strongly influenced by sentiment, this makes the Chinese stock market well suited to sentiment-based prediction.
In addition, the dataset has limitations. The history of the Chinese stock market is relatively short. Some companies have too short a history, resulting in missing stock data; others have fewer investors, resulting in fewer stock bar comments. These companies cannot provide valid data for the experiment and therefore need to be excluded; we removed such stocks during data preprocessing. Based on the list of the top 150 listed companies in China published by the Walton Institute of Economic Research, this paper selects 118 companies that appeared on the list for two consecutive years, 2022–2023, as the source of the stock trading dataset. The list of these 118 companies is given in Appendix A (Table A1).
East Money (https://www.eastmoney.com/) is one of the most visited and influential financial and securities portals in China. Since its launch in March 2004, East Money has maintained authoritative and professional content, covering various financial fields and updating tens of thousands of data points and news items daily. Therefore, this paper chooses the stock bars under this website as the source of the dataset for sentiment analysis.
The Shanghai Stock Exchange Composite Index (SSEC) is one of the most authoritative stock indexes in China. Companies listed on the Shanghai Stock Exchange are usually industry mainstays or even industry leaders, so changes in their stock prices both reflect and influence broad movements in China's stock market. Therefore, the image data added to the model in this paper were plotted from SSEC stock data.
4.2. Data Preprocessing
We use crawling techniques in Python to crawl posts from the comment sections of the stock bars. We collected all posts in the stock bar of each of the 118 stocks for each day from 1 January 2022 to 2 April 2024. These posts contained nearly 10 million comments, which we processed into text form and then fed to SnowNLP to obtain a sentiment analysis index. The sentiment analysis index is a number in the interval [0, 1], where a higher number indicates a more positive sentiment. SnowNLP is a Python library, inspired by TextBlob, written to make it easier to work with Chinese text. SnowNLP uses Bayesian machine learning methods trained on text; when text is fed into SnowNLP, the output is a number between 0 and 1, and the closer the output is to 1, the more positive the sentiment of the text. SnowNLP has a high accuracy rate on Chinese text. In this paper, we use the sentiment analysis module of SnowNLP to process the stock bar posts in our dataset. We define the average of the sentiment analysis indexes for each day of each of the 118 stocks as As_{c,d}, where c denotes the company and d denotes the date. Suppose a company is called c and today's date is d. We collect all the comments under the stock bar of c on day d; suppose there are 100 of them. We feed these 100 comments into SnowNLP and obtain 100 sentiment indexes, each ranging from 0 to 1, with higher values representing more positive sentiment. We then average these 100 indexes; this average is As_{c,d}. The specific process is shown in Figure 5: we collect comments on each company from its stock bar, the comments are fed into SnowNLP, and the average of the sentiment analysis indexes of all comments from company c on day d is As_{c,d}.
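The averaging step above is simple enough to sketch directly. In the sketch below, the function name and the `score_fn` parameter are our own; in practice, `score_fn` would be `lambda t: SnowNLP(t).sentiments`. Taking scoring as a parameter keeps the sketch independent of the SnowNLP dependency.

```python
def daily_sentiment_index(comments, score_fn):
    """Average per-comment sentiment scores (each in [0, 1]) into a
    single daily index for one stock; None if there are no comments."""
    if not comments:
        return None
    scores = [score_fn(text) for text in comments]
    return sum(scores) / len(scores)

# With SnowNLP installed, one day's index for one stock would be:
#   from snownlp import SnowNLP
#   index = daily_sentiment_index(posts, lambda t: SnowNLP(t).sentiments)
```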
Among these 118 stocks, some have missing data and some have too few comments, so we excluded them. We eliminated 28 stocks with a low number of comments and then processed the daily average sentiment analysis index (As) of each remaining stock, together with its raw stock data, into a set of tables. Part of one table is shown in Table 2 (using the stock 000001 as an example). A DataFrame represents a table and is a spreadsheet-like data structure; it contains an ordered collection of columns, each of which can hold values of a different type. A DataFrame has both row and column indexes and can be seen as a dictionary of Series. The input to the method proposed in this paper is a DataFrame; Table 2 shows what these inputs look like.
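As an illustration, a per-stock input table could be assembled as below. The column names (`open`, `close`, `volume`, `as_index`) and the values are placeholders of ours, not necessarily the exact schema used for Table 2.

```python
import pandas as pd

# Raw stock data for one stock (illustrative values), one row per trading day.
raw = pd.DataFrame(
    {
        "open":   [10.1, 10.3, 10.2],
        "close":  [10.3, 10.2, 10.5],
        "volume": [1.2e6, 9.8e5, 1.1e6],
    },
    index=pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06"]),
)

# Daily averaged sentiment index in [0, 1] from the stock-bar comments.
sentiment = pd.Series([0.62, 0.48, 0.55], index=raw.index, name="as_index")

# One DataFrame per stock: columns of different types, rows indexed by
# date -- the "dictionary of Series" view described above.
table = raw.join(sentiment)
```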
4.3. Comparison Experiment
We divide the 90 stocks into three datasets, called Group A, Group B, and Group C, with 30 stocks in each. As shown in Figure 6, we use the data from two years, 2022 and 2023, as the training set, the data from January 2024 as the validation set, and the data from February to March 2024 as the test set.
For all of the comparison experiments that follow, every method except MDOSA (hereafter, we refer to the model proposed in this paper as MDOSA) uses the model structure and hyperparameters from its original paper, but is kept consistent with MDOSA in key parameters such as batch_size and n_updates. The hyperparameter settings for the MDOSA model are shown in Table 3.
Please note that we used Python version 3.7.13 and PyTorch version 1.13.1. With different versions of Python and PyTorch, results may differ noticeably from those reported in this paper.
4.3.1. Comparative Experiments with RL Algorithms
In Section 2.1.2 of this paper, we presented four policy gradient-based RL algorithms. Each of the four has its own advantages and areas of application. To find the algorithm best suited to the dataset of this paper, we tested the four algorithms on the three datasets. Please note that in this comparison experiment, we did not include image data; only textual sentiment analysis data was included in the states.
We introduced the evaluation indicators in Section 2.2.2 of the paper. Because of the randomized nature of RL training, we trained each algorithm six times and report in the table below the mean and standard deviation over the six runs for the three indicators. For all three indicators, a smaller standard deviation is better; for all indicators except the max drawdown, a larger average is better. Of the three indicators, the Sharpe ratio measures the excess return per unit of risk taken, i.e., the cost effectiveness of the investment, so it is the indicator we prioritize. The test results for Groups A, B, and C are shown in Table 4. The first place for each indicator is marked in red.
Because the full indicator names are long, in the first row of the tables we abbreviate the average of cumulative returns, the standard deviation of cumulative returns, the average of the Sharpe ratio, the standard deviation of the Sharpe ratio, the average of the max drawdown, and the standard deviation of the max drawdown as ACR, SDCR, ASR, SDSR, AMD, and SDMD, respectively. We continue to use the full name of each indicator in the analysis below.
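For reference, the three underlying indicators can be computed from a series of daily portfolio values as sketched below. The annualization factor of 252 trading days and the omission of a risk-free rate in the Sharpe ratio are assumptions of this sketch, not details taken from the paper.

```python
import math

def indicators(values, periods_per_year=252):
    """Cumulative return, annualized Sharpe ratio, and max drawdown
    from a series of portfolio values, one value per trading day."""
    returns = [v1 / v0 - 1 for v0, v1 in zip(values, values[1:])]
    cumulative = values[-1] / values[0] - 1

    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    sharpe = mean / math.sqrt(var) * math.sqrt(periods_per_year) if var > 0 else float("nan")

    peak, max_dd = values[0], 0.0
    for v in values:
        peak = max(peak, v)                       # running peak so far
        max_dd = max(max_dd, (peak - v) / peak)   # worst drop from a peak
    return cumulative, sharpe, max_dd
```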
We analyze the results as follows:
- (1)
Cumulative returns: TD3 has the largest average cumulative return in Groups A and C, and DDPG has the largest in Group B. DDPG has the smallest standard deviation of cumulative returns in Groups A and C, and TD3 the smallest in Group B. This shows that both DDPG and TD3 can achieve high returns while remaining stable.
- (2)
The Sharpe ratio: TD3 has the largest average Sharpe ratio in Group A, while DDPG has the largest in Groups B and C. DDPG also has the smallest standard deviation of the Sharpe ratio on all three datasets, whereas the standard deviation for TD3 is particularly large in Groups A and C. Given the complexity and erratic direction of stock market data, TD3 is therefore not a good choice for investors who want returns at low risk. As a result, DDPG performs best on the Sharpe ratio, the indicator we prioritize.
- (3)
The max drawdown: The performance of the four algorithms in terms of the max drawdown is not very different. DDPG demonstrates a moderate level of competence in this area.
Thus, on a comprehensive basis, DDPG is the most appropriate algorithm.
4.3.2. Comparative Experiments on SRL Models
Having selected DDPG as the training algorithm for the agent, we next select the SRL model. Please note that in this comparison experiment, we added not only textual sentiment analysis data but also image data.
In Section 3.2.1 of this paper, we introduced three SRL models. Each has its own merits; to find the model best suited to the dataset in this paper, we tested the three models on the three datasets. We uniformly use DDPG as the training algorithm, and each SRL model uses a two-layer structure. The evaluation indicators are handled in the same way as in the RL algorithm comparison experiments. The test results for Groups A, B, and C are shown in Table 5. The first place for each indicator is marked in red.
We analyze the results as follows:
- (1)
Cumulative returns: OFENet has the largest average cumulative returns on all three datasets and the smallest standard deviation of cumulative returns in Groups B and C. This shows that OFENet has strong stability while earning high returns.
- (2)
The Sharpe ratio: OFENet has the largest average Sharpe ratio on all three datasets and the smallest standard deviation of the Sharpe ratio in Groups B and C. This shows that OFENet consistently achieves a high cost effectiveness of investment.
- (3)
The max drawdown: OFENet also achieves competitive scores on the max drawdown.
Clearly, OFENet is the most appropriate SRL model. Since we use OFENet as the SRL module and DDPG as the RL algorithm, we call the resulting model the multimodal DDPG model combining OFENet and sentiment analysis, or MDOSA for short.
4.3.3. Comparative Experiments of All Methods
In past work, researchers have designed many statistics-based solutions for the investment portfolio management task. We therefore selected five statistics-based methods to compare with the MDOSA model [13,33]. After reviewing a large number of references, we found five methods that are widely preferred: the best constant rebalanced portfolio, best stock, uniform buy and hold, uniform constant rebalanced portfolio, and the universal portfolio. All five are applicable to the stock investment portfolio management task and frequently appear as benchmark methods in related research. Research [13] uses the best constant rebalanced portfolio and the universal portfolio, with experimental results showing that both perform well. Research [33] uses the best stock, uniform buy and hold, and the uniform constant rebalanced portfolio, also with good results. We therefore adopt these five methods as the benchmark methods in this paper. A brief description of each follows:
- (1)
Best constant rebalanced portfolio: using the given historical returns, find the optimal asset allocation weights.
- (2)
Best stock: choose the best performing stock based on historical data and simply invest in that stock.
- (3)
Uniform buy and hold: funds are evenly distributed at the initial moment and subsequently held at all times.
- (4)
Uniform constant rebalanced portfolio: adjust the allocation of funds at every moment to always maintain an even distribution.
- (5)
Universal portfolio: the returns of many kinds of investment portfolios are calculated based on statistical simulations, and the weighting of these portfolios is calculated based on the returns.
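Two of these baselines are simple enough to sketch directly. Given a matrix of daily price relatives (today's close divided by yesterday's close, one entry per stock), UBH and UCRP can be computed as below; this is our own illustrative reading of the standard definitions, not code from the cited papers.

```python
def ubh_return(price_relatives):
    """Uniform buy and hold: split capital evenly once, then hold.
    price_relatives[t][i] = price of stock i at day t+1 / price at day t."""
    n = len(price_relatives[0])
    holdings = [1.0 / n] * n                # value held in each stock
    for day in price_relatives:
        holdings = [h * r for h, r in zip(holdings, day)]
    return sum(holdings)                    # final wealth per unit invested

def ucrp_return(price_relatives):
    """Uniform constant rebalanced portfolio: rebalance back to equal
    weights at every step."""
    n = len(price_relatives[0])
    wealth = 1.0
    for day in price_relatives:
        wealth *= sum(r / n for r in day)   # equal-weight portfolio return
    return wealth

rel = [[1.10, 0.95], [0.90, 1.05]]          # two days, two stocks
```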
For convenience, we abbreviate these five methods as BCRP, BS, UBH, UCRP, and UP, respectively. In addition, we include all four RL models, DenseNet (based on DDPG), and D2RL (based on DDPG) in the comparison experiments. The test results for the three datasets are shown in Table 6, Table 7, and Table 8, respectively. Please note that there is no randomization in the five statistics-based methods, so the standard deviation of all three indicators is zero for them. As before, the first place for each indicator is marked in red.
We analyze the results as follows:
- (1)
Cumulative returns: MDOSA has the largest average cumulative returns on all three datasets and the smallest standard deviation of cumulative returns in Group B; DDPG has the smallest standard deviation in Groups A and C. This shows that MDOSA has the highest yield and that both it and DDPG are strongly stable.
- (2)
The Sharpe ratio: MDOSA has the largest average Sharpe ratio on all three datasets and the smallest standard deviation of the Sharpe ratio in Groups B and C. This shows that MDOSA achieves the highest cost effectiveness of investment with strong stability.
- (3)
The max drawdown: MDOSA performs moderately well in this area.
It is clear that MDOSA has the best overall performance on the three datasets. To see the characteristics of each method more clearly, we present them in radar charts. The radar charts for the three datasets are shown in Figure 7, Figure 8, and Figure 9, respectively. All indicators have been max–min normalized. Additionally, for intuitive comparison, we multiply the smaller-is-better indicators by −1 so that all indicators are larger-is-better. Please note that the standard deviations of the three indicators for the statistics-based methods are not processed in this way.
5. Conclusions
In this paper, we develop a multimodal model for the stock investment portfolio management task (MDOSA). We utilize SRL to process the raw environmental information and input both textual sentiment analysis data from stock reviews and image data representing the overall direction of the market to the agent. A policy gradient-based RL algorithm is chosen for training the agent. The biggest advantage of our method over previous methods is that it obtains high returns while maintaining strong stability, which most previous methods could not guarantee simultaneously. We tested the method on three datasets to verify this stability: some methods achieve a high return on one dataset but fail to do so on another, whereas our method obtains a high return on all three. In addition, our method is highly cost-effective: the Sharpe ratio represents the cost effectiveness of an investment portfolio, and on all three datasets our method has the largest Sharpe ratio. We compared MDOSA with 11 other methods, and MDOSA generally outperformed all of them.
The real-life constraint of transaction costs is considered in this paper. Other constraints, such as liquidity and bid–ask bounce, are not. Bid–ask bounce refers to the apparent oscillation of the transaction price that arises when successive trades alternate between the bid and ask quotes. Such constraints can affect stock market trading; this is one of the limitations of the current research, and we will improve our methodology in future research to address it. The max drawdown measures a method's ability to resist risk in a market downturn. From the experimental results, MDOSA did not demonstrate the strongest risk resistance in terms of the max drawdown; in future work, we will continue to look for ways to strengthen MDOSA's risk resistance in market downturns. In the next study, we will also consider the relationships between stocks, because doing so enhances the risk tolerance of the method. There are various relationships between companies: some are competitive, some mutually beneficial, and some are affiliations. A change in one company's stock may therefore change the direction of another company's stock. In our future work, we will build a network of stock relationships to obtain stock prediction methods with better results.
Explainability, broadly speaking, means having access to enough comprehensible information when we need to understand or solve something. Explainable deep learning and explainable reinforcement learning are both subproblems of explainable artificial intelligence and are used to enhance human understanding of models. Since our methodology is based on DRL, a clear explanation of the basis on which the model generates decisions would help investors understand our method. However, so far there is no unified explainability method for either explainable deep learning or explainable reinforcement learning. The issue remains challenging, but it certainly needs to be addressed. In future work, we will research explainability and try to explain clearly, from within the model, the basis on which the agent generates its decisions.