1. Introduction
Asset exchange can be very profitable if done properly. The values of commonly traded assets, such as cryptocurrencies, fluctuate constantly due to the interactions of market participants. Therefore, if one can predict which assets have a high chance of gaining value, those assets can be acquired and later sold, generating profits for the investor. Profits can also be generated when the value of assets decreases; short selling and margin trading services work under this premise. Predicting which assets have the potential to increase in value is difficult because trading is linked to human sentiment, which changes rapidly when important global events occur, for instance, political elections, international agreements, or natural disasters, and is also easily manipulated by branding or marketing campaigns [1]. Nonetheless, asset trading is commonly practiced by both humans and algorithms throughout the world, and the most commonly traded assets are stocks, fiat money, and cryptocurrencies.
In the last decade, cryptocurrencies have gained worldwide relevance despite initial public skepticism toward these assets. More and more countries allow the use of cryptocurrencies as a payment method, and in June 2021, the Republic of El Salvador became the first country to adopt Bitcoin as legal tender (https://www.nytimes.com/2021/06/09/world/americas/salvador-bitcoin.html, accessed on 6 August 2021). However, the price of Bitcoin has been extremely volatile throughout its history and behaves very differently from traditional assets. To overcome the shortcomings of this cryptocurrency, hundreds of alternative tokens, known as ‘altcoins’, have sprung up around the world [1], which in turn has increased the number of platforms offering cryptocurrency services such as trading. These platforms have significant advantages over those for other types of assets, the most important being access to the general public and the minimal amount of cash required to start trading.
We propose a deep reinforcement learning (DRL) approach for cryptocurrency trading, where a self-attention (SA) network is the backbone of the system. Cryptocurrency markets contain hundreds of assets that can be exchanged. Therefore, being able to process massive amounts of data allows the system to gain a better understanding of the market state and the individual state of each asset, which in turn produces better diversification of investments and reduces risk [2].
This work also explores the problem of inter-asset transactions (IAT). Most markets charge a fee for each transaction made on their platforms, and cryptocurrency markets are no exception. Therefore, a tool to compute the transactions that lead to the minimum possible fee expenditure is important. This issue is even more significant when the trading frequency is high. For example, if an investor trades a cryptocurrency volume worth 10,000 USD every six hours with a fixed transaction fee of 0.1%, then after only one month the total expenditure due to transaction costs would be 1200 USD, a non-trivial amount compared to the traded volume. We study this problem in depth and propose three algorithms to perform optimal IAT for different applications.
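The fee figure above can be verified with simple arithmetic: trading every six hours means four trades per day. A minimal sketch, using the hypothetical volume, fee rate, and horizon from the example:

```python
# Back-of-the-envelope check of the fee example above. The volume, fee
# rate, and horizon are the illustrative figures from the text.

def total_fees(volume_usd, fee_rate, trades_per_day, days):
    """Cumulative fees for a constant traded volume and proportional fee."""
    return volume_usd * fee_rate * trades_per_day * days

# Trading every six hours -> 4 trades per day; 0.1% fee on 10,000 USD.
monthly_fees = total_fees(10_000, 0.001, 4, 30)  # -> 1200.0 USD
```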
The main contributions of this work are summarized below.
A self-attention NN architecture for cryptocurrency trading is proposed. The NN is trained using DRL and is independent of the number of assets in the market.
An analysis of the information flow inside the NN is given, explaining how and why the architecture works.
The problem of IAT is studied in depth. This includes the description and theoretical justification of three algorithms, and the evaluation of the speed and accuracy of the algorithms.
The rest of the paper is organized as follows.
Section 2 discusses the relevant points of related literature.
Section 3 presents the mathematical formulation of the problem.
Section 4 describes various algorithms that allow the computation of optimal transaction fees.
Section 5 describes the deep neural network (NN) architecture used for cryptocurrency trading as well as the training method.
Section 6 contains the setup of the experiments implemented to evaluate the performance of our method.
Section 7 presents the results of the experiments along with a discussion of the most relevant findings. Finally, conclusions are drawn in
Section 8.
2. Literature Review
One of the earliest studies on reinforcement learning applied to asset trading can be found in [
3]. In that work, the authors proposed an algorithm named recurrent reinforcement learning (RRL) applied to stock market trading. The authors explored the use of well-known financial measures, such as the Sharpe ratio, as reward functions in the process. Those ideas were expanded in [
4], where they compared their approach to other methods. Later on, Gold [
5] used a similar approach for currency trading in the foreign exchange (FX) market and focused on comparing different neural networks (NNs) to obtain an optimal architecture for the problem. Dempster and Leemans [
6] proposed adaptive reinforcement learning to the FX market. The method is a complex system, in which decisions are based on performance, risk measures, and a dynamic optimization process. Maringer and Ramtohul [
7] proposed a threshold RRL method, which uses two networks trained in tandem to deal with two different market states, one for a volatile market and the other for a non-volatile one. They argued their approach outperforms the original RRL because it is difficult to model the entire behavior of a market with a single network. Zhang and Maringer [
8] also based their approach on RRL and showed that the results of technical and fundamental analysis can be used as inputs for the neural network along with price data. However, the approach requires human expertise to handcraft the indicators fed into the network.
More recently, due to the advancement of NN techniques, training deeper and more powerful NNs using reinforcement learning has become feasible [
9], making architectures such as convolutional and recurrent networks popular in asset trading research. Deng et al. [
10] proposed recurrent neural networks (RNNs) to trade stocks and used fuzzy logic to determine the trending state of the market. Jiang and Liang [
11] used a policy gradient method to train a convolutional neural network (CNN) for cryptocurrency trading. Bu and Cho [
12] proposed a two-step approach for cryptocurrency trading with a pre-trained deep RNN using a Boltzmann machine, and they trained it using the Double Q-learning algorithm, reporting positive returns even when the market state did not have an increasing tendency. Pendharkar and Cusatis [
13] compared the performance of multiple DRL techniques for stock market training. Liang et al. [
14] compared multiple DRL methods for asset trading as well, but they proposed adversarial training to all the studied methods to improve model performance. Jeong and Kim [
15] used Q-learning to train a deep NN to trade stocks. In addition, they applied transfer learning to the case where a low amount of data is available. Wu et al. [
2] studied the effects of gated recurrent units (GRUs) [
16] as time series feature extractors for trading a single asset in a stock market. They compared two different approaches: deep Q-learning and policy gradient [
17]. They found that both methods are appropriate for asset trading and concluded that trading a single asset is risky, so diversifying investments should be preferred. Aboussalah and Lee [
18] proposed a method named stacked deep dynamic reinforcement learning (SDDRL) for real-time stock trading, and argued the selection of the appropriate hyper-parameters is especially important in this type of problem. To deal with this issue, they proposed a Bayesian approach for hyper-parameter tuning. Park et al. [
19] proposed an LSTM [
20] network trained with deep Q-learning for stock market trading. Lei et al. [
21] proposed a similar approach using a GRU network trained with Policy Gradient.
In a previous work [
22], we proposed a DRL method for cryptocurrency trading, which can adapt to new assets introduced suddenly to the market. This work follows that line of study and tackles some important issues that were not covered in the mentioned study. One of the differences between this study and [
22] is the architecture of the NN. In [
22], the extracted asset features are combined by a softmax layer, which simply normalizes the values proposed by the feature extractor to generate the final output. To allow a real exchange of information between the asset features, we integrate a series of self-attention layers into the new architecture. These layers take the information from each asset one by one and, once all the data have been received, generate the output corresponding to each asset. Therefore, the new blending mechanism has a general overview of the market as well as the individual information of each asset, allowing it to generate more reliable outputs. Another difference between this work and [22] is the inclusion of IAT in the asset exchange process, which allows the new method to spend the optimal amount of capital during the transactions, an important issue when assets are exchanged constantly. The proposed methods for IAT are covered in detail in
Section 4.
Table 1 summarizes the main contributions of the discussed works.
3. Problem Formulation
A trading session is the total amount of time in which an agent is allowed to conduct transactions to generate profits. The session is divided into equal-length steps, and at each step three phases take place: selection, exchange, and waiting. The selection phase is carried out by a neural network, which computes a non-negative scalar for each asset, representing the weight or percentage of that asset with respect to the entire set of investments to be held in the coming period. The design of the NN used in the selection phase and its training procedure are described in detail in
Section 5. After the selection process, some of the assets held by the agent have to be exchanged to obtain the assets chosen by the selection procedure. The transaction fees pre-established by the trading service provider are considered in computing the quantities to be exchanged. The algorithms developed for computing the transactions that lead to the lowest possible transaction fees are described in
Section 4. After computing the transactions, these are executed in the market and a waiting period begins. This period allows the investments to mature and generate profits or losses to the agent. Once the waiting period finishes, the process moves to a new step, beginning with the selection of new assets for the new period. The process is repeated until the trading session is complete.
Let $i$ denote an asset in the market and $k$ a step in the trading session. The shares of asset $i$ held by the agent at the beginning of step $k$ are denoted by $v_i^k$. If there are $N$ assets in the market, then $i \in \{0, 1, \ldots, N-1\}$. The index 0 is reserved for the cash asset, whose value is assumed to be constant throughout the trading process. Therefore, this asset is used as a reference, i.e., the shares of all the assets ($v_i^k$) are measured in units of this asset. In stock markets, the USD is used for this purpose, but in cryptocurrency markets, the most popular choice is the USDT. The sum of $v_i^k$ over all $i$, denoted $V^k$, is the total capital of the investor at the beginning of step $k$. The variables corresponding to the shares held by the agent at the end of period $k$ are denoted similarly: $\hat{v}_i^k$ represents the individual asset shares and $\hat{V}^k$ is their sum. The trading process can be summarized as follows. The investor begins the trading session with a capital $V^0$ distributed among some assets $v_i^0$. Then, the neural network suggests some suitable assets for period $k$, and some assets are exchanged; the result of those transactions is the set of shares $v_i^k$. A waiting period follows, allowing the values of the assets to evolve into $\hat{v}_i^k$, reaching the end of the step, where a new set of assets is chosen by the NN for period $k+1$. This process is repeated until reaching the final step $K$, where the performance of the agent is computed as $\hat{V}^K - V^0$, the difference between the final and initial capital.
The transition between period $k$ and $k+1$ consists of a set of transactions summarized in Equation (1), where both $i$ and $j$ represent each integer from 0 to $N-1$:

$v_i^{k+1} = \hat{v}_i^k - \sum_j x_{i,j} - \sum_j c\,x_{i,j} + \sum_j x_{j,i}. \qquad (1)$

Thus, Equation (1) is in fact a system of $N$ equations, one for each asset. Note, for simplicity, the upper limit in the summation symbols has been dropped; the upper limit in all expressions is $N-1$ unless otherwise specified. The amount transferred from asset $i$ to asset $j$ and the fee for that transaction are written as $x_{i,j}$ and $c\,x_{i,j}$, respectively, where $c$ is the fee rate. Equation (1) can be interpreted as follows. The shares of asset $i$ after executing the transactions ($v_i^{k+1}$) are equal to the shares before the transactions ($\hat{v}_i^k$), minus the shares transferred from that specific asset to each of the other assets ($\sum_j x_{i,j}$), minus the fees spent on those transactions ($\sum_j c\,x_{i,j}$), plus the shares received from the rest of the assets ($\sum_j x_{j,i}$). Note, all the variables in Equation (1) are non-negative numbers.
Then, since the output of the NN is the relative weights of the assets with respect to the total capital, Equation (1) has to be modified to contain the asset weights instead of the shares. This is accomplished by dividing Equation (1) by $\hat{V}^k$, as in [22], resulting in Equation (2), where the old and new weights are denoted by $\hat{w}_i$ and $w_i$, respectively, $t_{i,j}$ are the transfers with respect to the weights, and $\mu$ is a rate representing the capital lost due to the transactions:

$\mu\,w_i = \hat{w}_i - \sum_j t_{i,j} - \sum_j c\,t_{i,j} + \sum_j t_{j,i}. \qquad (2)$

The definition of these variables is shown in Equation (3):

$\hat{w}_i = \frac{\hat{v}_i^k}{\hat{V}^k}, \quad w_i = \frac{v_i^{k+1}}{V^{k+1}}, \quad t_{i,j} = \frac{x_{i,j}}{\hat{V}^k}, \quad \mu = \frac{V^{k+1}}{\hat{V}^k}. \qquad (3)$
Note, the sum of all the weights $\hat{w}_i$ equals one, and the same is true for the weights $w_i$. Therefore, by adding the $N$ expressions represented by Equation (2), i.e., from $i$ equal to 0 to $N-1$, and simplifying the result (the transfer terms $\sum_i \sum_j t_{i,j}$ and $\sum_i \sum_j t_{j,i}$ cancel each other), we arrive at Equation (4), providing an easy way to evaluate the discount rate $\mu$, which becomes one minus the fees due to all the transactions. Therefore, it can be used as the objective function of an optimization process to compute the optimal exchange amounts. Note, since it is not possible to transfer any amount from an asset to itself, both $t_{i,i}$ and $x_{i,i}$ are assumed to be zero, leading to Equation (4):

$\mu = 1 - c \sum_i \sum_j t_{i,j}. \qquad (4)$
The optimal solution for the exchange fee minimization problem must satisfy Equation (2) for all $i$ and maximize $\mu$ in Equation (4). In addition, the solution has to satisfy Equation (5) for all $i$ as well:

$\sum_j t_{i,j} \le \hat{w}_i. \qquad (5)$

The set of inequalities in Equation (5) simply states that the sum of the transferred amounts from one asset to the rest cannot exceed the amount held of that specific asset at that moment. The process described in this section is summarized in
Figure 1.
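The accounting identities of this section can be checked numerically. The sketch below builds a random feasible set of weight transfers and verifies that the weight-balance relation (Equation (2)), the discount rate (Equation (4)), and the outflow constraints (Equation (5)) hold together; the weights, transfers, and fee rate are hypothetical values, not market data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 4, 0.001                        # number of assets and fee rate (hypothetical)

w_old = rng.random(N)
w_old /= w_old.sum()                   # old weights sum to one

# Small feasible transfers t[i, j]: weight moved from asset i to asset j.
t = rng.random((N, N)) * 0.1 * w_old[:, None]
np.fill_diagonal(t, 0.0)               # no transfers from an asset to itself

# Equation (5): total outflow from each asset cannot exceed its weight.
assert np.all(t.sum(axis=1) <= w_old)

# Equation (4): discount rate = one minus the fees of all transactions.
mu = 1.0 - c * t.sum()

# Equation (2), solved for the new weights: outflows, fees, and inflows.
w_new = (w_old - (1.0 + c) * t.sum(axis=1) + t.sum(axis=0)) / mu

assert np.isclose(w_new.sum(), 1.0)    # new weights also sum to one
```

The final assertion illustrates why the transfer terms cancel when the equations are summed over all assets: only the fee term survives, which is exactly the discount rate.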
5. DRL for Asset Trading
DRL is a type of machine learning combining deep learning with reinforcement learning, enabling agents to take actions in a given state that maximize their rewards in a virtual world. State, action, and reward are three key concepts of reinforcement learning. At each state, the agent observes its surroundings and takes an action based on the observation. Then, the state of the environment changes to a new state and the agent receives a numerical reward based on its action. The changes between states are driven by both the environment dynamics and the agent’s actions. Thus, good actions are those producing high rewards or driving the environment into states where the expected reward can be maximized. The process is complete when a maximum number of states has been visited or a stop condition is met. The period from the initial state to the terminal state is called an episode, and the sum of all the rewards collected by the agent during an entire episode is the total reward. Therefore, the purpose of reinforcement learning is to train agents to learn a policy that makes decisions producing the maximum possible total reward in a specific environment.
In our case, the environment is the cryptocurrency market, the agent is the investor and the rewards are the profits generated in the process. An action in our environment is the execution of the transactions needed to acquire the assets suggested by the NN. The NN observes the environment state, i.e., market prices and volumes, and outputs a set of suggested initial asset weights for the subsequent period of the process. Then, the fee minimization algorithm computes the necessary transactions to acquire the assets suggested by the NN with the minimal possible transaction fees. Next, the transactions are executed, allowing the market to evolve into a new state. This process is depicted in
Figure 1. The rewards received by the agents are the earnings or losses received at the end of each step due to the chosen transactions.
Our NN is not limited by the number of assets in the market. The initial layers of the NN process the data of each asset individually. Then, the generated feature vectors are fed one by one to a series of self-attention layers which produce the initial asset weights for the subsequent step of the process. Self-attention networks receive and output sequences of data. This characteristic allows our NN to process a large number of assets and continue working even if new assets are added to the market, which happens often in cryptocurrency markets.
Our agent is allowed to perform daily transactions. Thus, an episode in the environment consists of a specific number of days, in which the agent is allowed to buy and sell assets at every 24-h time interval. The state of the environment consists of the current prices, capitalizations, and volumes of the assets in the market, of which the agent is allowed to observe the data corresponding to the latest 20 days. The action consists of the set of transactions made by the agent before a 24-h waiting period, and it is executed in two steps: asset selection (done by the NN) and transaction fee minimization (done using Algorithm 1). The reward of the step is the capital gained or lost by the agent after the waiting period.
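The selection–exchange–waiting cycle described above can be sketched as a toy loop. The policy and exchange model below are placeholder stubs (equal weights over all assets, worst-case fee charged on the whole rebalanced volume), not the paper's actual NN or fee-minimization algorithm; the returns are hypothetical.

```python
def select_weights(history):
    """Placeholder policy: equal weights over all observed assets."""
    n = len(history[-1])
    return [1.0 / n] * n

def rebalance(capital, weights, fee_rate=0.001):
    """Placeholder exchange: charge the fee on the whole rebalanced volume."""
    capital -= capital * fee_rate
    return [capital * w for w in weights]

# One toy episode over hypothetical daily returns for three assets.
history = [[1.0, 1.0, 1.0]]
capital = 1000.0
for day_returns in [[1.02, 0.99, 1.01], [1.00, 1.03, 0.98]]:
    holdings = rebalance(capital, select_weights(history))      # select + exchange
    holdings = [h * r for h, r in zip(holdings, day_returns)]   # waiting period
    capital = sum(holdings)                                     # basis of the reward
    history.append(day_returns)
```

In the real system, `select_weights` is the self-attention NN and `rebalance` is replaced by the fee-minimization algorithm of Section 4.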
5.1. Implementation Details
The environment was built using data from the Binance cryptocurrency market (www.binance.com, accessed on 3 November 2020), and it corresponds to the period August 2017 to November 2020. The features of the dataset are summarized in
Table 2. The dataset was divided into two parts: training and test, where the test dataset is the last year in the dataset. Note, the total number of financial indicators in the dataset is nine, but only six of them were used: open, close, high, and low prices, plus volume, and the number of transactions per sampling period.
5.2. The Deep NN for Asset Trading
An NN processes numeric data of the market’s assets, such as prices and volumes, and outputs the distribution of assets to be held in the subsequent periods of the process. The initial layers of the network extract independent high-level features for each asset, and a set of self-attention layers [
24] blend that information to produce the output. The components of our NN are shown in
Figure 5. The data fed to the NN is an arrangement of financial indicators with dimensions $F \times N \times K$, where $F$ is the number of financial indicators, $N$ is the number of assets, and $K$ is the number of past steps the agent is allowed to observe. The number of financial indicators equals six, as mentioned in
Section 5.1, and $F$ is considered to be the number of channels of the input data. The number of assets in the dataset increases over time from 3 to 227 (see
Table 2); therefore, $N$ is not a constant. The number of past steps ($K$) depends on two factors: the number of days and the sampling time. The data used in our experiments have a sampling time of 30 min, and the number of days our agent is allowed to observe is 20; therefore, $K$ equals 960.
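The value of $K$ follows directly from the sampling parameters stated above:

```python
# K = (observed days) x (samples per day), with 30-minute sampling
# and a 20-day look-back window, as stated in the text.
samples_per_day = (24 * 60) // 30   # 48 half-hour samples per day
K = 20 * samples_per_day            # -> 960 past steps
```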
Since the channels of the data have very different ranges of values, normalization was applied before feeding the data to the NN. The normalization consists of subtracting the mean of each channel and dividing the result by the standard deviation of the channel, yielding a distribution with zero mean and unit standard deviation. Then, for each asset, the normalized inputs are independently passed through a sequence of two convolutional layers, which are the primary feature extraction mechanism of the NN. The extracted features are flattened and passed through a fully connected (FC) layer to produce a set of feature vectors. The feature vectors are directed one by one to the blending mechanism of the network, consisting of a set of three self-attention encoders [
24].
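The per-channel normalization can be sketched as follows. This is one plausible reading of the description above (statistics computed per channel, across assets and time); the shapes are illustrative.

```python
import numpy as np

def normalize(x):
    """Zero-mean, unit-std normalization of each channel of an (F, N, K) array."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + 1e-8)   # small epsilon avoids division by zero

# Illustrative input: F=6 indicators, N=4 assets, K=960 past steps.
x = np.random.default_rng(1).random((6, 4, 960))
z = normalize(x)
```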
The design of the self-attention encoders used in our NN is shown in
Figure 6a. The self-attention encoders consist of two different sub-layers. The first sub-layer is a multi-head self-attention block, meaning it contains parallel feature extractors (heads), which generate independent feature vectors from the inputs. The number of heads in our self-attention blocks is eight. The distinct feature vectors generated by the heads are combined into a single feature vector using a linear layer, as shown in
Figure 6b. Then, these results are fed to a normalization layer [
25], which has a residual connection [
25] to generate the output of the first sub-layer.
Each head in a self-attention encoder has three matrices: query ($W^Q$), key ($W^K$), and value ($W^V$), which are the parameters to be trained. When a self-attention encoder receives the feature vectors corresponding to the assets, the feature vectors are concatenated and multiplied by the matrices $W^Q$, $W^K$, and $W^V$ to obtain the matrices $Q$, $K$, and $V$, which are used to compute the output of the self-attention head using Equation (1) of [24]. That equation computes the amount of attention that has to be given to every asset when generating the output of each asset.
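A minimal NumPy sketch of that computation, the scaled dot-product attention of [24], $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T/\sqrt{d_k})\,V$, is shown below; the dimensions and random weight matrices are illustrative, not the trained parameters of our NN.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """X: (n_assets, d_model), one feature vector per asset."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # attention over assets
    return weights @ V

rng = np.random.default_rng(0)
n_assets, d_model, d_k = 5, 16, 8          # illustrative sizes
X = rng.normal(size=(n_assets, d_model))
out = attention_head(X, rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))   # shape (n_assets, d_k)
```

Because the attention weights for each asset sum to one over all assets, every output row mixes information from the whole market, which is the property the blending mechanism relies on.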
The second sub-layer of the self-attention encoder also has two blocks. The first block is a feed-forward NN with two layers. The number of nodes in each layer is 400. The second block is a normalization layer with a residual connection, which produces the final output of the self-attention encoder. We use three identical self-attention encoders placed in sequential order as the blending mechanism of our design. The output of the blending mechanism is a high-level feature vector for each asset, containing information about every asset in the market. The generated feature vectors individually go through an FC layer to output a single scalar for each asset. These scalar values are normalized by a softmax layer to obtain the asset weights to be held in the next period.
We use an actor-critic design for the agent to learn an optimal policy that maximizes cumulative rewards. Therefore, the NN has two outputs: actor and critic. The actor determines which action to take and the critic evaluates how good the action was. The value function consists of another FC layer that takes the output of the self-attention layers and produces a scalar for each asset. Then, the output of the value function is simply the average of the scalar values.
5.3. Training
Our agent was trained using the ‘synchronous’ version of the Asynchronous Advantage Actor-Critic (A3C) method [26], known as A2C and introduced by OpenAI.
https://openai.com/blog/baselines-acktr-a2c/ (accessed on 2 August 2021). Therefore, the NN has two parts: actor and critic. The actor, also known as the policy, receives environment states and outputs actions. The critic, on the other hand, predicts the total reward the agent will receive at the end of the process by following the actions given by that specific policy. The policy is represented by $\pi(a_t \mid s_t; \theta)$ and the value function by $V(s_t; \theta_v)$. In our implementation, both the actor and critic are combined into a single NN with two independent output layers. Hence, they are trained simultaneously by applying the Adam optimizer [27] to the cost function shown in Equation (8), where $s_t$ is the state, $a_t$ is the action taken in that state, $r_t$ is the reward received, $A$ is a variable named the advantage, and $R$ is the total discounted reward (see [26]). The values of the parameters used in the optimization process are listed in
Table 3. The values were chosen by comparing the performance of training sessions that used hyperparameter values in the ranges shown in
Table A1. We used 16 workers for data collection. Thus, in each iteration, we randomly draw 16 days from the training dataset to be the initial days of each trading session, and the workers execute trades for four days. Then, these data are used to compute the cost function, which in turn is used to improve both the policy and value function together.
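The advantage used in the cost function can be sketched as follows: the total discounted reward $R$ is computed by a backward recursion over each rollout, and $A = R - V(s)$. The discount factor, rewards, and critic values below are hypothetical; in practice the bootstrap value comes from the critic's estimate at the last state.

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Backward recursion R_t = r_t + gamma * R_{t+1} over one rollout."""
    R, out = bootstrap_value, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return out[::-1]

rewards = [1.0, 0.0, -0.5, 2.0]        # per-step profits (hypothetical)
values = [0.5, 0.4, 0.6, 0.3]          # critic estimates V(s_t) (hypothetical)
returns = discounted_returns(rewards, bootstrap_value=0.0)
advantages = [R - v for R, v in zip(returns, values)]  # A_t = R_t - V(s_t)
```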
5.4. Layer Analysis
In this subsection, we show how information flows through the NN layers and give an interpretation of the features computed by some of the layers. Recently, Makridakis et al. [28] offered a critique of the performance of machine learning approaches on time-series forecasting tasks, which are closely related to our application. They state that ML researchers tend to over-utilize resources on tasks where simpler statistical methods can produce better results. Therefore, to show the relevance of each component of our NN, we fed the NN with the data of four popular cryptocurrencies drawn randomly from the dataset and analyzed the results computed by different layers.
The input layer of the NN is a normalization layer. The result of the normalization operation is shown in
Figure 7. Both graphs are practically the same. However, an important difference is that, in the normalized data, most of the values corresponding to LINK are negative. This is useful as those values can be ruled out by the ReLU operation that takes place after the convolutional layers. Additionally, it has been shown that normalizing the inputs makes the training process faster and more stable [
29].
The next group of layers is a series of convolutional layers. These layers have two functions: feature extraction and data size reduction.
Figure 8 shows the features extracted by one of the convolutional masks of the first convolutional layer. As shown in the figure, this layer highlights assets that have significant positive changes in value, which is a typical convolution operation. We can see, as expected, that the features of the asset in green have been flattened out almost completely. These simple features are combined in deeper layers into abstract complex features which are given to the self-attention layers.
The final part is a sequence of self-attention layers, which choose the assets to be acquired in the subsequent step of the process. The results of the heads of the last self-attention layer are shown in
Figure 9. The figure shows that most of the heads in the last layer focused their attention on Bitcoin, Ethereum, and Litecoin, which are the cryptocurrencies that increased their value the most in the tested sample. This shows that the layer can process the features extracted by the convolutional layers and determine the assets with the highest chances of continuing an increasing tendency.