RL Short Survey & Trading Example

Tags
Q-learning
Policy Optimization
gym
Stable-Baselines3
Tianshou
Description
Introduction to basic reinforcement learning algorithms with an example pipeline of application in trading

0. Basic Idea

[Figure: the agent–environment interaction loop]

1. Environment

1.1 Create an “Ideal World”

Take trading as an example. In this setting, the "Environment" is an exchange and the Agent is a trader. When we construct the environment, we restrict the behaviors of agents just as exchanges restrict traders in real life. For example, in the A-share market, exchanges forbid short selling and set margin levels, etc. We can quantify these restrictions and simulate a trading environment based on our assumptions.
Besides, the "Ideal World" also contains the interaction rules between the agent and the environment. For example, exchanges record the balance of traders' accounts, and whenever there is a transaction, the balance changes accordingly.
In the real world, exchanges let different types of traders enter the market (arbitragers, speculators, hedgers, simpletons, etc.) as long as they obey the trading rules and laws. In the environment module, we do not care about the agent type either, even though our goal is to obtain a smart agent after training. We are just setting up an objective world based on our assumptions.

1.2 Quantify the Environment

gym is a package that helps us turn our "Ideal World" into code. The package has two main functions.
Firstly, it is a kind of "dataset" containing many environments (Atari, MuJoCo, Toy Text, Classic Control, Box2D, etc.). It acts as an arena for the different algorithms used to train agents, just like the ImageNet and MNIST datasets. Take the official LunarLander-v2 example:
gym demo: LunarLander-v2
# gym official example
import gym

env = gym.make("LunarLander-v2", render_mode="human")
observation, info = env.reset(seed=42)

for _ in range(1000):
    action = policy(observation)  # User-defined policy function
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
The scores of these environments provide a measure for agents trained by different algorithms and thus give us a reference point for comparing the algorithms.
Secondly, it provides the API (class gym.Env) to customize our own environments. The API is user-friendly and compatible with other deep learning packages (PyTorch, TensorFlow, Stable-Baselines3, Tianshou, etc.). It is quite easy to set up our "Ideal World" using gym.Env by completing four functions:
class CustomEnvironment(Env):
    def __init__(self, df):
        ...  # input data, inner attributes, etc.
        self.action_space = ...       # requisite: how agents can behave
        self.observation_space = ...  # requisite: what information agents can observe (provided by the environment)

    def step(self, action):
        ...  # interaction between agents and the environment, like the settlement process between traders and exchanges
        return observation, reward, done, {}  # new information emitted by the environment, the reward (requisite), whether the episode ends (one round game over), and other information if any

    def reset(self):
        ...  # initialize the environment, like an exchange clearing all records of existing transactions
        return observation

    def render(self, mode='human', close=False):
        ...  # print the information of each step when called; in Atari games this draws the game screen
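As a concrete illustration, here is a minimal sketch of a single-asset trading environment built on this API. Everything in it (the name TradingEnv, the 'Close' price column, the three discrete actions, the reward definition) is an assumption made for the example, not part of any library; a realistic environment would add transaction costs, position limits and richer observations.

import gym
import numpy as np
from gym import spaces


class TradingEnv(gym.Env):
    """Toy single-asset trading environment (illustrative sketch)."""

    def __init__(self, df, initial_balance=10_000):
        super().__init__()
        self.df = df.reset_index(drop=True)        # assumes a pandas DataFrame with a 'Close' column
        self.initial_balance = initial_balance
        self.action_space = spaces.Discrete(3)     # 0 = hold, 1 = buy one unit, 2 = sell one unit
        self.observation_space = spaces.Box(       # observation = [price, position, balance]
            low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)

    def _obs(self):
        price = self.df.loc[self.t, "Close"]
        return np.array([price, self.position, self.balance], dtype=np.float32)

    def reset(self):
        self.t = 0
        self.position = 0.0
        self.balance = float(self.initial_balance)
        return self._obs()

    def step(self, action):
        price = self.df.loc[self.t, "Close"]
        if action == 1:                            # buy one unit at the current price
            self.position += 1
            self.balance -= price
        elif action == 2 and self.position > 0:    # sell one unit if we hold any (no short selling, as in A-shares)
            self.position -= 1
            self.balance += price
        self.t += 1
        done = self.t >= len(self.df) - 1
        net_worth = self.balance + self.position * self.df.loc[self.t, "Close"]
        reward = net_worth - self.initial_balance  # reward: profit relative to the starting balance (a design choice)
        return self._obs(), reward, done, {}

    def render(self, mode="human", close=False):
        print(f"t={self.t}, position={self.position}, balance={self.balance:.2f}")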

2. Agent

2.1 An Intelligent Participant in the “Ideal World”

Take trading as an example: whether simpletons or experienced fund managers, they can all access market information (observations such as stock prices, breaking news, fundamental data, etc.) and the rewards of their actions (transactions with the exchange). The difference between the simpleton and the experienced fund manager, however, is how they make use of this information (observations & rewards). In other words, they have different policies $\pi$ instructing them which actions to take (which transactions to make with the exchange) after gathering the information. A policy can be deterministic, $a = \pi(s)$, or stochastic, $a \sim \pi(\cdot \mid s)$. Just as experienced market makers do not act in a fixed pattern that would expose their market-making strategy, we often use stochastic policies to get robust agents.
A simpleton may neither reflect on the current situation nor evaluate the actions it is taking, while an experienced fund manager always does. We can quantify this reflection and evaluation using value functions.
Notations:
  • $s_t$: state at time $t$
  • $a_t$: action taken at time $t$
  • $r_t$: reward of $(s_t, a_t)$
Definitions:
  • Discounted return: $U_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^n r_{t+n}$, where $n$ can be finite or infinite and $\gamma \in (0, 1]$ is the discount rate
  • Action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}\left[U_t \mid s_t, a_t\right]$, where $\pi$ is the policy used
    • $Q_\pi(s_t, a_t)$ is the expected discounted return when taking action $a_t$ in state $s_t$
    • For policy $\pi$, $Q_\pi(s_t, a_t)$ evaluates how good it is for an agent to pick action $a_t$ while being in state $s_t$
  • State-value function: $V_\pi(s_t) = \mathbb{E}_{A \sim \pi}\left[Q_\pi(s_t, A)\right] = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$ (discrete action space) or $\int \pi(a \mid s_t)\, Q_\pi(s_t, a)\, da$ (continuous action space)
    • $V_\pi(s_t)$ is the expected discounted return with respect to the actions, given state $s_t$
    • For a fixed policy $\pi$, $V_\pi(s_t)$ evaluates how good the situation is in state $s_t$
    • $\mathbb{E}_S\left[V_\pi(S)\right]$ evaluates how good the policy $\pi$ is
  • Optimal action-value function: $Q^*(s_t, a_t) = \max_\pi Q_\pi(s_t, a_t)$
    • The expected discounted return of taking action $a_t$ in state $s_t$ under the optimal policy in the environment
  • Optimal state-value function: $V^*(s_t) = \max_\pi V_\pi(s_t)$
    • The expected discounted return in state $s_t$ under the optimal policy in the environment

2.2 How Intelligent Agents Act

Firstly, suppose we have a good policy $\pi(a \mid s)$. Then once the intelligent agent observes state $s_t$, it can randomly sample an action $a_t \sim \pi(\cdot \mid s_t)$.
Secondly, suppose we know the optimal action-value function $Q^*(s, a)$. Then once we observe state $s_t$, we choose the action that maximizes the value: $a_t = \arg\max_a Q^*(s_t, a)$.

2.3 How Agents Become Intelligent

The above two ways that intelligent agents act imply the two main families of model-free RL algorithms: Policy Optimization and Q-Learning. More broadly, we can classify RL algorithms as below.
OpenAI Spinning Up: A Taxonomy of RL Algorithms
Model-Free v.s. Model-Based
The difference between the training processes of these two kinds of algorithms is whether there is an inner model simulating the environment. Model-based RL can estimate the observations & rewards of actions without even taking them, thanks to the inner model. The inner model can be learned or given (as in AlphaZero). When the inner model is learnable, the real experience (observations & rewards) also updates the model at each step.
Model-free RL and model-based RL have different training paths. The space explored in one episode (one round of the game) by model-free RL (blue line) and model-based RL (blue line plus red lines) is shown below.
State space explored in one episode
Model-free RL is comparatively straightforward because there are no models or simulations: it uses the actual environment instead of model predictions.
Model-based RL is more sample efficient, which can reduce the chance of damaging hardware in some application scenarios. However, its training process has a higher computation cost.

3. Vanilla Model-Free RL Algorithms

3.1 Q-Learning

3.1.1 Gradient Descent
In this family of algorithms, we try to approximate $Q^*(s, a)$ through training. Once we have it, we can choose the action that maximizes it at execution time, i.e. $a_t = \arg\max_a Q^*(s_t, a)$.
The core idea of updating our estimate $Q$ during training is the equation below, which is one kind of TD (temporal difference) algorithm:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$
where $\alpha$ is the learning rate (a hyper-parameter) and $\gamma$ is the discount rate.
The equation is, in a sense, applying the idea of "gradient descent". The prediction from our model is $Q(s_t, a_t)$ and the target (label) is $y_t = r_t + \gamma \max_a Q(s_{t+1}, a)$, which is called the TD target. The reason why it can be viewed as a "label" is that it contains $r_t$, an observed value instead of an approximation, which makes the TD target closer to the real value than $Q(s_t, a_t)$. The TD target comes from the Bellman equation
$$Q^*(s_t, a_t) = \mathbb{E}\left[ R_t + \gamma \max_a Q^*(S_{t+1}, a) \,\middle|\, s_t, a_t \right].$$
The Bellman equation is based on the Markov process assumption: the discounted return satisfies $U_t = R_t + \gamma U_{t+1}$, and the distribution of $(R_t, S_{t+1})$ depends only on the current state and action $(s_t, a_t)$, which implies
$$Q^*(s_t, a_t) \approx r_t + \gamma \max_a Q^*(s_{t+1}, a).$$
The last equation uses a Monte Carlo approximation of the expectation with the single observed transition $(s_t, a_t, r_t, s_{t+1})$.
When we use the L2 loss function $L = \frac{1}{2}\left[ Q(s_t, a_t) - y_t \right]^2$, where $y_t$ is the label, the gradient is $\frac{\partial L}{\partial Q(s_t, a_t)} = Q(s_t, a_t) - y_t$. Following the gradient descent method, we recover the update equation above.
As for the concrete form of Q-learning, there are two specific versions. First, if the state space (observation space) and the action space are both discrete and can be listed in a table, we can use the tabular version. Second, if the state space or action space is continuous (or too large to tabulate), we can construct a network to approximate $Q^*$. The network we construct is called a Deep Q Network (DQN).
3.1.2 Tabular Version
Since $Q^*(s, a)$ is a function with only two inputs, we can draw a table and list all possible states and actions in it.
👨🏻‍💻
Pseudocode: Tabular Version Q-learning
Algorithm parameters: step size $\alpha \in (0, 1]$, small $\varepsilon > 0$
Initialize $Q(s, a)$, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, arbitrarily except that $Q(\text{terminal}, \cdot) = 0$
Loop for each episode:
  Initialize $S$
  Loop for each step of episode:
    Choose $A$ from $S$ using a policy derived from $Q$ (e.g. $\varepsilon$-greedy)
    Take action $A$, observe $R, S'$
    $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$
    $S \leftarrow S'$
  until $S$ is terminal
Wikipedia: Tabular version Q-Learning training process
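A minimal sketch of the tabular algorithm above in Python, assuming a gym environment with discrete observation and action spaces ("FrozenLake-v1" here) and the newer gym API shown earlier; the hyper-parameters and the number of episodes are illustrative.

import gym
import numpy as np

env = gym.make("FrozenLake-v1")
alpha, gamma, epsilon = 0.1, 0.99, 0.1              # step size, discount rate, exploration rate
Q = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(5000):
    s, info = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy derived from Q
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, info = env.step(a)
        done = terminated or truncated
        # TD update: push Q(S, A) towards the TD target R + gamma * max_a Q(S', a)
        td_target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next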
3.1.3 Deep Q Network (DQN)
Denote the network we construct by $Q(s, a; \mathbf{w})$, which is parameterized by $\mathbf{w}$. The output is the estimated optimal score of taking action $a$ in state $s$. A Mario game example is shown below, where the DQN is a convolutional network because the inputs are images. We should design specific networks for different problems; for example, RNN and LSTM architectures are better choices for time-series inputs.
DQN: Mario game example
At this time the TD target is
$$y_t = r_t + \gamma \max_a Q(s_{t+1}, a; \mathbf{w}),$$
and the loss function is
$$L_t = \frac{1}{2}\left[ Q(s_t, a_t; \mathbf{w}) - y_t \right]^2.$$
Using the gradient descent method, we update the parameters by
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \left[ Q(s_t, a_t; \mathbf{w}) - y_t \right] \cdot \nabla_{\mathbf{w}} Q(s_t, a_t; \mathbf{w}).$$
👨🏻‍💻
Pseudocode: DQN
Algorithm parameters: step size $\alpha$, small $\varepsilon > 0$
Initialize $Q(s, a; \mathbf{w})$ with random parameters $\mathbf{w}$
Loop for each episode:
  Initialize $s_t$
  Loop for each step of episode:
    Choose $a_t$ from $s_t$ using a policy derived from $Q(\cdot\,; \mathbf{w})$ (e.g. $\varepsilon$-greedy)
    Predict the value: $q_t = Q(s_t, a_t; \mathbf{w})$
    Differentiate the value network: $\mathbf{d}_t = \nabla_{\mathbf{w}} Q(s_t, a_t; \mathbf{w})$
    Take action $a_t$, observe $r_t, s_{t+1}$
    Calculate TD target: $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; \mathbf{w})$
    Gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha\, (q_t - y_t)\, \mathbf{d}_t$
  until $s_{t+1}$ is terminal
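A sketch of one DQN gradient step in PyTorch, mirroring the pseudocode above. The network shape, the optimizer and the variable names are assumptions for illustration only, and the step is shown for a single transition rather than a batch.

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # maps a state to Q values of all actions (assumed sizes)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s_next, done):
    """One TD step on a single transition (s, a, r, s_next, done)."""
    q_t = q_net(s)[a]                                        # prediction Q(s_t, a_t; w)
    with torch.no_grad():                                     # the TD target is treated as a fixed label
        y_t = r + gamma * q_net(s_next).max() * (1.0 - done)
    loss = 0.5 * (q_t - y_t) ** 2                             # L2 loss between prediction and TD target
    optimizer.zero_grad()
    loss.backward()                                           # gradient = (q_t - y_t) * dQ/dw
    optimizer.step()                                          # parameter update (here via Adam instead of plain SGD)

The function would be called inside the episode loop for every observed transition; in practice the transitions usually come from a replay buffer (Section 4.1).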

3.2 Policy Optimization

3.2.1 Gradient Ascent
In this family of algorithms, we try to approximate the policy $\pi(a \mid s)$ through training. Once we have it, we can randomly draw an action $a_t \sim \pi(\cdot \mid s_t)$. A Mario game example is shown below. The final layer is a softmax, which makes the outputs sum to 1.
Policy Network: Mario game example
Denote the network we construct to approximate $\pi(a \mid s)$ by $\pi(a \mid s; \boldsymbol{\theta})$. Recall the state-value function $V_\pi(s_t) = \sum_a \pi(a \mid s_t; \boldsymbol{\theta})\, Q_\pi(s_t, a)$. We do not know $Q_\pi$ either. There are two methods to estimate it: the Reinforce method and the Actor-Critic method.
3.2.2 Reinforce Method
It means that we do not update the network until we have gone through one episode (completed one round of the game). In this case, $Q_\pi$ is not regarded as a function parameterized by weights but is determined by the observed trajectory. This is why it is called the Reinforce (Monte Carlo policy gradient) method.
Assume one trajectory is $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T$, where each $a_t$ is drawn from the policy network and the state transitions and rewards are determined by the environment. Then we can use the observed discounted return $u_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ as an approximation of $Q_\pi(s_t, a_t)$. Differentiating $V_\pi$ w.r.t. $\boldsymbol{\theta}$, we get the policy gradient
$$\frac{\partial V_\pi(s_t)}{\partial \boldsymbol{\theta}} = \mathbb{E}_{A \sim \pi}\left[ Q_\pi(s_t, A)\, \frac{\partial \ln \pi(A \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right] \approx u_t \cdot \frac{\partial \ln \pi(a_t \mid s_t; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}.$$
The last step uses a Monte Carlo approximation, where $a_t$ is randomly sampled according to the PDF $\pi(\cdot \mid s_t; \boldsymbol{\theta})$, so $u_t \cdot \nabla_{\boldsymbol{\theta}} \ln \pi(a_t \mid s_t; \boldsymbol{\theta})$ is an unbiased estimate of the policy gradient.
👨🏻‍💻
Pseudocode: Reinforce Algorithm (Monte-Carlo Policy-Gradient Control, episodic)
Input: a differentiable policy parameterization $\pi(a \mid s; \boldsymbol{\theta})$
Algorithm parameter: step size $\beta > 0$
Initialize policy parameter $\boldsymbol{\theta}$
Loop for each episode:
  Generate an episode $s_1, a_1, r_1, \ldots, s_T, a_T, r_T$, following $\pi(\cdot \mid \cdot\,; \boldsymbol{\theta})$
  Loop for each step of episode $t = 1, \ldots, T$:
    $u_t \leftarrow \sum_{k=t}^{T} \gamma^{k-t} r_k$
    $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta\, u_t\, \nabla_{\boldsymbol{\theta}} \ln \pi(a_t \mid s_t; \boldsymbol{\theta})$
The algorithm tries to make $V_\pi$ bigger by gradient ascent; a bigger $\mathbb{E}_S[V_\pi(S)]$ means the agent performs better.
Besides, there is an empirical trick to train the policy network efficiently. Using a baseline $b$ that does not vary with the action, the expected value of the update is unchanged while its variance can be smaller, so the updating algorithm converges faster. The baseline can be learnable or not. Denoting $\delta_t = u_t - b$ and replacing $u_t$ with $\delta_t$ yields the Reinforce method with baseline.
3.2.3 Actor-Critic Method
It means we update the networks as the episode progresses, and the value function is also a learnable network. That is, we have a policy network $\pi(a \mid s; \boldsymbol{\theta})$ (actor) and a value network $q(s, a; \mathbf{w})$ (critic), where $\boldsymbol{\theta}$ and $\mathbf{w}$ are their respective parameters. In each step during one episode, we have two training tasks:
  • Update the policy network $\pi(a \mid s; \boldsymbol{\theta})$ to increase the state value $V_\pi(s)$: this makes the actor gradually perform better. Note that the supervision comes purely from the value network (critic), which may not be the ground truth.
  • Update the value network $q(s, a; \mathbf{w})$ to better estimate the return: this makes the critic's judgement more accurate. The supervision comes purely from the rewards observed from the environment.
The training process is shown as below
Actor-Critic Algorithm
👨🏻‍💻
Pseudocode: A2C
Input: a differentiable policy parameterization $\pi(a \mid s; \boldsymbol{\theta})$, a differentiable value parameterization $q(s, a; \mathbf{w})$
Algorithm parameters: step sizes $\alpha, \beta > 0$
Initialize policy parameter $\boldsymbol{\theta}$ and value parameter $\mathbf{w}$
Loop for each episode:
  Initialize $s_t$
  Loop for each step of episode:
    Sample $a_t$ randomly according to $\pi(\cdot \mid s_t; \boldsymbol{\theta})$
    Take action $a_t$, observe $r_t, s_{t+1}$
    Sample $\tilde{a}_{t+1}$ randomly according to $\pi(\cdot \mid s_{t+1}; \boldsymbol{\theta})$ (do not perform $\tilde{a}_{t+1}$)
    Evaluate the value network: $q_t = q(s_t, a_t; \mathbf{w})$ and $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w})$
    Compute TD error: $\delta_t = q_t - (r_t + \gamma\, q_{t+1})$ (the TD error acts as the baseline-corrected weight — A2C)
    Differentiate value network: $\mathbf{d}_{\mathbf{w}, t} = \nabla_{\mathbf{w}}\, q(s_t, a_t; \mathbf{w})$
    Update value network: $\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \delta_t\, \mathbf{d}_{\mathbf{w}, t}$
    Differentiate policy network: $\mathbf{d}_{\boldsymbol{\theta}, t} = \nabla_{\boldsymbol{\theta}} \ln \pi(a_t \mid s_t; \boldsymbol{\theta})$
    Update policy network: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta\, \delta_t\, \mathbf{d}_{\boldsymbol{\theta}, t}$
  until $s_{t+1}$ is terminal

3.3 On-Policy v.s. Off-Policy

There is another classification that divides deep reinforcement learning algorithms into two categories: on-policy algorithms and off-policy algorithms.
As shown in the vanilla model-free algorithms above, at each step during one episode the agent chooses an action according to a policy. This policy is called the behavior policy. For example, DQN's behavior policy is the $\varepsilon$-greedy policy: the agent's actions start off mostly random and slowly become more greedy and exploitative over time (not greedy all the time). When calculating the action value inside the TD target, the agent imagines a next action according to a policy. This policy is called the update (or target) policy. For DQN, the update policy is purely greedy: it assumes we always choose the action we believe will net us the most reward (greedy all the time).
On-policy algorithms are those whose behavior policy is the same as their update policy; off-policy algorithms are those whose behavior policy is different from their update policy. Therefore DQN is an off-policy algorithm, while the policy optimization algorithms above are on-policy algorithms.

4. Further Issues and Advanced Models

4.1 Experience Replay

Recall that in all the training algorithms above, the data used in each step (during one episode) is no more than the tuple $(s_t, a_t, r_t, s_{t+1})$, which is defined as a transition. It can be viewed as a single training sample, like one photo in a CV task.
For the vanilla model-free algorithms, there is a waste of experience because each transition is used only once and then discarded. What's worse, consecutive transitions are strongly correlated ($s_{t+1}$ depends on $s_t$), and feeding them to the algorithm sequentially is harmful to the training process.
Experience replay is a trick which exploits a replay buffer storing transitions as below
Replay Buffer
During training, we can randomly sample transitions from the replay buffer to calculate the TD error and apply gradient descent.
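A minimal replay buffer sketch; the capacity and batch size are arbitrary choices for illustration.

import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s_next, done) and samples them uniformly at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # the oldest transitions are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)       # transpose the list of transitions into per-field batches
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)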

4.2 Prioritized Experience Replay

Actually, not all transitions are equally important. Take the Mario game for example: transitions like the one below left are very common, while transitions like the one below right are very rare. If both are sampled at the same rate, the agent we train will most likely fail to handle the rare scenario on the right. Therefore, we should use priority-based sampling rather than uniform sampling.
Mario Transitions with Different Importances
  • Option 1: sampling probability $p_t \propto |\delta_t| + \epsilon$, where $\delta_t$ is the TD error of the $t$-th transition and $\epsilon$ is a small constant
  • Option 2: sampling probability $p_t \propto \frac{1}{\mathrm{rank}(t)}$
    • Transitions are sorted so that $|\delta_t|$ is in descending order
    • $\mathrm{rank}(t)$ is the rank of the $t$-th transition
In sum, transitions with a big $|\delta_t|$ shall be given high priority.
Besides, we should scale down the learning rate for frequently sampled transitions (e.g. in proportion to $(n\, p_t)^{-\beta}$) to correct the bias introduced by non-uniform sampling. If a transition is newly collected and its $\delta_t$ is unknown, we simply set its $\delta_t$ to the maximum and give it the highest priority.
The prioritized experience replay is shown as below.
Prioritized Experience Replay Summary

4.3 Overestimation, Target Network, Double DQN

4.3.1 Overestimation
TD learning makes DQN overestimate action values. There are two reasons for it: Maximization and Bootstrapping.
Maximization
Let $x_1, \ldots, x_n$ be observed real numbers. Add zero-mean random noise to them and obtain $\hat{x}_1, \ldots, \hat{x}_n$. The zero-mean noise does not affect the mean: $\mathbb{E}\left[\mathrm{mean}_i(\hat{x}_i)\right] = \mathrm{mean}_i(x_i)$. But it increases the maximum, $\mathbb{E}\left[\max_i(\hat{x}_i)\right] \ge \max_i(x_i)$, and decreases the minimum, $\mathbb{E}\left[\min_i(\hat{x}_i)\right] \le \min_i(x_i)$.
As for action values in DQN, let $x(a_1), \ldots, x(a_n)$ be the true action values. The noisy estimates made by the DQN are $Q(s, a_1; \mathbf{w}), \ldots, Q(s, a_n; \mathbf{w})$; suppose they are unbiased, i.e. $\mathrm{mean}_a\, x(a) = \mathrm{mean}_a\, Q(s, a; \mathbf{w})$. However, the maximum is typically an overestimation: $\mathbb{E}\left[\max_a Q(s, a; \mathbf{w})\right] \ge \max_a x(a)$.
Therefore, $\max_a Q(s_{t+1}, a; \mathbf{w})$ is an overestimation of the true action value at time $t+1$, and the TD target $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; \mathbf{w})$ is thereby an overestimation. Since TD learning pushes $Q(s_t, a_t; \mathbf{w})$ towards $y_t$, which overestimates the true action value, DQN's estimates end up higher than the true values.
Bootstrapping
In RL, bootstrapping means using an estimated value in the update step for the same kind of estimated value. TD learning performs bootstrapping because the TD target partly uses $Q(s_{t+1}, a; \mathbf{w})$, which is itself an overestimation. When $y_t$ is used for updating $Q(s_t, a_t; \mathbf{w})$, the overestimation is propagated back into the DQN.
Maximization and bootstrapping together make DQN overestimate action values. What's worse, the overestimation is non-uniform (otherwise it would not be a problem). Whenever the transition $(s_t, a_t, r_t, s_{t+1})$ is used to update $Q(s_t, a_t; \mathbf{w})$, the TD target overestimates the true value and TD learning pushes $Q(s_t, a_t; \mathbf{w})$ towards it. The more frequently $(s_t, a_t)$ appears in the replay buffer, the worse $Q(s_t, a_t; \mathbf{w})$ overestimates its true action value.
4.3.2 Target Network
The target network is proposed to avoid bootstrapping. It has the same network structure as the DQN but different parameters: $\mathbf{w}^- \ne \mathbf{w}$.
During training, we use $Q(s, a; \mathbf{w})$ to control the agent and collect experience $(s_t, a_t, r_t, s_{t+1})$, and use $Q(s, a; \mathbf{w}^-)$ to compute the TD target: $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; \mathbf{w}^-)$. Note that in each step during one episode, we only update $\mathbf{w}$. There are two common ways to update the target network periodically: copy the weights directly, $\mathbf{w}^- \leftarrow \mathbf{w}$, every fixed number of steps, or use a soft update, $\mathbf{w}^- \leftarrow \tau\, \mathbf{w} + (1 - \tau)\, \mathbf{w}^-$ with $\tau \in (0, 1)$.
However, target network cannot eliminate maximization.
4.3.3 Double DQN
When calculating the TD target, we can divide it into two steps: selecting the action and evaluating its action value.
| Algorithm | Action selection | Evaluation |
| --- | --- | --- |
| Naive DQN | Use DQN: $a^* = \arg\max_a Q(s_{t+1}, a; \mathbf{w})$ | Use DQN: $y_t = r_t + \gamma\, Q(s_{t+1}, a^*; \mathbf{w})$ |
| Target Network | Use target network: $a^* = \arg\max_a Q(s_{t+1}, a; \mathbf{w}^-)$ | Use target network: $y_t = r_t + \gamma\, Q(s_{t+1}, a^*; \mathbf{w}^-)$ |
| Double DQN | Use DQN: $a^* = \arg\max_a Q(s_{t+1}, a; \mathbf{w})$ | Use target network: $y_t = r_t + \gamma\, Q(s_{t+1}, a^*; \mathbf{w}^-)$ |
Double DQN is the best among the three, though overestimation still happens. Double DQN alleviates overestimation because the action selected by the DQN is generally not the maximizer under the target network, so $Q(s_{t+1}, a^*; \mathbf{w}^-) \le \max_a Q(s_{t+1}, a; \mathbf{w}^-)$: Double DQN's TD target is no larger (no more overestimated) than the target-network TD target.
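The three TD targets in the table can be written side by side. The sketch below assumes q_net and target_net are PyTorch modules mapping a batch of states to a vector of action values, and that r, done are float tensors of shape (batch,); the names are illustrative.

import torch

def td_targets(q_net, target_net, r, s_next, done, gamma=0.99):
    """Compute the naive, target-network and Double DQN TD targets for a batch (illustrative sketch)."""
    with torch.no_grad():
        q_next = q_net(s_next)                     # Q(s_{t+1}, ., w)
        q_next_tgt = target_net(s_next)            # Q(s_{t+1}, ., w-)
        mask = 1.0 - done                          # zero out the bootstrap term at terminal states

        # Naive DQN: select and evaluate with the same network w
        y_naive = r + gamma * q_next.max(dim=1).values * mask

        # Target network: select and evaluate with w-
        y_target = r + gamma * q_next_tgt.max(dim=1).values * mask

        # Double DQN: select with w, evaluate with w-
        a_star = q_next.argmax(dim=1, keepdim=True)
        y_double = r + gamma * q_next_tgt.gather(1, a_star).squeeze(1) * mask
    return y_naive, y_target, y_double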

4.4 Dueling Network

The dueling network changes the architecture of DQN to train a more robust agent. Define the optimal advantage function $A^*(s, a) = Q^*(s, a) - V^*(s)$. Since $V^*(s) = \max_a Q^*(s, a)$, it implies that $\max_a A^*(s, a) = 0$.
Thus
$$Q^*(s, a) = V^*(s) + A^*(s, a) - \max_a A^*(s, a).$$
In the new architecture, we approximate $A^*(s, a)$ by a neural network $A(s, a; \mathbf{w}^A)$ and $V^*(s)$ by a neural network $V(s; \mathbf{w}^V)$. Thus we approximate $Q^*(s, a)$ by the dueling network
$$Q(s, a; \mathbf{w}) = V(s; \mathbf{w}^V) + A(s, a; \mathbf{w}^A) - \max_a A(s, a; \mathbf{w}^A),$$
which has two sets of parameters, $\mathbf{w}^A$ and $\mathbf{w}^V$, jointly denoted by $\mathbf{w}$.
A dueling network architecture for Mario game example is shown as below
Dueling Network for Mario Game
The learning process of the dueling network is the same as for other DQNs, and the same tricks apply: prioritized experience replay, Double DQN, multi-step TD targets, etc.
Although the term $\max_a A^*(s, a)$ in the equation above equals zero, keeping it in the network helps overcome non-identifiability ($V$ and $A$ could otherwise shift by opposite constants without changing $Q$), so the dueling network has more robust performance. In practice, we can also replace $\max_a A(s, a; \mathbf{w}^A)$ with $\mathrm{mean}_a A(s, a; \mathbf{w}^A)$, which yields better performance in experiments.
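A dueling head in PyTorch, using the mean-subtracted variant mentioned above; the layer sizes are arbitrary.

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), with a shared feature extractor."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)               # approximates V*(s)
        self.advantage_head = nn.Linear(hidden, n_actions)   # approximates A*(s, a)

    def forward(self, s):
        h = self.features(s)
        v = self.value_head(h)                               # shape (batch, 1)
        a = self.advantage_head(h)                           # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)           # broadcast V over actions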

4.5 Deterministic Policy Gradient (DPG)

The model-free algorithms discussed above, however, all assume a discrete action space: the networks' outputs are finite. For a continuous action space, we can discretize the continuous dimensions, but this leads to a combinatorial explosion that makes models hard to converge. For example, the robotic arm below has two continuous degrees of freedom; if we divide each dimension into ten discrete parts (which obviously cannot perform well), there are already $10^2 = 100$ discrete actions, and the number grows exponentially with the degrees of freedom.
Robotic arm which has two continuous dimensions
DPG (or DDPG: Deep Deterministic Policy Gradient) is one kind of Actor-Critic method. The architecture is shown below
DPG network architecture
Unlike the actor-critic algorithms mentioned above, DPG's policy network (actor) is deterministic: its output is a specific action $a = \pi(s; \boldsymbol{\theta})$ instead of a probability distribution. The value network (critic) has two inputs, $(s, a)$; its output is a scalar evaluating how good the action is.
👨🏻‍💻
Pseudocode: DPG
Input: a differentiable deterministic policy parameterization $\pi(s; \boldsymbol{\theta})$, a differentiable value parameterization $q(s, a; \mathbf{w})$
Algorithm parameters: step sizes $\alpha, \beta > 0$
Initialize policy parameter $\boldsymbol{\theta}$ and value parameter $\mathbf{w}$
Loop for each episode:
  Initialize $s_t$
  Loop for each step of episode:
    Take action $a_t = \pi(s_t; \boldsymbol{\theta})$, observe $r_t, s_{t+1}$
    Value network makes a prediction for time $t$: $q_t = q(s_t, a_t; \mathbf{w})$
    Value network makes a prediction for time $t+1$: $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w})$, where $\tilde{a}_{t+1} = \pi(s_{t+1}; \boldsymbol{\theta})$ but $\tilde{a}_{t+1}$ is not taken
    Compute TD error: $\delta_t = q_t - (r_t + \gamma\, q_{t+1})$
    Differentiate value network: $\mathbf{d}_{\mathbf{w}, t} = \nabla_{\mathbf{w}}\, q(s_t, a_t; \mathbf{w})$
    Update value network: $\mathbf{w} \leftarrow \mathbf{w} - \alpha\, \delta_t\, \mathbf{d}_{\mathbf{w}, t}$
    Calculate the deterministic policy gradient: $\mathbf{g}_t = \nabla_{\boldsymbol{\theta}}\, \pi(s_t; \boldsymbol{\theta}) \cdot \nabla_a\, q(s_t, a; \mathbf{w})\big|_{a = \pi(s_t; \boldsymbol{\theta})}$
    Update policy network: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta\, \mathbf{g}_t$
  until $s_{t+1}$ is terminal
Target networks can be applied during training to mitigate bootstrapping. The target networks have the same structures as the original networks but different parameters, $\mathbf{w}^-$ and $\boldsymbol{\theta}^-$. In that case the value network still makes the prediction for time $t$, $q_t = q(s_t, a_t; \mathbf{w})$, while the target networks make the prediction for time $t+1$: $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \mathbf{w}^-)$ with $\tilde{a}_{t+1} = \pi(s_{t+1}; \boldsymbol{\theta}^-)$.
The parameters of the target networks are updated periodically, as in Section 4.3.2.
Besides, tricks like experience replay, multi-step TD targets, etc. can also be applied during the training of DPG.
Finally, a comparison between stochastic and deterministic policies is shown below.
Comparison between stochastic policy and deterministic policy

4.6 Stochastic Policy for Continuous Action Space

For a continuous action space, we can also design a stochastic policy network. Let the degree of freedom be $d$ (the robotic arm above has $d = 2$). Let $\boldsymbol{\mu}(s)$ and $\boldsymbol{\sigma}(s)$ be functions of the state $s$, and let $\mu_i$, $\sigma_i$ and $a_i$ be the $i$-th elements of $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$ and $a$ respectively. The policy function is then the PDF of a multivariate normal distribution with diagonal covariance:
$$\pi(a \mid s) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\, \sigma_i(s)} \exp\!\left( -\frac{\left(a_i - \mu_i(s)\right)^2}{2\, \sigma_i(s)^2} \right).$$
We use networks to approximate $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. For $\boldsymbol{\mu}$, we use $\boldsymbol{\mu}(s; \boldsymbol{\theta}^\mu)$. For $\boldsymbol{\sigma}$, denote the log variance $\rho_i = \ln \sigma_i^2$; approximating $\boldsymbol{\rho}$ with a network $\boldsymbol{\rho}(s; \boldsymbol{\theta}^\rho)$ instead of $\boldsymbol{\sigma}$ directly yields better performance in practice.
Once we have the well-trained networks $\boldsymbol{\mu}$ and $\boldsymbol{\rho}$, we can compute the mean and log variance when observing state $s_t$: $\hat{\boldsymbol{\mu}} = \boldsymbol{\mu}(s_t; \boldsymbol{\theta}^\mu)$ and $\hat{\boldsymbol{\rho}} = \boldsymbol{\rho}(s_t; \boldsymbol{\theta}^\rho)$.
Then compute $\hat{\sigma}_i^2 = \exp(\hat{\rho}_i)$. Finally, randomly sample the action by $a_i \sim \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2)$ for each dimension $i$ (a small sampling sketch is given at the end of this subsection).
To train the policy network, we introduce an auxiliary network (only used during training).
First, take the log of the policy:
$$\ln \pi(a \mid s; \boldsymbol{\theta}) = \sum_{i=1}^{d} \left[ -\ln \sigma_i - \frac{(a_i - \mu_i)^2}{2\, \sigma_i^2} \right] + \text{const} = \sum_{i=1}^{d} \left[ -\frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2\, e^{\rho_i}} \right] + \text{const}.$$
The auxiliary network $f(s, a; \boldsymbol{\theta})$ computes exactly this quantity from the outputs of the $\boldsymbol{\mu}$ and $\boldsymbol{\rho}$ networks.
The whole network architecture is shown below. Note that the layers before the dense layer should be designed according to the specific problem; this is only a simple example.
Stochastic Policy network for continuous action space: network architecture
From the definition, $\nabla_{\boldsymbol{\theta}} \ln \pi(a_t \mid s_t; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} f(s_t, a_t; \boldsymbol{\theta})$. Therefore,
  • In the Reinforce method: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta\, u_t\, \nabla_{\boldsymbol{\theta}} f(s_t, a_t; \boldsymbol{\theta})$
  • In the A2C method: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta\, \delta_t\, \nabla_{\boldsymbol{\theta}} f(s_t, a_t; \boldsymbol{\theta})$
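A sketch of sampling from such a Gaussian policy with torch.distributions. The two networks output the mean and the log variance $\rho$ directly; the state and action dimensions and the layer sizes are assumptions for illustration.

import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))   # mu(s; theta_mu)
rho_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))  # rho(s) = ln sigma^2

def sample_action(s):
    mu = mu_net(s)
    sigma = torch.exp(0.5 * rho_net(s))            # sigma = exp(rho / 2)
    dist = torch.distributions.Normal(mu, sigma)
    a = dist.sample()                              # a_i ~ N(mu_i, sigma_i^2), independently per dimension
    log_prob = dist.log_prob(a).sum(dim=-1)        # ln pi(a | s; theta), the quantity the auxiliary network represents
    return a, log_prob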

4.7 Trust Region Policy Optimization (TRPO)

The policy optimization algorithms mentioned above have two main problems. First, samples are not exploited efficiently: in both the Reinforce method and the Actor-Critic method, each transition is used only once to update the parameters of the policy network. Second, the convergence process is not smooth, because each transition updates the policy network only once and a bad transition has a larger influence than in traditional supervised learning, which makes the trained agents less robust. TRPO and PPO (below) are proposed to address these two problems.
Importance sampling lemma: for any PDFs $p$ and $q$ (with $q > 0$ wherever $p > 0$) and any function $f$,
$$\mathbb{E}_{x \sim p}\left[ f(x) \right] = \mathbb{E}_{x \sim q}\left[ \frac{p(x)}{q(x)}\, f(x) \right].$$
In policy optimization algorithms, the objective function is $J(\boldsymbol{\theta}) = \mathbb{E}_S\left[ V_\pi(S) \right]$, where $V_\pi(s) = \mathbb{E}_{A \sim \pi}\left[ Q_\pi(s, A) \right]$. $Q_\pi$ can also be replaced by the advantage $A_\pi$ if using policy optimization with a baseline. According to the importance sampling lemma,
$$J(\boldsymbol{\theta}) = \mathbb{E}_S\, \mathbb{E}_{A \sim \pi(\cdot \mid S;\, \boldsymbol{\theta}_{\text{old}})}\!\left[ \frac{\pi(A \mid S; \boldsymbol{\theta})}{\pi(A \mid S; \boldsymbol{\theta}_{\text{old}})}\, Q_\pi(S, A) \right].$$
Applying a Monte Carlo approximation, we can approximate $J(\boldsymbol{\theta})$ by
$$L(\boldsymbol{\theta} \mid \boldsymbol{\theta}_{\text{old}}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\pi(a_t \mid s_t; \boldsymbol{\theta})}{\pi(a_t \mid s_t; \boldsymbol{\theta}_{\text{old}})}\, u_t,$$
where $s_1, a_1, r_1, \ldots, s_T, a_T, r_T$ is one trajectory observed in one episode using $\pi(\cdot \mid \cdot\,; \boldsymbol{\theta}_{\text{old}})$ and $u_t$ is the observed discounted return.
Then we can update $\boldsymbol{\theta}$ using the trust region method:
$$\boldsymbol{\theta} \leftarrow \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta} \mid \boldsymbol{\theta}_{\text{old}}) \quad \text{s.t.} \quad d(\boldsymbol{\theta}, \boldsymbol{\theta}_{\text{old}}) < \Delta.$$
The constraint keeps the step size from being too large, where the distance $d$ can be
  • L2 distance: $d(\boldsymbol{\theta}, \boldsymbol{\theta}_{\text{old}}) = \lVert \boldsymbol{\theta} - \boldsymbol{\theta}_{\text{old}} \rVert_2^2$
  • KL divergence: $d(\boldsymbol{\theta}, \boldsymbol{\theta}_{\text{old}}) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{KL}\!\left( \pi(\cdot \mid s_t; \boldsymbol{\theta}_{\text{old}}) \,\middle\|\, \pi(\cdot \mid s_t; \boldsymbol{\theta}) \right)$
This restriction makes the convergence curve smoother, resulting in a more robust model. Besides, the constrained optimization takes many inner iterations to solve, which reuses one trajectory of samples multiple times (sample efficient). However, it incurs more computation cost during the optimization process; it's a trade-off.
👨🏻‍💻
Pseudocode: TRPO
Input: a differentiable policy parameterization $\pi(a \mid s; \boldsymbol{\theta})$
Algorithm parameter: trust-region radius $\Delta$
Initialize policy parameter $\boldsymbol{\theta}$
Loop for each episode:
  Generate an episode $s_1, a_1, r_1, \ldots, s_T, a_T, r_T$ following $\pi(\cdot \mid \cdot\,; \boldsymbol{\theta}_{\text{old}})$
  Calculate the discounted returns $u_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$
  Approximation: $L(\boldsymbol{\theta} \mid \boldsymbol{\theta}_{\text{old}}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\pi(a_t \mid s_t; \boldsymbol{\theta})}{\pi(a_t \mid s_t; \boldsymbol{\theta}_{\text{old}})}\, u_t$, where $\boldsymbol{\theta}_{\text{old}}$ is the parameter from the last episode
  Maximization (inner loop): $\boldsymbol{\theta} \leftarrow \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta} \mid \boldsymbol{\theta}_{\text{old}})$ s.t. $d(\boldsymbol{\theta}, \boldsymbol{\theta}_{\text{old}}) < \Delta$

4.8 Proximal Policy Optimization (PPO)

PPO is an adjustment of TRPO that is easier to compute but yields similar performance. The objective function of PPO (with an adaptive KL penalty) is
$$J_{\text{PPO}}(\boldsymbol{\theta}) = L(\boldsymbol{\theta} \mid \boldsymbol{\theta}_{\text{old}}) - \lambda\, \mathrm{KL}\!\left( \pi_{\boldsymbol{\theta}_{\text{old}}} \,\middle\|\, \pi_{\boldsymbol{\theta}} \right).$$
$\lambda$ is adjusted dynamically during training:
  • When the KL term grows larger than a preset maximum, increase $\lambda$
  • When the KL term falls below a preset minimum, decrease $\lambda$
PPO2 is another variant; its objective function is
$$J_{\text{PPO2}}(\boldsymbol{\theta}) = \sum_{t} \min\!\left( \frac{\pi(a_t \mid s_t; \boldsymbol{\theta})}{\pi(a_t \mid s_t; \boldsymbol{\theta}_{\text{old}})}\, A_t,\; \mathrm{clip}\!\left( \frac{\pi(a_t \mid s_t; \boldsymbol{\theta})}{\pi(a_t \mid s_t; \boldsymbol{\theta}_{\text{old}})},\, 1 - \epsilon,\, 1 + \epsilon \right) A_t \right),$$
where $A_t$ is the advantage from baseline policy optimization and $\mathrm{clip}(x, 1-\epsilon, 1+\epsilon)$ truncates the probability ratio to the interval $[1-\epsilon, 1+\epsilon]$. The figure below shows the clip function (left) and the resulting objective as a function of the ratio (right).
[Figure: the clip function (left) and the clipped PPO2 objective as a function of the probability ratio (right)]
👨🏻‍💻
Pseudocode: PPO
Input: a differentiable policy parameterization $\pi(a \mid s; \boldsymbol{\theta})$
Algorithm parameters: step size $\beta$, clip range $\epsilon$ (PPO2) or KL penalty weight $\lambda$ (PPO)
Initialize policy parameter $\boldsymbol{\theta}$
Loop for each episode:
  Generate an episode $s_1, a_1, r_1, \ldots, s_T, a_T, r_T$ following $\pi(\cdot \mid \cdot\,; \boldsymbol{\theta}_{\text{old}})$
  Calculate the advantages $A_t$
  Calculate $J_{\text{PPO}}(\boldsymbol{\theta})$ or $J_{\text{PPO2}}(\boldsymbol{\theta})$
  Gradient ascent: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta\, \nabla_{\boldsymbol{\theta}} J_{\text{PPO}}(\boldsymbol{\theta})$ or $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \beta\, \nabla_{\boldsymbol{\theta}} J_{\text{PPO2}}(\boldsymbol{\theta})$
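A sketch of the PPO2 (clipped) objective for a batch of transitions in PyTorch. Here log_prob_new and log_prob_old are the log probabilities under the current and the data-collecting policy, advantage is the estimated $A_t$, and epsilon is the clip range; all names are illustrative.

import torch

def ppo2_objective(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Clipped surrogate objective J_PPO2 (to be maximized; use its negative as a loss)."""
    ratio = torch.exp(log_prob_new - log_prob_old)                  # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return torch.min(unclipped, clipped).mean()                     # average over the batch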

5. Multi-Agent Reinforcement Learning

5.1 Settings and Convergence

When there is more than one agent in the system, the agents are usually not independent. Every agent's action can affect the next state, and thus every agent can affect all the other agents. Unless the agents are independent of each other, single-agent RL methods do not work well for MARL. There are four common types of settings:
  1. Fully cooperative, e.g. industrial robots.
  2. Fully competitive, e.g. predator and prey.
  3. Mixed cooperative & competitive, e.g. robotic soccer.
  4. Self-interested, e.g. automated trading systems.
When no agent can get a better expected return by changing only its own policy, the MARL system has converged. If there is only one agent, convergence means the objective function does not increase any more. If there are multiple agents, convergence means reaching a Nash equilibrium.

5.2 Architectures of MARL

There are three typical architectures of MARL:
  • Fully decentralized: every agent uses its own observations and rewards to learn its policy. Agents do not communicate.
  • Fully centralized: the agents send everything to a central controller, which makes decisions for all the agents.
  • Centralized training with decentralized execution: a central controller is used during training and is disabled after training.
In MARL, a single agent may or may not have full knowledge of the state $s$. Let $o^i$ be the $i$-th agent's observation. With partial observation, $o^i \ne s$; full observation means $o^1 = \cdots = o^n = s$.
5.2.1 Fully Decentralized
During training, each agent interacts with the environment independently, and agents do not share observations or actions. Take A2C for example: each agent has its own policy network (actor) $\pi(a^i \mid o^i; \boldsymbol{\theta}^i)$ and value network (critic) $q(o^i, a^i; \mathbf{w}^i)$. In this case, the training process is the same as in the single-agent setting.
Fully Decentralized Training
During execution, each agent randomly samples an action from its own policy network.
Fully Decentralized Execution
This does not work well because the architecture ignores the influence between agents.
5.2.2 Fully Centralized
The central controller holds the policy networks $\pi(a^i \mid \mathbf{o}; \boldsymbol{\theta}^i)$ of all agents, where $\mathbf{o} = (o^1, \ldots, o^n)$. During training, at each step in one episode, the central controller receives the observations and rewards of all agents and takes them as inputs. It then dispatches an action to each agent, which further interacts with the environment.
Fully Centralized Training
During execution, the central controller randomly samples actions from the trained policy networks and dispatches each agent's action.
Fully Centralized Execution
In this architecture, communication and synchronization cost time, so real-time decision making is impossible.
5.2.3 Centralized Training with Decentralized Execution
In this architecture, each agent has its own policy network (actor) $\pi(a^i \mid o^i; \boldsymbol{\theta}^i)$, while the central controller has $n$ value networks (critics) $q(\mathbf{o}, \mathbf{a}; \mathbf{w}^i)$, $i = 1, \ldots, n$. During training, the central controller knows all the agents' observations, actions and rewards.
Centralized Training 1
The $n$ critic networks in the central controller are specific to each agent. The central controller trains the critics $q(\mathbf{o}, \mathbf{a}; \mathbf{w}^i)$ for all $i = 1, \ldots, n$. To update $\mathbf{w}^i$, the TD algorithm takes as inputs:
  • All the actions: $\mathbf{a} = (a^1, \ldots, a^n)$
  • All the observations: $\mathbf{o} = (o^1, \ldots, o^n)$
  • The $i$-th reward: $r^i$
Centralized Training 2
Then each critic sends its value $q(\mathbf{o}, \mathbf{a}; \mathbf{w}^i)$ to its actor (the agent's policy network). Each agent locally trains its actor using the policy gradient with inputs $(o^i, a^i, q^i)$. The actors' training can be local because they do not need inputs from other agents.
Centralized Training 3
After training, the central controller is no longer needed. During execution, each agent interacts with the environment independently by randomly sampling actions from its policy network $\pi(a^i \mid o^i; \boldsymbol{\theta}^i)$.
Decentralized Execution
Comparisons among these three architectures are shown below
Comparisons among architectures
5.2.4 Parameters Sharing
In the architectures above, there are $n$ policy networks $\pi(a^i \mid o^i; \boldsymbol{\theta}^i)$ and $n$ value networks $q(\cdot\,; \mathbf{w}^i)$. Parameter sharing means $\boldsymbol{\theta}^i = \boldsymbol{\theta}^j$ and $\mathbf{w}^i = \mathbf{w}^j$ for some $i \ne j$.
Whether parameters should be shared depends on the application scenario. For example, soccer robots should not share parameters because they play different roles on the field, while Tesla self-driving cars can share parameters because each car is self-interested.

6. Pipeline of Trading Example

There are popular packages implementing DRL algorithms, such as Stable-Baselines3 and Tianshou. Below is an example of PPO applied to train a trading agent.
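A sketch of such a pipeline with Stable-Baselines3, reusing the illustrative TradingEnv from Section 1.2 (assumed to be saved as trading_env.py); the data file name, hyper-parameters and number of timesteps are placeholders, not recommendations.

import pandas as pd
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from trading_env import TradingEnv   # the illustrative environment sketched in Section 1.2 (placeholder module)

# 1. Build the environment from historical data
df = pd.read_csv("prices.csv")       # placeholder file with a 'Close' column
env = make_vec_env(lambda: TradingEnv(df), n_envs=4)

# 2. Train a PPO agent on the environment
model = PPO("MlpPolicy", env, learning_rate=3e-4, verbose=1)
model.learn(total_timesteps=100_000)
model.save("ppo_trader")

# 3. Evaluate the trained agent on one episode
eval_env = TradingEnv(df)
obs = eval_env.reset()
done = False
while not done:
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = eval_env.step(int(action))
    eval_env.render()

Tianshou offers an equivalent workflow with its own trainer classes; the overall pipeline (environment, algorithm, training loop, evaluation) stays the same.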

References

  1. Prof. Shusen Wang: Deep Reinforcement Learning, YouTube.
  2. Prof. Hung-yi Lee: DRL Lecture 2: Proximal Policy Optimization (PPO), YouTube.
  3. OpenAI: Spinning Up.
  4. Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, 2nd edition.
  5. OpenAI: Gym Documentation.
  6. DLR-RM: Stable-Baselines3 Docs.
  7. TSAIL (Tsinghua Machine Learning Group): Tianshou Docs.
  8. Adam King: Create custom gym environments from scratch — A stock market example.
