1. Model-Free
Value-based
- State Action Reward State-Action (SARSA) – 1994
- Q-learning = SARSA max – 1992 (the SARSA and Q-learning updates are sketched after this list)
- Deep Q Network (DQN) – 2013
- Double Deep Q Network (DDQN) – 2015
- Deep Recurrent Q Network (DRQN) – 2015
- Dueling Q Network – 2015
- Persistent Advantage Learning (PAL) – 2015
- Bootstrapped Deep Q Network – 2016
- Normalized Advantage Functions (NAF) = Continuous DQN – 2016
- N-Step Q Learning – 2016
- Noisy Deep Q Network (NoisyNet DQN) – 2017
- Deep Q-learning from Demonstrations (DQfD) – 2017
- Categorical Deep Q Network = Distributional Deep Q Network = C51 – 2017
- Rainbow – 2017
- Mixed Monte Carlo (MMC) – 2017
- Neural Episodic Control (NEC) – 2017
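
To make the two oldest entries concrete, below is a minimal tabular sketch of the SARSA and Q-learning update rules. The table sizes, learning rate, and discount factor are illustrative placeholders, not values taken from any of the papers above.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max-valued) action in the next state.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action the behaviour policy actually took next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Toy usage: a table of 5 states x 2 actions.
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=2, a=0, r=0.0, s_next=3, a_next=1)
```

The only difference is the bootstrap target: Q-learning backs up the value of the greedy next action (off-policy), while SARSA backs up the value of the action its own policy actually selected (on-policy).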
Policy-based
- Cross-Entropy Method (CEM) – 1999
- Policy Gradient
- REINFORCE = Vanilla Policy Gradient (VPG) – 1992 (see the gradient sketch after this list)
- Policy gradient softmax
- Natural policy gradient (NPG) – 2002
- Truncated Natural Policy Gradient (TNPG)
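
The core of this family is the likelihood-ratio (REINFORCE) gradient. Below is a minimal sketch for a linear softmax policy, which also illustrates the "policy gradient softmax" entry; the feature-based parameterization, step size, and the toy episode are assumptions made purely for illustration.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_gradient(theta, episode, gamma=0.99):
    """REINFORCE gradient for a linear softmax policy pi(a|s) = softmax(theta @ phi(s)),
    accumulated over one episode of (features, action, reward) tuples."""
    # Discounted return G_t from every time step to the end of the episode.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    grad = np.zeros_like(theta)
    for (phi, a, _), G in zip(episode, returns):
        probs = softmax(theta @ phi)
        # grad log pi(a|s): row k gets (1{k == a} - pi_k) * phi
        dlog = -np.outer(probs, phi)
        dlog[a] += phi
        grad += G * dlog
    return grad

# Toy usage: 3 actions, 4 state features, one short synthetic episode.
rng = np.random.default_rng(0)
theta = np.zeros((3, 4))
episode = [(rng.random(4), 1, 1.0), (rng.random(4), 0, 0.0)]
theta += 0.01 * reinforce_gradient(theta, episode)  # one gradient-ascent step
```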
Actor-Critic
- Advantage Actor Critic (A2C) (a one-step actor-critic update is sketched after this list)
- Asynchronous Advantage Actor-Critic (A3C) – 2016
- Generalized Advantage Estimation (GAE) – 2015
- Trust Region Policy Optimization (TRPO) – 2015
- Deterministic Policy Gradient (DPG) – 2014
- Deep Deterministic Policy Gradients (DDPG) – 2015
- Distributed Distributional Deterministic Policy Gradients (D4PG) – 2018
- Twin Delayed Deep Deterministic Policy Gradient (TD3) – 2018
- Distributed PPO (DPPO) – 2017
- Clipped PPO (CPPO) – 2017
- Actor Critic using Kronecker-Factored Trust Region (ACKTR) – 2017
- Actor-Critic with Experience Replay (ACER) – 2016
- Soft Actor-Critic (SAC) – 2018
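
All of these methods share the same skeleton: a critic that estimates values and an actor that is nudged in the direction of the critic's advantage signal. The sketch below shows that skeleton for a one-step tabular case; it deliberately omits the parallel workers, replay buffers, trust regions, and target networks that distinguish the individual algorithms above, and its hyperparameters are placeholders.

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One-step advantage actor-critic update with a tabular softmax actor
    theta[s, a] and a tabular critic V[s]."""
    # The TD error doubles as a one-step advantage estimate.
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += alpha_critic * td_error                # critic update
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    dlog = -probs
    dlog[a] += 1.0                                 # grad log pi(a|s) for a softmax actor
    theta[s] += alpha_actor * td_error * dlog      # actor update

# Toy usage: 4 states, 2 actions.
theta, V = np.zeros((4, 2)), np.zeros(4)
actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=2, done=False)
```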
General Agents
- Direct Future Prediction (DFP) – 2016
- Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
- Relative Entropy Policy Search (REPS)
- Reward-Weighted Regression (RWR)
Imitation Learning Agents
- Behavioral Cloning (BC) (see the regression sketch after this list)
- Conditional Imitation Learning – 2017
- Generative Adversarial Imitation Learning (GAIL) – 2016
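
Behavioral cloning reduces imitation to supervised learning on expert (state, action) pairs. The ridge-regression policy below is a deliberately simple sketch of that idea, assumed for illustration only; it is not the model used in any of the papers above.

```python
import numpy as np

def behavioral_cloning_linear(states, actions, l2=1e-3):
    """Fit a linear policy a = W @ s to expert (state, action) pairs with
    ridge regression -- the simplest possible behavioral-cloning baseline."""
    S, A = np.asarray(states), np.asarray(actions)
    W = np.linalg.solve(S.T @ S + l2 * np.eye(S.shape[1]), S.T @ A).T
    return W

# Toy usage: 100 expert pairs with 4-dim states and 2-dim continuous actions.
rng = np.random.default_rng(0)
S = rng.standard_normal((100, 4))
A = S @ rng.standard_normal((4, 2))            # synthetic "expert" actions
W = behavioral_cloning_linear(S, A)
print(np.allclose(W @ S[0], A[0], atol=1e-2))  # the cloned policy imitates the expert
```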
Hierarchical Reinforcement Learning Agents
- Hierarchical Actor Critic (HAC) – 2017
Memory Types
- Prioritized Experience Replay (PER) – 2015 (see the buffer sketch after this list)
- Hindsight Experience Replay (HER) – 2017
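
Both techniques change what is replayed rather than how the agent learns. The class below is a stripped-down sketch of proportional prioritized replay; it omits the sum-tree and the importance-sampling weights used in the PER paper, and its names and defaults are illustrative assumptions.

```python
import numpy as np
from collections import deque

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay: transitions are sampled with
    probability proportional to (|TD error| + eps) ** alpha."""
    def __init__(self, capacity=10000, alpha=0.6):
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha

    def add(self, transition, td_error=1.0):
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs /= probs.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx], idx

# Toy usage: store a few (s, a, r, s') tuples and draw a prioritized batch.
buf = PrioritizedReplayBuffer()
for i in range(5):
    buf.add((i, 0, float(i), i + 1), td_error=float(i))
batch, indices = buf.sample(batch_size=2)
```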
Exploration Techniques
- Epsilon-Greedy (ε-greedy; sketched together with Boltzmann after this list)
- Boltzmann
- Ornstein–Uhlenbeck process
- Normal Noise
- Truncated Normal Noise
- Bootstrapped Deep Q Network
- UCB Exploration via Q-Ensembles (UCB)
- Noisy Networks for Exploration
- Intrinsic Curiosity Module (ICM) – 2017
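
The two simplest techniques on this list fit in a few lines each. The sketch below assumes a vector of Q-values and uses illustrative default parameters.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore uniformly, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample an action with probability proportional to exp(Q / temperature).
    z = np.asarray(q_values) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

# Toy usage with three actions.
q = np.array([0.2, 1.0, -0.5])
print(epsilon_greedy(q), boltzmann(q))
```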
2. Model-Based
- Dyna-Q (see the planning sketch after this list)
- Dataset Aggregation (DAgger)
- Monte Carlo Tree Search (MCTS) (e.g., AlphaZero)
- Dynamic Programming
- Model Predictive Control
- Probabilistic Inference for Learning Control (PILCO)
- Guided Policy Search (GPS)
- Policy search with Gaussian Process
- Policy search with backpropagation
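
As one concrete example from this family, here is a minimal Dyna-Q sketch: learn from real experience, store that experience in a model, and replay simulated transitions from the model for extra planning updates. The tabular setup and hyperparameters are placeholders.

```python
import numpy as np
import random

def dyna_q_step(Q, model, s, a, r, s_next, n_planning=10, alpha=0.1, gamma=0.99):
    """One Dyna-Q iteration: a direct Q-learning update from real experience,
    then n_planning updates replayed from a learned (deterministic) model."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])   # direct RL
    model[(s, a)] = (r, s_next)                                    # model learning
    for _ in range(n_planning):                                    # planning
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps_next]) - Q[ps, pa])

# Toy usage: 5 states, 2 actions, empty model to start with.
Q, model = np.zeros((5, 2)), {}
dyna_q_step(Q, model, s=0, a=1, r=1.0, s_next=2)
```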
Summary
| Algorithm | Model-free or model-based | Agent type | Policy | Policy type | Monte Carlo or Temporal difference (TD) | Action space | State space |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tabular Q-learning (= SARSA max), Q-learning(λ) | Model-free | Value-based | Off-policy | Pseudo-deterministic (epsilon-greedy) | TD | Discrete | Discrete |
| SARSA, SARSA(λ) | Model-free | Value-based | On-policy | Pseudo-deterministic (epsilon-greedy) | TD | Discrete | Discrete |
| DQN, N-step DQN, Double DQN, Noisy DQN, Prioritized Replay DQN, Dueling DQN, Categorical/Distributional DQN (C51) | Model-free | Value-based | Off-policy | Pseudo-deterministic (epsilon-greedy) | TD | Discrete | Continuous |
| Cross-entropy method | Model-free | Policy-based | On-policy | | Monte Carlo | | |
| REINFORCE (Vanilla policy gradient) | Model-free | Policy-based | On-policy | Stochastic policy | Monte Carlo | | |
| Policy gradient softmax | Model-free | | | Stochastic policy | | | |
| Natural Policy Gradient | Model-free | | | Stochastic policy | | | |
| TRPO | Model-free | Policy-based | On-policy | Stochastic policy | | Continuous | Continuous |
| PPO | Model-free | Policy-based | On-policy | Stochastic policy | | Continuous | Continuous |
| Distributed PPO | Model-free | Policy-based | | | | Continuous | Continuous |
| A2C | Model-free | Actor-critic | On-policy | Stochastic policy | TD | Continuous | |
| A3C | Model-free | Actor-critic | On-policy | | | | |
| DDPG (DPG family) | Model-free | Actor-critic | Off-policy | Deterministic policy | | Continuous | Continuous |
| TD3 | Model-free | Actor-critic | | | | Continuous | Continuous |
| D4PG | Model-free | Actor-critic | | | | | |
| SAC | Model-free | Actor-critic | Off-policy | | | | |
| Dyna-Q | Model-based | | | | | | |
| Curiosity Model | | | | | | | |
| NAF | Model-free | | | | | Continuous | |
| DAgger | | | | | | | |
| MCTS | Model-based | | | | | | |
| Dynamic programming | Model-based | | | | | | |
| GPS | Model-based | | | | | | |
| Model Predictive Control | Model-based | | | | | | |
| PILCO | Model-based | | | | | | |
| Policy search with Gaussian Process | Model-based | | | | | | |
| Policy search with backpropagation | Model-based | | | | | | |
Conclusion
We have just seen some of the most widely used RL algorithms. In the next article, we will look at the challenges and applications of RL in robotics.