Reinforcement Learning Algorithms

Reinforcement learning is mainly applied to scenarios such as game AI, autonomous driving, and biomimetic robots. In an environment (a game, a road being driven on), executing a particular action changes the state and produces a different outcome, which corresponds to a reward that may be positive or negative; any scenario of this kind can be handled with reinforcement learning.

The DQN algorithm was proposed by DeepMind (now part of Google), and DeepMind's AlphaGo (the Go-playing program) grew out of the same line of deep reinforcement learning research.

Traditional Reinforcement Learning Methods

Q-learning

Q-learning is an off-policy algorithm. It keeps a Q-table that records a Q value (weight) for each action in each state. The Q value of the current state-action pair is updated using the maximum Q value over the next state's actions. By propagating the final reward backwards in this way, the agent gradually learns.

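As a minimal sketch of the tabular update (illustrative code, not taken from the repository linked below; Q is assumed to be a table mapping each state to an array of per-action Q values):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: take the maximum Q value over the next state's actions,
    # regardless of which action the behaviour policy will actually execute.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])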

State-Action-Reward-State-Action (SARSA)

SARSA is an on-policy algorithm. It also keeps a Q-table that records a Q value (weight) for each action in each state, but it updates the current state-action pair with the Q value of the action that will actually be executed in the next state.

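The corresponding SARSA sketch (again illustrative, with Q as in the Q-learning sketch above) differs only in the target: a_next is the action the policy actually chooses in the next state.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: use the Q value of the action that will really be executed next.
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])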

Deep Q Network (DQN)

The DQN algorithm was proposed by DeepMind at NIPS in 2013. Its key technique is Experience Replay: the data the agent collects while exploring the environment is stored, and random samples drawn from this store are used to update the parameters of the deep neural network.

The motivation for Experience Replay is that 1) a deep neural network, as a supervised learning model, expects its training data to be independent and identically distributed, but 2) consecutive samples produced by Q-learning are correlated. By storing transitions and sampling them at random, Experience Replay breaks this correlation.
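
A minimal replay buffer can be sketched as follows (illustrative only; the implementation further below stores transitions in a fixed-size numpy array instead):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=2000):
        # A bounded deque: the oldest transitions are dropped automatically.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)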

In early 2015 DeepMind published a follow-up paper in Nature that introduced the concept of a Target Q network to further reduce correlations in the data: the target values are computed with an older copy of the network, with parameters w⁻. The optimization objective of Q-learning with Target Q is shown below.
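
In standard notation, the objective is

L(w) = E_{(s, a, r, s') ~ D} [ ( r + γ · max_{a'} Q(s', a'; w⁻) − Q(s, a; w) )² ]

where D is the replay memory, w are the current eval-network parameters, and w⁻ are the periodically frozen target-network parameters.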

DQN models the Q function with a neural network in place of the Q-table; it is an evolution of the Q-learning algorithm.
The changes to the basic setup are as follows:

  • A replay memory is added, recording each state-action pair that occurred (the action chosen in that state), the reward, and the next state reached after executing the action.
  • Q values are computed with a neural network, the eval model.
  • A separate target model keeps an older copy of the parameters; the q_target parameters are temporarily frozen, which cuts the correlation.

The Q value update is based on the target model's output (the maximum Q value it predicts for the next state); the reward-adjusted Q value is obtained with the calculation below. Because that value is produced by the frozen model, it differs somewhat from the current prediction, which cuts the correlation and allows the model to converge.

q_target(s, a) = r + γ · max_{a'} Q_target(s', a'; w⁻)

The learning rate is part of the gradient descent optimizer used to train the model.

The squared error between q_target and the eval model's prediction q_eval is computed, and the eval model's weights are updated from it.

Every so often, the eval model's parameters are copied into the target model.

The tutorial code below is adapted from Morvan Zhou's demo and runs on TensorFlow 2.0.

Training script, source: https://github.com/tfwcn/Reinforcement-learning-with-tensorflow/blob/master/contents/5_Deep_Q_Network/run_this.py

from maze_env import Maze
from RL_brain_tf2 import DeepQNetwork


def run_maze():
    step = 0
    for episode in range(300):
        # initial observation
        observation = env.reset()

        while True:
            # fresh env
            env.render()

            # RL choose action based on observation
            # Choose an action for the current state: 10% of the time a random action, otherwise the action with the largest Q value
            action = RL.choose_action(observation)

            # RL take action and get next observation and reward
            # Execute the action; get the next state, the reward, and the done flag
            observation_, reward, done = env.step(action)
            
            # Store the state, action, reward and next state in the replay memory
            RL.store_transition(observation, action, reward, observation_)

            # After 50 steps, train on a random batch from the replay memory every 5 steps
            if (step > 50) and (step % 5 == 0):
                RL.learn()

            # swap observation
            # Move on to the next state
            observation = observation_

            # break while loop when end of this episode
            # If the episode has ended, start the next one
            if done:
                break
            step += 1

    # end of game
    print('game over')
    env.destroy()


if __name__ == "__main__":
    # maze game
    env = Maze()
    RL = DeepQNetwork(env.n_actions, env.n_features,
                      learning_rate=0.01,
                      reward_decay=0.9,
                      e_greedy=0.9,
                      replace_target_iter=50,
                      memory_size=2000,
                      # output_graph=True
                      )
    env.after(100, run_maze)
    env.mainloop()
    RL.plot_cost()

DQN code, adapted from Morvan Zhou's source and implemented with TensorFlow 2.0, source: https://github.com/tfwcn/Reinforcement-learning-with-tensorflow/blob/master/contents/5_Deep_Q_Network/RL_brain_tf2.py

"""
This part of code is the DQN brain, which is a brain of the agent.
All decisions are made in here.
Using Tensorflow to build the neural network.

View more on my tutorial page: https://morvanzhou.github.io/tutorials/

Using:
Tensorflow: 2.0
gym: 0.7.3
"""

import numpy as np
import pandas as pd
import tensorflow as tf

np.random.seed(1)
tf.random.set_seed(1)


# Deep Q Network off-policy
class DeepQNetwork:
    def __init__(
            self,
            n_actions,
            n_features,
            learning_rate=0.01,
            reward_decay=0.9,
            e_greedy=0.9,
            replace_target_iter=300,
            memory_size=500,
            batch_size=32,
            e_greedy_increment=None,
            output_graph=False,
    ):
        '''
        n_actions: 4, number of actions (up, down, left, right)
        n_features: 2, number of state features (x, y)
        '''
        print('n_actions:', n_actions)
        print('n_features:', n_features)
        print('learning_rate:', learning_rate)
        print('reward_decay:', reward_decay)
        print('e_greedy:', e_greedy)
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon_max = e_greedy
        self.replace_target_iter = replace_target_iter
        self.memory_size = memory_size
        self.batch_size = batch_size
        self.epsilon_increment = e_greedy_increment
        self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

        # total learning step
        self.learn_step_counter = 0

        # initialize zero memory [s, a, r, s_]
        self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

        # consist of [target_net, evaluate_net]
        self._build_net()
        self.cost_his = []

    def _build_net(self):
        '''Build the eval (prediction) network and the target network'''
        # ------------------ build evaluate_net ------------------
        # Input is the current state; shape excludes the batch dimension.
        s = tf.keras.Input(shape=[self.n_features], name='s')
        # Eval network: predicts one Q value per action.
        x = tf.keras.layers.Dense(20, activation=tf.keras.activations.relu, name='l1')(s)
        x = tf.keras.layers.Dense(self.n_actions, name='l2')(x)
        self.eval_net = tf.keras.Model(inputs=s, outputs=x)
        # Loss function
        self.loss = tf.keras.losses.MeanSquaredError()
        # Gradient descent optimizer (this is where the learning rate lives)
        self._train_op = tf.keras.optimizers.RMSprop(learning_rate=self.lr)

        # ------------------ build target_net ------------------
        s_ = tf.keras.Input(shape=[self.n_features], name='s_')
        # Target network: same architecture, but its parameters are only updated
        # periodically by copying from the eval network.
        x = tf.keras.layers.Dense(20, activation=tf.keras.activations.relu, name='l1')(s_)
        x = tf.keras.layers.Dense(self.n_actions, name='l2')(x)
        self.target_net = tf.keras.Model(inputs=s_, outputs=x)
    
    def replace_target(self):
        '''Copy the eval network's weights into the target network'''
        self.target_net.get_layer(name='l1').set_weights(self.eval_net.get_layer(name='l1').get_weights())
        self.target_net.get_layer(name='l2').set_weights(self.eval_net.get_layer(name='l2').get_weights())

    def store_transition(self, s, a, r, s_):
        '''Store a transition in the replay memory'''
        if not hasattr(self, 'memory_counter'):
            self.memory_counter = 0

        transition = np.hstack((s, [a, r], s_))

        # replace the old memory with new memory
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition

        self.memory_counter += 1

    def choose_action(self, observation):
        '''Choose an action for the given state (epsilon-greedy)'''
        # to have batch dimension when feed into tf placeholder
        observation = observation[np.newaxis, :]

        if np.random.uniform() < self.epsilon:
            # forward feed the observation and get q value for every actions
            actions_value = self.eval_net(observation).numpy()
            action = np.argmax(actions_value)
        else:
            action = np.random.randint(0, self.n_actions)
        return action

    def learn(self):
        '''Learn from a random batch drawn from the replay memory'''
        # check to replace target parameters
        if self.learn_step_counter % self.replace_target_iter == 0:
            self.replace_target()
            print('\ntarget_params_replaced\n')

        # sample batch memory from all memory
        if self.memory_counter > self.memory_size:
            sample_index = np.random.choice(self.memory_size, size=self.batch_size)
        else:
            sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
        batch_memory = self.memory[sample_index, :]

        with tf.GradientTape() as tape:
            # Targets come from the target network, i.e. the parameters frozen replace_target_iter steps earlier
            q_next = self.target_net(batch_memory[:, -self.n_features:]).numpy()
            # Predictions come from the current eval network
            q_eval = self.eval_net(batch_memory[:, :self.n_features])

            # change q_target w.r.t q_eval's action
            q_target = q_eval.numpy()

            batch_index = np.arange(self.batch_size, dtype=np.int32)
            eval_act_index = batch_memory[:, self.n_features].astype(int)
            reward = batch_memory[:, self.n_features + 1]

            # Update only the entry for the action actually taken: reward + decay factor * max Q of the next state (from the target network).
            # All other actions keep the eval network's prediction, so only the chosen action contributes to the loss.
            q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

            """
            For example in this batch I have 2 samples and 3 actions:
            q_eval =
            [[1, 2, 3],
            [4, 5, 6]]

            q_target = q_eval =
            [[1, 2, 3],
            [4, 5, 6]]

            Then change q_target with the real q_target value w.r.t the q_eval's action.
            For example in:
                sample 0, I took action 0, and the max q_target value is -1;
                sample 1, I took action 2, and the max q_target value is -2:
            q_target =
            [[-1, 2, 3],
            [4, 5, -2]]

            So the (q_target - q_eval) becomes:
            [[(-1)-(1), 0, 0],
            [0, 0, (-2)-(6)]]

            We then backpropagate this error w.r.t the corresponding action to network,
            leave other action as error=0 cause we didn't choose it.
            """

            # train eval network
            self.cost = self.loss(y_true=q_target,y_pred=q_eval)
            # print('loss:', self.cost)
            
        gradients = tape.gradient(
            self.cost, self.eval_net.trainable_variables)

        self._train_op.apply_gradients(
            zip(gradients, self.eval_net.trainable_variables))
        self.cost_his.append(self.cost)

        # increasing epsilon
        # The probability of a random action shrinks as training proceeds (unchanged by default, since e_greedy_increment is None)
        self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
        self.learn_step_counter += 1

    def plot_cost(self):
        '''Plot the recorded loss history'''
        import matplotlib.pyplot as plt
        plt.plot(np.arange(len(self.cost_his)), self.cost_his)
        plt.ylabel('Cost')
        plt.xlabel('training steps')
        plt.show()