AlphaZero for Gomoku (five-in-a-row): a code walkthrough
1. Play
I previously jumped straight into training, which made the MCTS code feel overwhelming, so this walkthrough starts with how a game is actually played.
play
class Play(object):

    def __init__(self):
        net = Net()
        if USECUDA:  # False on this machine
            net = net.cuda()
        net.load_model("model.pt", cuda=USECUDA)
        self.net = net
        self.net.eval()  # switch to inference mode; printing the returned module shows the network structure
        # print("Play __init__ self.net.eval() is {0}".format(self.net.eval()))

    def go(self):
        print("One rule:\r\n Move piece form 'x,y' \r\n eg 1,3\r\n")
        print("-" * 60)
        print("Ready Go")

        mc = MonteCarloTreeSearch(self.net, 1000)
        node = TreeNode()
        board = Board()

        while True:
            print("Play board.c_player is {0}".format(board.c_player))  # which side is to move
            if board.c_player == BLACK:
                action = input(f"Your piece is 'O' and move: ")
                action = [int(n, 10) for n in action.split(",")]
                action = action[0] * board.size + action[1]  # flatten (row, col) into one index: '1,0' -> 8, '2,0' -> 16, ...
                print("Play and action is {0}".format(action))
                next_node = TreeNode(action=action)  # unspecified fields keep their defaults (None / zero)
            else:  # trigger() at the end of the previous loop switched c_player, so it is the AI's turn
                _, next_node = mc.search(board, node)  # let MCTS pick white's move

            board.move(next_node.action)
            board.show()

            next_node.parent = None  # detach: the chosen child becomes the new search root
            node = next_node

            if board.is_draw():
                print("board has been completely filled\n")
                print("-" * 28 + "Draw" + "-" * 28)
                return

            if board.is_game_over():  # the side that just moved has five in a row
                if board.c_player == BLACK:
                    print("-" * 28 + "Win" + "-" * 28)
                else:
                    print("-" * 28 + "Loss" + "-" * 28)
                return

            board.trigger()  # hand the turn to the other player
The play loop is simple: the human plays one side (BLACK, shown as 'O'), and AlphaZero's MCTS computes the other side's moves, so the two alternate until the game ends.
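A minimal way to start a game, assuming the module exposes Play as above (this entry point is my own sketch, not necessarily how the repo wires it up):

    # hypothetical entry point
    if __name__ == "__main__":
        Play().go()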
- What does the num passed to MonteCarloTreeSearch control?
It is the number of search rounds per move, i.e. how many simulations are run (and how many times the tree statistics are updated) before a move is chosen.
- Why set next_node.parent = None?
Because the chosen child becomes the new root, and the root is the node that gets Dirichlet noise added to its priors (the search checks node.parent is None).
mcts
How does AlphaZero decide where to play next? Through MonteCarloTreeSearch.search. A quick refresher on MCTS (quoted from Wikipedia):
Each round of Monte Carlo tree search consists of four steps:[4]
· Selection: start from root R and select successive child nodes down to a leaf node L. The section below says more about a way of choosing child nodes that lets the game tree expand towards the most promising moves, which is the essence of Monte Carlo tree search.
· Expansion: unless L ends the game with a win/loss for either player, create one (or more) child nodes and choose node C from one of them.
· Simulation: play a random playout from node C. This step is sometimes also called playout or rollout.
· Backpropagation: use the result of the playout to update information in the nodes on the path from C to R.
Each tree node stores the number of won/played playouts. (The illustrating figure is not reproduced here; a minimal sketch follows below.)
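In place of that figure, here is a minimal, generic sketch of one MCTS round (plain UCT with a random rollout, not this repo's network-guided variant). The Node class and the game interface (clone, move, legal_moves, is_game_over, random_playout) are placeholders of my own, just to make the four steps concrete:

    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        action: object = None
        parent: "Node" = None
        children: list = field(default_factory=list)
        n: int = 0       # visit count
        q: float = 0.0   # running mean value

    def mcts_round(root, game):
        # 1. Selection: descend by UCT until we reach a leaf
        node, state = root, game.clone()
        while node.children:
            node = max(node.children,
                       key=lambda c: c.q + math.sqrt(2.0 * math.log(node.n + 1) / (c.n + 1)))
            state.move(node.action)
        # 2. Expansion: unless the position is terminal, add its children
        if not state.is_game_over():
            node.children = [Node(action=a, parent=node) for a in state.legal_moves()]
        # 3. Simulation: random playout from the leaf
        value = state.random_playout()
        # 4. Backpropagation: update every node on the path back to the root
        while node is not None:
            node.n += 1
            node.q += (value - node.q) / node.n
            value = -value               # flip perspective for the opponent one level up
            node = node.parent

In this repo the Simulation step is replaced by the value head of the network, as the code below shows.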
class MonteCarloTreeSearch(object):

    def __init__(self, net,
                 ms_num=MCTSSIMNUM):
        self.net = net
        self.ms_num = ms_num  # number of simulations per move; MCTSSIMNUM is 400
        print("self.ms_num is {0}".format(self.ms_num))

    """
    1. Walk from the root down to a leaf node.
    2. Feed the current position to the network to get move probabilities and a value estimate.
    3. Back the value up from the leaf all the way to the root.
    """
    def search(self, borad, node, temperature=.001):
        self.borad = borad
        self.root = node

        for _ in range(self.ms_num):
            node = self.root
            borad = self.borad.clone()  # work on a copy so the real board is untouched by the simulation
            print("node is {0} borad is {1}, num is {2}".format(node, borad, _))
            print("node.is_leaf is {0}".format(node.is_leaf()))

            while not node.is_leaf():  # node.is_leaf() returns True or False
                print("while node.is_leaf is {0}".format(node.is_leaf()))  # not entered on the very first simulation
                node = node.select_child()
                borad.move(node.action)  # replay the selected move on the cloned board
                borad.trigger()          # switch to the other player; the clone advances one ply per step
            print("search borad show and num is {0}".format(_))
            borad.show()

            # be careful - this is now the opponent's state
            # (part of the net's forward pass is already being invoked here)
            """
            The Zero-style net takes the recent and current board planes in binary form (0/1)
            and outputs a policy p and a value v, where p is the probability of playing on each
            point of the board and v estimates the current player's chance of winning.
            """
            # treat the net as a black box for now
            value, props = self.net(  # returned from the net's forward()
                to_tensor(borad.gen_state(), unsqueeze=True))  # unsqueeze adds the batch dimension
            # print("before MonteCarloTreeSearch value is {0}".format(value))  # tensor([[0.0450]], grad_fn=<TanhBackward>)
            # print("before MonteCarloTreeSearch props is {0}".format(props))  # torch.Size([1, 64])
            value = to_numpy(value, USECUDA)[0]  # USECUDA is False; convert to a numpy scalar
            # print("value is {0} USECUDA is {1}".format(value, USECUDA))
            props = np.exp(to_numpy(props, USECUDA))  # the net outputs log-probabilities, so exponentiate
            print("after MonteCarloTreeSearch value is {0}".format(value))
            # print("after MonteCarloTreeSearch props is {0}".format(props))  # shape (64,), now real probabilities

            # add Dirichlet noise for the root node to encourage exploration
            print("node.parent is {0}\t borad.invalid_moves is {1}".format(node.parent, borad.invalid_moves))
            if node.parent is None:  # only the root has no parent
                props = self.dirichlet_noise(props)

            # normalize: zero out occupied points and rescale the rest
            # print("before now prop is {0}, borad.invalid_moves is {1}".format(props, borad.invalid_moves))
            props[borad.invalid_moves] = 0.
            # print("after now prop is {0}, borad.invalid_moves is {1}".format(props, borad.invalid_moves))
            total_p = np.sum(props)  # sum of the remaining probabilities
            # print("total_p is {0}\t props is {1}".format(total_p, props))
            if total_p > 0:
                props /= total_p  # so the legal-move probabilities sum to 1 again

            # winner, draw or continue
            if borad.is_draw():  # board full with no winner: a draw is worth 0
                print("enter value = 0.")
                value = 0.
            else:
                print("not enter value = 0.\t borad.last_player is {0}".format(borad.last_player))
                done = borad.is_game_over(player=borad.last_player)  # last_player is the side that just moved
                print("done is {0}".format(done))
                if done:
                    # the opponent (last_player) just won, so this position is a loss
                    # for the player to move here
                    value = -1.
                else:
                    node.expand_node(props)  # create one child per legal move with its prior probability

            # backpropagation: walk back to the root, flipping the sign at every level
            # because parent and child positions belong to opposite players
            while node is not None:
                value = -value
                node.backup(value)
                node = node.parent
                print("search and node is {0}".format(node is not None))

        # after all simulations, choose a move from the root's visit counts
        action_times = np.zeros(borad.size**2)  # visit count per action, shape (64,)
        for child in self.root.children:
            action_times[child.action] = child.N
        print("action_times is {0}".format(action_times))  # only the root's children matter for choosing the move

        action, pi = self.decision(action_times, temperature)
        print("search action is {0}, pi is {1}".format(action, pi))
        for child in self.root.children:
            if child.action == action:
                print("search pi is {0}\n child is {1}".format(pi, child))
                return pi, child  # pi is the search distribution; child becomes the new root

    @staticmethod
    def dirichlet_noise(props, eps=DLEPS, alpha=DLALPHA):  # DLEPS is 0.25, DLALPHA is 0.03
        # np.random.dirichlet(np.full(n, alpha)) draws a random probability vector of length n
        return (1 - eps) * props + eps * np.random.dirichlet(np.full(len(props), alpha))

    @staticmethod
    def decision(pi, temperature):  # sample an action according to the visit counts
        # temperature in (0, 1] controls the level of exploration
        pi = (1.0 / temperature) * np.log(pi + 1e-10)
        # the next two lines are a numerically stable softmax
        pi = np.exp(pi - np.max(pi))
        pi /= np.sum(pi)
        action = np.random.choice(len(pi), p=pi)  # sample an index according to the probabilities pi
        return action, pi
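To see what temperature does in decision(), here is a small standalone example with made-up visit counts: temperature = 1 keeps the distribution proportional to the visits, while a tiny temperature concentrates almost all of the mass on the most-visited move.

    import numpy as np

    def visits_to_pi(visits, temperature):
        # same transform as MonteCarloTreeSearch.decision()
        logits = (1.0 / temperature) * np.log(np.asarray(visits, dtype=float) + 1e-10)
        pi = np.exp(logits - np.max(logits))
        return pi / np.sum(pi)

    visits = [10, 30, 60]               # made-up visit counts for three moves
    print(visits_to_pi(visits, 1.0))    # ~[0.1, 0.3, 0.6]
    print(visits_to_pi(visits, 0.001))  # ~[0.0, 0.0, 1.0]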
That is roughly what the MCTS looks like.
Each round here consists of three steps (the network's value estimate takes the place of the random rollout):
selection
Selection is simple: pick the child TreeNode with the largest Q + U value.
    def select_child(self):  # pick the child with the largest Q + U
        index = np.argmax(np.asarray([c.uct() for c in self.children]))
        return self.children[index]

    def uct(self):  # the per-child score; CPUCT is 5
        return self.Q + self.P * CPUCT * (np.sqrt(self.parent.N) / (1 + self.N))
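Written out, uct() is the PUCT rule used by AlphaZero-style programs, with the notation matching the TreeNode fields shown below:

    U(s,a) = Q(s,a) + c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{N(s)}}{1 + N(s,a)}

Here N(s) is the parent's visit count, N(s,a) and Q(s,a) are the child's visit count and mean value, P(s,a) is the prior from the policy head, and c_puct = CPUCT = 5 in this repo.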
expand
The net gives a prior probability (props) for every move; the leaf reached by repeatedly following the largest Q + U child is then expanded with those priors.
    def expand_node(self, props):  # props carries the prior probability for every action index
        # print("TreeNode props is {0}".format(props))
        # enumerate() pairs each prior with its action index,
        # so this creates one child TreeNode per legal move (up to 64 of them)
        self.children = [TreeNode(action=action, props=p, parent=self)
                         for action, p in enumerate(props) if p > 0.]
        # for action, p in enumerate(props):
        #     print("action is {0} \t p is {1}".format(action, p))
        # print("expand_node self.children is {0}".format(self.children))
The fields N, Q and W are updated during backpropagation.
    def __init__(self,
                 action=None,
                 props=None,
                 parent=None):
        self.parent = parent   # parent TreeNode (None for the root)
        self.action = action   # the move that led to this node
        self.children = []
        self.P = props         # prior probability from the policy head
        self.N = 0             # visit count
        self.Q = .0            # mean action value (W / N)
        self.W = .0            # total action value
backpropagation
    def backup(self, v):  # "backup" = fold the simulation value into this node's statistics
        self.N += 1
        self.W += v        # accumulate the value passed up from the leaf
        self.Q = self.W / self.N
Backpropagation is the reverse of selection: it only updates N, W and Q on the nodes along the path that was actually visited.
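A small worked example of the sign flip in search()'s backup loop (the names here are made up for illustration): if the leaf position is evaluated as value = -1, i.e. a loss for the player to move there, then the leaf node itself, which represents the opponent's last move, records +1, its parent records -1, and so on up to the root.

    # illustrative only: mimics the `while node is not None` loop in search()
    path_from_leaf_to_root = ["leaf", "parent", "grandparent", "root"]
    value = -1.0                 # net / terminal evaluation at the leaf position
    for name in path_from_leaf_to_root:
        value = -value           # each level up belongs to the other player
        print(name, "backs up", value)
    # leaf backs up 1.0, parent backs up -1.0, grandparent backs up 1.0, root backs up -1.0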
NET
PyTorch models can also be visualized with TensorBoard, something to try later: https://www.jianshu.com/p/46eb3004beca
I drew the structure by hand and only then realized that simply printing the model gives the same thing; I am pasting the printout anyway, since the drawing took time.
The full printed net structure is in the appendix at the end.
Playing games does not require training: the parameters live inside the network and the intermediate computations are invisible, so at this stage the net really is a black box.
How to build a net in PyTorch, and how forward() gets called when the module is invoked, is covered in another chapter.
2. Train
Just playing is relatively simple, but how is the model (the .pt file) trained?
Training needs data. Here a mini-batch of batch_size positions is sampled at random from the accumulated self-play data.
dataloader
class DataLoader(object):

    def __init__(self, cuda, batch_size):
        print("DataLoader batch_size is {0}\t cuda is {1}".format(batch_size, cuda))
        self.cuda = cuda
        self.bsz = batch_size

    def __call__(self, datas):
        # __call__ runs when the instance itself is called, not when the object is constructed
        print("enter __call__ datas is {0}".format(datas))
        mini_batch = random.sample(datas, self.bsz)
        states, pi, rewards = [], [], []
        for s, p, r in mini_batch:
            states.append(s)
            pi.append(p)
            rewards.append(r)

        states = to_tensor(np.stack(states, axis=0), use_cuda=self.cuda)
        pi = to_tensor(np.stack(pi, axis=0), use_cuda=self.cuda)
        rewards = to_tensor(np.stack(rewards, axis=0), use_cuda=self.cuda)
        return states, pi, rewards.view(-1, 1)
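A minimal usage sketch, assuming sample_data is the deque of (state, pi, reward) records filled during self-play (shown further down):

    # hypothetical usage of the loader above
    loader = DataLoader(cuda=False, batch_size=MINIBATCH)
    states, pi, rewards = loader(sample_data)   # calling the instance runs __call__
    # states/pi/rewards are tensors with a leading batch dimension of MINIBATCH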
optimizer
class ScheduledOptim(object):

    def __init__(self, optimizer, lr):
        self.lr = lr
        self.optimizer = optimizer

    def step(self):  # delegate to the wrapped optimizer
        self.optimizer.step()

    def zero_grad(self):  # delegate to the wrapped optimizer
        self.optimizer.zero_grad()

    def update_learning_rate(self, lr_multiplier):  # rescale the base learning rate
        print("enter update_learning_rate lr_multiplier is {0}".format(lr_multiplier))
        new_lr = self.lr * lr_multiplier
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = new_lr
entropy
class AlphaEntropy(nn.Module):  # the combined AlphaZero loss: value MSE + policy cross-entropy

    def __init__(self):
        super().__init__()
        self.v_loss = nn.MSELoss()  # mean squared error for the value head
        # print("self.v_loss is {0}".format(self.v_loss))

    def forward(self, props, v, pi, reward):  # props: log-policy from the net; pi: MCTS search distribution
        print("AlphaEntropy forward v is {0}\t reward is {1}".format(v, reward))
        v_loss = self.v_loss(v, reward)  # reward is the game outcome z from self-play
        p_loss = -torch.mean(torch.sum(props * pi, 1))  # cross-entropy between pi and the net's policy
        print("AlphaEntropy forward p_loss + v_loss is {0}".format(p_loss + v_loss))
        return p_loss + v_loss
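This matches the loss from the AlphaZero paper (the L2 term is handled separately via weight_decay in the Adam optimizer built in gen_optim below):

    l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^{2}

where z is the game outcome, v the value head output, pi the MCTS visit distribution, p the policy head output and theta the network weights.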
The overall flow is roughly:
- Starting from Black, self-play one complete game and collect its (state, pi, player) records.
- Sample a mini-batch from the accumulated data, compute the loss, and backpropagate to update the parameters.
- Every CHECKOUT steps, evaluate the model by playing the latest net against the current best (eval) net.
- Depending on the evaluation result, either save the new weights as the best model or reload the previous ones (a condensed sketch of this loop follows).
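A condensed sketch of that loop, paraphrasing Train.run() shown further below (the function is a simplified placeholder of my own, not the exact code):

    # paraphrase of Train.run(), heavily simplified
    def training_loop(trainer):
        for step in range(1, GAMETIMES + 1):
            game = Game(trainer.net, trainer.eval_net)
            records = game.play()                  # one full self-play game
            trainer.sample(records)                # augment (rotations/flips) and push into the replay deque
            if len(trainer.sample_data) < MINIBATCH:
                continue                           # not enough data to train on yet
            states, pi, rewards = trainer.dl(trainer.sample_data)
            for _ in range(EPOCHS):                # a few gradient steps, early-stopped on a KL threshold
                trainer.optim.zero_grad()
                v, props = trainer.net(states)
                loss = trainer.entropy(props, v, pi, rewards)
                loss.backward()
                trainer.optim.step()
            if step % CHECKOUT == 0:
                pass  # evaluate net vs eval_net, then save the new best model or reload the old one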
play_game
class Game(object):

    def __init__(self, net, evl_net):
        self.net = net
        self.evl_net = evl_net
        self.board = Board()  # a fresh board for this game

    def play(self):  # one complete self-play game
        datas, node = [], TreeNode()
        # print("datas is {0}\t node is {1}".format(datas, node))
        mc = MonteCarloTreeSearch(self.net)  # the tree search built around the current net
        move_count = 0
        while True:
            # print("move_count is {0}".format(move_count))
            if move_count < TEMPTRIG:  # TEMPTRIG is 8: keep temperature = 1 early for exploration
                pi, next_node = mc.search(self.board, node, temperature=1)
            else:
                pi, next_node = mc.search(self.board, node)

            datas.append([self.board.gen_state(), pi, self.board.c_player])  # record the position before moving
            self.board.move(next_node.action)

            next_node.parent = None  # detach: the chosen child becomes the new search root for the next move
            node = next_node

            if self.board.is_draw():
                reward = 0.
                break
            if self.board.is_game_over():
                reward = 1.
                break

            self.board.trigger()
            move_count += 1
            # self.board.show()

        datas = np.asarray(datas)
        # label every stored position with the final outcome from that player's perspective:
        # +reward where it was the eventual winner's turn, -reward otherwise
        # (see the small example after this class)
        datas[:, 2][datas[:, 2] == self.board.c_player] = reward
        datas[:, 2][datas[:, 2] != self.board.c_player] = -reward
        return datas

    def evaluate(self, result):
        self.net.eval()
        self.evl_net.eval()

        # randomly assign colors so neither net always moves first
        if random.randint(0, 1) == 1:
            players = {
                BLACK: (MonteCarloTreeSearch(self.net), "net"),
                WHITE: (MonteCarloTreeSearch(self.evl_net), "eval"),
            }
        else:
            players = {
                WHITE: (MonteCarloTreeSearch(self.net), "net"),
                BLACK: (MonteCarloTreeSearch(self.evl_net), "eval"),
            }

        node = TreeNode()
        while True:  # play one full game between the two nets
            print("self.board.c_player is {0}".format(self.board.c_player))
            print("players[self.board.c_player][0] is {0}".format(players[self.board.c_player][0]))
            print("players[self.board.c_player][1] is {0}".format(players[self.board.c_player][1]))
            _, next_node = players[self.board.c_player][0].search(
                self.board, node)
            self.board.move(next_node.action)

            if self.board.is_draw():
                result[0] += 1
                return
            if self.board.is_game_over():
                if players[self.board.c_player][1] == "net":
                    result[1] += 1
                else:
                    result[2] += 1
                return

            self.board.trigger()
            next_node.parent = None
            node = next_node

    def reset(self):
        self.board = Board()
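A tiny illustration of the relabeling at the end of play(). The encoding BLACK = 1, WHITE = -1 is an assumption made just for this example: every stored position is tagged with +1 if the player to move there eventually won, and -1 otherwise.

    import numpy as np

    # hypothetical 5-move game; this is the third column of datas (c_player when the position was stored)
    players_to_move = np.array([1, -1, 1, -1, 1])   # assumed encoding: BLACK = 1, WHITE = -1
    winner, reward = 1, 1.0                          # the side that made the last (winning) move
    z = np.where(players_to_move == winner, reward, -reward)
    print(z)                                         # [ 1. -1.  1. -1.  1.]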
update_args
class Train(object):

    def __init__(self, use_cuda=USECUDA, lr=LR):
        print("use_cuda is {0}".format(use_cuda))  # False here

        if use_cuda:
            torch.cuda.manual_seed(1234)
        else:
            torch.manual_seed(1234)  # fix the torch random seed for reproducibility

        self.kl_targ = 0.02        # target KL divergence between old and new policy per update round
        self.lr_multiplier = 1.    # adaptive factor applied to the base learning rate
        self.use_cuda = use_cuda

        self.net = Net()
        self.eval_net = Net()  # a second net: the current best model, used as the opponent during evaluation
        if use_cuda:
            self.net = self.net.cuda()       # .cuda() comes from nn.Module
            self.eval_net = self.eval_net.cuda()

        # mini-batch sampler over the replay data
        self.dl = DataLoader(use_cuda, MINIBATCH)  # MINIBATCH is 512
        # print("self.dl is {0}".format(self.dl))
        self.sample_data = deque(maxlen=TRAINLEN)  # replay buffer; TRAINLEN is 10000
        # print("self.sample_data is {0}".format(self.sample_data))

        self.gen_optim(lr)             # build the (wrapped) optimizer
        self.entropy = AlphaEntropy()  # the combined value/policy loss
        print("end of Train __init__ TRAINLEN is {0}".format(TRAINLEN))

    def run(self):
        model_path = f"model_{time.strftime('%Y%m%d%H%M', time.localtime())}.pt"
        print("model_path is {0}".format(model_path))

        self.net.save_model(path=model_path)  # save only the parameters
        self.eval_net.load_model(path=model_path, cuda=self.use_cuda)

        print("GAMETIMES is {0}".format(GAMETIMES))
        for step in range(1, 1 + GAMETIMES):  # GAMETIMES is 3000
            game = Game(self.net, self.eval_net)
            # the f-string evaluates self.sample(game.play()): play one self-play game,
            # then augment and store its records
            print(f"Game - {step} | data length - {self.sample(game.play())}")

            if len(self.sample_data) < MINIBATCH:
                continue

            game.board.show()
            states, pi, rewards = self.dl(self.sample_data)  # sample_data was filled by self.sample() just above
            # print("run states is {0}\n pi is {1}\n rewards is {2}".format(states, pi, rewards))
            _, old_props = self.net(states)  # log-policy before this round of updates
            # break

            for _ in range(EPOCHS):
                self.optim.zero_grad()

                v, props = self.net(states)
                loss = self.entropy(props, v, pi, rewards)
                loss.backward()        # backpropagate to update the parameters
                self.optim.step()

                _, new_props = self.net(states)
                # .item() converts a one-element tensor into a Python scalar
                # (it raises ValueError if the tensor has more than one element)
                # KL(old || new) = sum p_old * (log p_old - log p_new); props are log-probabilities
                kl = torch.mean(torch.sum(
                    torch.exp(old_props) * (old_props - new_props), 1)).item()
                if kl > self.kl_targ * 4:  # the policy moved too far: stop this round of updates early
                    break

            # adapt the learning rate so each update stays close to the KL target
            if kl > self.kl_targ * 2 and self.lr_multiplier > 0.1:
                self.lr_multiplier /= 1.5
            elif kl < self.kl_targ / 2 and self.lr_multiplier < 10:
                self.lr_multiplier *= 1.5
            self.optim.update_learning_rate(self.lr_multiplier)

            print(
                f"kl - {kl} | lr_multiplier - {self.lr_multiplier} | loss - {loss}")
            print("-" * 100 + "\r\n")

            if step % CHECKOUT == 0:  # CHECKOUT is 50
                result = [0, 0, 0]  # draw win loss
                for _ in range(EVALNUMS):  # EVALNUMS is 20
                    # evaluation pits the latest net against eval_net (the current best model)
                    game.reset()  # start each evaluation game from an empty board
                    game.evaluate(result)
                    break  # note: this break leaves only one evaluation game, likely a leftover from debugging

                if result[1] + result[2] == 0:
                    rate = 0
                else:
                    rate = result[1] / (result[1] + result[2])

                print(f"step - {step} evaluation")
                print(
                    f"win - {result[1]} | loss - {result[2]} | draw - {result[0]}")

                # save or reload model
                if rate >= WINRATE:
                    print(f"new best model. rate - {rate}")
                    self.net.save_model(path=model_path)
                    self.eval_net.load_model(
                        path=model_path, cuda=self.use_cuda)
                else:
                    print(f"load last model. rate - {rate}")
                    self.net.load_model(path=model_path, cuda=self.use_cuda)
                print("-" * 100 + "\r\n")
            # break

    def gen_optim(self, lr):
        optim = torch.optim.Adam(self.net.parameters(), lr=lr, weight_decay=L2)  # L2 supplies the c * ||theta||^2 term
        # print("optim is {0}".format(optim))
        self.optim = ScheduledOptim(optim, lr)  # wrap it so update_learning_rate can rescale lr
        # print("self.optim is {0}".format(self.optim))

    def sample(self, datas):  # augment one game's records and push them into the replay buffer
        # print("datas is {0}".format(datas))
        for state, pi, reward in datas:
            c_state = state.copy()
            c_pi = pi.copy()
            for i in range(4):
                # rotate the feature planes and the pi vector (reshaped to the board)
                c_state = np.array([np.rot90(s, i) for s in c_state])
                c_pi = np.rot90(c_pi.reshape(SIZE, SIZE), i)
                self.sample_data.append([c_state, c_pi.flatten(), reward])

                # and also store the left-right mirror
                c_state = np.array([np.fliplr(s) for s in c_state])
                c_pi = np.fliplr(c_pi)
                self.sample_data.append([c_state, c_pi.flatten(), reward])
        # note: the returned length counts raw positions, not the augmented copies
        return len(datas)
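Because Gomoku is symmetric under rotation and reflection, each position can yield eight equivalent training samples. A quick standalone check of the idea (a made-up 3x3 "pi" board, not the repo's data):

    import numpy as np

    pi_board = np.arange(9).reshape(3, 3)   # a made-up, asymmetric policy board
    variants = set()
    for i in range(4):
        rotated = np.rot90(pi_board, i)
        variants.add(rotated.tobytes())
        variants.add(np.fliplr(rotated).tobytes())
    print(len(variants))  # 8 distinct symmetric copies of the same position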
Appendix: the printed network structure
Play __init__ self.net.eval() is Net(
(feat): Feature(
(layer): Sequential(
(0): Conv2d(8, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
(encodes): ModuleList(
(0): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(2): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(3): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(4): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(5): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(6): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(7): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(8): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(9): ResBlockNet(
(layers): Sequential(
(0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
(3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
)
(value): Value(
(conv): Sequential(
(0): Conv2d(128, 1, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
(linear): Sequential(
(0): Linear(in_features=64, out_features=256, bias=True)
(1): ReLU()
(2): Linear(in_features=256, out_features=1, bias=True)
(3): Tanh()
)
)
(policy): Policy(
(conv): Sequential(
(0): Conv2d(128, 2, kernel_size=(1, 1), stride=(1, 1))
(1): BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
(linear): Linear(in_features=128, out_features=64, bias=True)
)
)
Details
- Temperature: during the first moves of self-play the temperature is 1 (TEMPTRIG = 8 in this code; the AlphaGo Zero paper uses the first 30 moves); afterwards, including when evaluating the model, a tiny value is used.
- The root node's priors P are mixed with Dirichlet noise so that more candidate moves get explored.
- After each move, the selected child is promoted to be the new root of the search tree.
- The input features for the net are planes containing only black stones, planes containing only white stones, and the current player to move (the printed Net above takes 8 input planes).
- The loss combines mean-squared error and cross-entropy (the formula is shown with AlphaEntropy above).
- During training, the input features are augmented (upsampled) by rotating and flipping the board.
- In MCTS, the per-move probabilities are obtained via a softmax (see decision()).
- In MCTS, mind the sign of value when updating parent nodes: it flips at each level.
- Unlike Go, deciding who won in Gomoku only requires looking at the side that made the last move.
- The neural network is not strictly decisive in Zero: even with a uniform prior, enough simulations still produce playable moves.
The chapter above walked through that AlphaZero code, but it did not obviously feel like RL. Let's now look at another AlphaZero Gomoku implementation:
https://github.com/junxiaosong/AlphaZero_Gomoku
1. Play
Initialization and play are similar to the AlphaZero code above.
The human player's get_action is trivial: it reads a move from the command line.
The MCTSPlayer obtains a probability for each action and samples a move from that distribution.
How does MCTSPlayer get those probabilities? The same way as above: it repeatedly updates the Q + U values in the tree, accumulates visit counts, and finally converts the visits into probabilities with a softmax.
As for the net, it is passed in via PolicyValueNetNumpy.policy_value_fn,
which is called as action_probs, leaf_value = self._policy(state), with the current board passed as the argument.
2. Train
It does not feel very different from the AlphaZero code above: both revolve around self-play. So where exactly does RL come in?