Temporal Neural Networks: What Sequence Models Are There?

Reposted

时光机3号 2024-01-13 13:54:17


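These notes were taken while implementing the IVQA model. The model is built on an RNN, so I use the occasion to review how Seq2Seq models are written and how PyTorch is used.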

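1. LSTM:

The structure diagram makes the mechanism of the LSTM clear.

[Figure: structure of an LSTM cell]

An LSTM is an RNN that uses gating. A vanilla RNN is structurally an ordinary MLP with an "autoregressive state output": the state it produces at one step is fed back in at the next. The gating mechanism can be viewed as a kind of attention mechanism, since the two are similar in form.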

Concretely, an LSTM has three gates: the input gate, the forget gate and the output gate. Each gate controls how much information flows through. Unlike a vanilla RNN, the network keeps two states: the hidden state $h_t$ (which is also the output) and the cell state $c_t$ (which also carries sequence information and is in effect another hidden state).

Input gate: controls what fraction of the network input $x_t$ is stored in the cell state $c_t$.
Forget gate: controls what fraction of the previous cell state $c_{t-1}$ is kept in the current cell state $c_t$.
Output gate: controls what fraction of the cell state $c_t$ contributes to the cell output $h_t$.

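The expressions can be read off the figure.
The gating rates of the forget gate and the input gate are both determined by the current input and the previous output $h_{t-1}$:
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
The candidate information to be written into the cell state is determined by the current input and the previous output $h_{t-1}$:
$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$
The updated cell state combines the previous state $c_{t-1}$ and the current candidate $\tilde{c}_t$, each weighted by its gate:
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
where $\odot$ is the Hadamard product (the element-wise product).

The gating rate of the output gate is likewise determined by the current input and the previous output $h_{t-1}$:

$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$

Finally, the hidden state (that is, the output) is obtained by activating the cell state $c_t$ and gating the result:

$h_t = o_t \odot \tanh(c_t)$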

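The above is the output of a single-layer LSTM cell. With multiple layers it may look like this:

[Figure: two stacked LSTM layers]

The green blocks in the upper and lower rows of this figure are the single-layer LSTM described above, and the input of the upper LSTM layer is the output of the lower one. That is, $x_t^{(2)} = h_t^{(1)}$, and in general the input of layer $l$ at time $t$ is $x_t^{(l)} = h_t^{(l-1)}$.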
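The stacking rule can be checked numerically. The sketch below is only an illustration: it assumes PyTorch's nn.LSTM (introduced in section 3) with arbitrary sizes. It copies the weights of a two-layer LSTM into two separate one-layer LSTMs and feeds the output of the lower one into the upper one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
stacked = nn.LSTM(4, 6, num_layers=2)   # two layers in one module
lower = nn.LSTM(4, 6)                   # stand-in for layer 1
upper = nn.LSTM(6, 6)                   # stand-in for layer 2

# Copy layer 0 / layer 1 parameters of the stacked module into the two modules.
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(lower, name + "_l0").copy_(getattr(stacked, name + "_l0"))
        getattr(upper, name + "_l0").copy_(getattr(stacked, name + "_l1"))

x = torch.randn(5, 3, 4)                # (seq, batch, feature)
out_stacked, _ = stacked(x)
out_lower, _ = lower(x)                 # h_t of layer 1 for every t
out_manual, _ = upper(out_lower)        # layer 2 takes layer 1's output as input

same = torch.allclose(out_stacked, out_manual, atol=1e-6)
```

If `same` is true, the two-layer module is doing nothing more than chaining the two single layers.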

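In practice, long experience suggests that what matters is the gating structure itself; which operator produces the gating scores, or how the context is activated, makes little difference.
That is why the RNNs in common use are the LSTM and its simplification, the GRU, which has fewer parameters and comparable performance.

Note that in an implementation you can concatenate the input with the previous hidden state and map the result with one large matrix, instead of mapping the two separately and adding the results. These are two ways of writing the fusion, "concatenate, then project" and "project, then add". They are in fact equivalent: a matrix acting on the concatenation splits into one block per input, so the number of weights is the same, and the only difference is that two separate linear layers carry two bias vectors instead of one.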
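As a quick check of how the two fusion styles relate, here is a minimal sketch. The sizes and the names fused, proj_x and proj_h are made up for illustration, and the bias is kept on only one of the two separate projections so that the comparison is exact.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, out_size = 3, 5, 4

# "Concatenate, then project": one weight of shape (out, input + hidden).
fused = nn.Linear(input_size + hidden_size, out_size)

# "Project, then add": weights of shape (out, input) and (out, hidden).
proj_x = nn.Linear(input_size, out_size, bias=False)
proj_h = nn.Linear(hidden_size, out_size)

# Split the fused weight column-wise into the two separate projections.
with torch.no_grad():
    proj_x.weight.copy_(fused.weight[:, :input_size])
    proj_h.weight.copy_(fused.weight[:, input_size:])
    proj_h.bias.copy_(fused.bias)

x = torch.randn(2, input_size)
h = torch.randn(2, hidden_size)
y_fused = fused(torch.cat((x, h), dim=1))
y_split = proj_x(x) + proj_h(h)

n_fused = sum(p.numel() for p in fused.parameters())
n_split = sum(p.numel() for m in (proj_x, proj_h) for p in m.parameters())
print(torch.allclose(y_fused, y_split, atol=1e-6), n_fused, n_split)
```

Both variants produce the same output and hold the same number of parameters, so the choice between them is mostly a matter of code style.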

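2. PyTorch implementations of some RNNs:

Of course, anyone who wants to invent a new network architecture still has to learn BPTT (backpropagation through time), but I don't need that here, so I skip it.

2.1 Rewriting LSTMCell

Below we rewrite the structure of an LSTM cell, which corresponds to one green block in the multi-layer LSTM figure above.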

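import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, cell_size, output_size):
        super(LSTMCell, self).__init__()
        # h_t = o_t * tanh(c_t), so the hidden and cell states must have the same size
        assert hidden_size == cell_size, "hidden_size must equal cell_size"
        self.hidden_size = hidden_size
        self.cell_size = cell_size
        # One independent linear map per gate; a single shared layer would
        # make the three gates identical.
        self.forget_gate = nn.Linear(input_size + hidden_size, cell_size)
        self.input_gate = nn.Linear(input_size + hidden_size, cell_size)
        self.output_gate = nn.Linear(input_size + hidden_size, cell_size)
        self.candidate = nn.Linear(input_size + hidden_size, cell_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        #self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, cell):
        combined = torch.cat((input, hidden), 1)
        f_gate = self.sigmoid(self.forget_gate(combined))
        i_gate = self.sigmoid(self.input_gate(combined))
        o_gate = self.sigmoid(self.output_gate(combined))
        z_state = self.tanh(self.candidate(combined))
        # c_t = f_t * c_{t-1} + i_t * z_t
        cell = torch.add(torch.mul(cell, f_gate), torch.mul(z_state, i_gate))
        # h_t = o_t * tanh(c_t)
        hidden = torch.mul(self.tanh(cell), o_gate)
        output = self.output(hidden)
        # output = self.softmax(output)  # PyTorch's own LSTM has no such output layer
        return output, hidden, cell

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

    def initCell(self):
        return torch.zeros(1, self.cell_size)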

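To use a multi-layer LSTM, we can write a module that wraps LSTMCell. Conceptually, this is also what PyTorch's LSTM module does.

Tip: torch.add and torch.mul in PyTorch are both element-wise operations; do not mistake them for matrix operations.
For matrix multiplication use torch.mm; for a batch of matrix multiplications use torch.bmm.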
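A small sketch of the difference between these calls, using two hand-picked 2x2 matrices so the results are easy to verify by hand:

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[10., 20.], [30., 40.]])

elementwise = torch.mul(a, b)   # Hadamard product, same as a * b
matrix = torch.mm(a, b)         # 2-D matrix product, same as a @ b

# torch.bmm multiplies a batch of matrices: (B, n, m) x (B, m, p) -> (B, n, p)
batch_a = torch.stack((a, a))
batch_b = torch.stack((b, b))
batched = torch.bmm(batch_a, batch_b)

print(elementwise)  # [[10, 40], [90, 160]]
print(matrix)       # [[70, 100], [150, 220]]
```

Each slice of `batched` equals `matrix`, because the same pair of matrices was stacked twice.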

2.2 Rewriting GRUCell

Compared with the LSTM, the GRU has only one hidden state, but it also uses a carefully designed gating scheme.

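class GRUCell(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(GRUCell, self).__init__()
        self.hidden_size = hidden_size
        # One independent linear map per gate; a single shared layer would make
        # the update gate and the reset gate identical.
        self.update_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.reset_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        #self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        z_gate = self.sigmoid(self.update_gate(combined))
        r_gate = self.sigmoid(self.reset_gate(combined))
        # The candidate state sees the old hidden state only through the reset gate.
        combined01 = torch.cat((input, torch.mul(hidden, r_gate)), 1)
        h1_state = self.tanh(self.candidate(combined01))

        # Interpolate between the old state and the candidate state.
        h_state = torch.add(torch.mul((1 - z_gate), hidden), torch.mul(h1_state, z_gate))
        output = self.output(h_state)
        #output = self.softmax(output)
        return output, h_state

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)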

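3. Using LSTM in PyTorch:

Let us work through the official PyTorch documentation. The official docs are usually thorough, but they can be dizzying to read. The LSTM module takes a three-dimensional input tensor, laid out as (seq, batch, feature) by default or as (batch, seq, feature) when batch_first=True, and automatically feeds it through step by step along the seq dimension. There is also the unwrapped LSTMCell, which has no $seq$ length dimension, so the loop over time steps has to be written by hand.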
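The relation between the two modules can be sketched as follows. This is only an illustration with arbitrary sizes: the weights of an nn.LSTM are copied into an nn.LSTMCell, and the cell is then stepped through the sequence by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 5, 3, 4, 6

lstm = nn.LSTM(input_size, hidden_size)       # consumes the whole sequence
cell = nn.LSTMCell(input_size, hidden_size)   # consumes one time step

# Share the weights so both modules compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(seq_len, batch, input_size)   # default layout: (seq, batch, feature)
out_lstm, (h_n, c_n) = lstm(x)

h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
steps = []
for t in range(seq_len):                      # the loop that nn.LSTM hides
    h, c = cell(x[t], (h, c))
    steps.append(h)
out_cell = torch.stack(steps)                 # (seq, batch, hidden)

same = torch.allclose(out_lstm, out_cell, atol=1e-6)
```

The manual loop is what makes LSTMCell convenient for autoregressive decoding, where the input of step t+1 depends on the output of step t.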

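3.1 Constructor parameters:

input_size: dimension of the input feature $x_t$
hidden_size: dimension of the hidden state $h_t$, which is also the dimension of the cell state
num_layers: number of stacked LSTM layers, default 1
bias: whether the linear maps in the module have a bias term; bias is on by default
batch_first: whether the input is laid out as $(batch, seq, feature)$ or as $(seq, batch, feature)$. The default is False, i.e. the latter
dropout: if non-zero, dropout is applied to the output of every LSTM layer except the last, so the final output is not dropped. The default is dropout=0, i.e. no dropout
bidirectional: whether the LSTM models the sequence in both directions; the default is False, i.e. unidirectional
proj_size: changes the dimension of the hidden state so that $h_t$ and $c_t$ have different dimensions. If it is set > 0, the dimension of $h_t$ changes from hidden_size to proj_size, so every hidden output goes through a projection $h_t = W_{hr} h_t$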
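The effect of these arguments on tensor shapes can be sketched as below. The sizes are arbitrary, and proj_size assumes a PyTorch version recent enough to support it.

```python
import torch
import torch.nn as nn

x = torch.randn(7, 3, 10)  # (seq=7, batch=3, input_size=10), batch_first=False

# Two stacked layers, bidirectional: D = 2 directions.
bi = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, bidirectional=True)
out_bi, (h_bi, c_bi) = bi(x)
# out_bi: (7, 3, 2 * 20); h_bi and c_bi: (2 * 2, 3, 20)

# proj_size > 0 shrinks h_t (and the output) but leaves c_t at hidden_size.
proj = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, proj_size=8)
out_p, (h_p, c_p) = proj(x)
# out_p: (7, 3, 8); h_p: (1, 3, 8); c_p: (1, 3, 20)
```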

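With a two-layer LSTM, the LSTM cells look roughly like this:

[Figure: LSTM cells in a two-layer LSTM]

For a bidirectional LSTM, the information from the two directions is combined: at every time step the forward and backward hidden states are concatenated along the feature dimension of output, while h_n and c_n keep the two directions as separate entries along their first dimension.

Note how the dropout parameter works: when there are at least 2 LSTM layers, the input of layer $l$ (for $l \ge 2$) is $h_t^{(l-1)}$ multiplied by $\delta_t^{(l-1)}$, a dropout variable that follows a Bernoulli distribution (each entry is 0 with probability dropout).

I use dropout all the time but honestly cannot recall exactly what it does; to be filled in later.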
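As a reminder of what dropout does, here is a minimal sketch using the standalone nn.Dropout module (not the dropout inside nn.LSTM); the rate p = 0.5 and the all-ones input are chosen only to make the effect visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p=p)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # each entry is zeroed with probability p, the rest are scaled by 1/(1-p)

drop.eval()
y_eval = drop(x)    # dropout is the identity in evaluation mode

kept = y_train[y_train != 0]   # surviving entries, each equal to 1 / (1 - p)
```

The 1/(1-p) scaling keeps the expected value of each activation unchanged between training and evaluation.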

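3.2 Inputs:

input: a three-dimensional tensor of shape (seqLen, batchsize, input_size) or (batchsize, seqLen, input_size), depending on whether batch_first=True is set.
Initial hidden state $h_0$: defaults to zeros if not given; otherwise its shape is $(D * num\_layers, N, H_{out})$, where $D$ is 2 for a bidirectional LSTM and 1 otherwise, $N$ is the batch size, and $H_{out} = proj\_size$ when $proj\_size > 0$, otherwise $H_{out} = hidden\_size$.
Initial cell state $c_0$: defaults to zeros if not given; otherwise its shape is $(D * num\_layers, N, H_{cell})$, with $H_{cell} = hidden\_size$.
Note that batch_first only changes the layout of input and output. The shapes of $h_0$ and $c_0$ above stay the same whether or not batch_first is set.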
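A short sketch of the state shapes when batch_first=True, with arbitrary sizes: the batch dimension moves to the front for input and output, but the initial and final states keep the layer dimension first.

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size, num_layers = 3, 5, 10, 20, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

x = torch.randn(batch, seq_len, input_size)        # batch comes first here
h0 = torch.zeros(num_layers, batch, hidden_size)   # but not in the states
c0 = torch.zeros(num_layers, batch, hidden_size)

output, (h_n, c_n) = lstm(x, (h0, c0))
# output: (batch, seq, hidden); h_n and c_n: (num_layers, batch, hidden)
```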

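3.3 Outputs:

The module returns output, (h_n, c_n): three three-dimensional tensors.

Here output holds, for every time step of the input sequence, the output of the last LSTM layer. Its shape is $(L, N, D * H_{out})$ when batch_first is False, or $(N, L, D * H_{out})$ when batch_first is True.

h_n and c_n have shapes $(D * num\_layers, N, H_{out})$ and $(D * num\_layers, N, H_{cell})$ respectively, where the output dimension $H_{out}$ depends on whether $proj\_size$ is set.

[Figure: outputs and hidden states of a multi-layer LSTM]

The green outputs in the figure correspond to output, and the red boxes mark the two hidden states h_n and c_n, which hold the final-step state of every layer (the red boxes are not quite right: the lower one does not include the cell state).

As far as I can tell, the tensors of the two directions can be taken out separately as follows: index 0 along the direction dimension is the forward direction and index 1 is the backward direction. The documentation puts it this way: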

For bidirectional LSTMs, forward and backward are directions 0 and 1 respectively. Example of splitting the output layers when batch_first=False: output.view(seq_len, batch, num_directions, hidden_size)
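The quoted recipe can be checked with a small sketch (arbitrary sizes, single layer). It also compares the split output against h_n, on the assumption that h_n[0] is the forward direction and h_n[1] the backward direction of that layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 5, 3, 4, 6
lstm = nn.LSTM(input_size, hidden_size, num_layers=1, bidirectional=True)

x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)                 # output: (seq, batch, 2 * hidden)

split = output.view(seq_len, batch, 2, hidden_size)
forward_out = split[:, :, 0, :]              # direction 0
backward_out = split[:, :, 1, :]             # direction 1

# The forward pass finishes at the last step, the backward pass at the first.
fwd_matches = torch.allclose(forward_out[-1], h_n[0], atol=1e-6)
bwd_matches = torch.allclose(backward_out[0], h_n[1], atol=1e-6)
```

So output[-1] alone is not "the final state" of a bidirectional LSTM: its backward half has seen only one time step.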

3.4 Worked example

The official example is very simple and only shows the module used on its own:

rnn = nn.LSTM(10, 20, 2)
input = torch.randn(5, 3, 10)
h0 = torch.randn(2, 3, 20)
c0 = torch.randn(2, 3, 20)
output, (hn, cn) = rnn(input, (h0, c0))

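Here we do something a little more involved, which differs from the earlier Transformer training. Each call uses Seqlen=1, and we decode repeatedly to obtain the input of the next time step, so the procedure is autoregressive. Note that if we keep the standard cross-entropy as the loss function during optimization, we do not run into non-differentiable gradients. The path from the outputs and the labels to the loss is differentiable at every step. Obtaining the next input does involve a sampling step (here an argmax), but that step only produces the next input for the forward pass and plays no part in how the loss is computed. This is different from TextGAN, so do not confuse the two.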
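To illustrate the point about differentiability, here is a minimal sketch. It is not the IVQA decoder itself: the sizes, the BOS id 0 and the names embed, cell and head are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim, hid_dim = 6, 4, 8          # illustrative sizes
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.LSTMCell(emb_dim, hid_dim)
head = nn.Linear(hid_dim, vocab_size)

target = torch.tensor([[1, 2, 3]])               # (batch=1, steps=3)
x_t = torch.zeros(1, dtype=torch.long)           # assumed BOS id 0
h = torch.zeros(1, hid_dim)
c = torch.zeros(1, hid_dim)

loss = torch.zeros(())
for t in range(target.shape[1]):
    h, c = cell(embed(x_t), (h, c))
    logits = head(h)                             # raw scores, no softmax
    loss = loss + F.cross_entropy(logits, target[:, t])
    x_t = logits.argmax(dim=1)                   # integer ids: carries no gradient

loss.backward()                                  # works: the loss path is differentiable
```

The argmax result is an integer tensor that does not require grad, yet all parameters still receive gradients through the logits and the recurrent state.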

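Below we rewrite the model part of an IVQA model mentioned earlier in the VQG notes; data input and output and the overall training pipeline are a separate matter.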

import torch
import torch.nn as nn

import numpy as np
import random

class FFN(nn.Module):
    def __init__(self,input_dim,hid_dim,activator):
        super(FFN, self).__init__()
        self.Wih=nn.Linear(input_dim,hid_dim)
        self.activate=activator

    def forward(self, src1):
        return self.activate(self.Wih(src1))

class FFN_2(nn.Module):
    def __init__(self,input_dim1,input_dim2,hid_dim,activator):
        super(FFN_2, self).__init__()
        self.Wih=nn.Linear(input_dim1,hid_dim)
        self.Wah=nn.Linear(input_dim2,hid_dim)
        self.activate=activator

    def forward(self, src1,src2):
        return self.activate(torch.add(self.Wih(src1),self.Wah(src2)))
    # Just a single-layer perceptron; the two inputs are not concatenated here but fused by projecting each one and then adding.

class MLBpool(nn.Module):

    def __init__(self,input_dim1,input_dim2,hid_dim,dropout=0):
        super(MLBpool, self).__init__()
        self.Wleft=nn.Linear(input_dim1,hid_dim)
        self.Wright=nn.Linear(input_dim2,hid_dim)
        self.Uout=nn.Linear(hid_dim,hid_dim)

        self.activateL=nn.Tanh()
        self.activateR=nn.Tanh()
        self.activateOut=nn.Tanh()

    def forward(self, src1,src2):
        #src1:Batch,K,inputdim1
        #src2:Batch,1,inputdim2
        K=src1.shape[1]
        src2=src2.repeat(1,K,1)
        inner=torch.mul(self.activateL(self.Wleft(src1)),self.activateR(self.Wright(src2)))
        #torch.mul is the Hadamard (element-wise) product
        out=self.activateOut(self.Uout(inner))
        #the output dimension is arbitrary; here output:(Batch,K,output_dim=hid_dim)
        return out

class CoAttLayer(nn.Module):
    def __init__(self,emb_dim,hid_dim,visual_dim,out_dim,dropout=0):
        super(CoAttLayer, self).__init__()
        self.TContext=FFN_2(hid_dim,emb_dim,hid_dim,nn.ReLU()) #produces Zt
        self.MlbVT=MLBpool(visual_dim,hid_dim,hid_dim)     #produces fij
        self.actAtt=nn.Softmax(dim=1) 
        self.Toscore=nn.Linear(hid_dim,1) #fij to alpha ij
        self.MlbCZ=MLBpool(visual_dim,hid_dim,hid_dim)    #bilinear fusion of the two mixed vectors

        #required output size: batch*hid_dim; out_dim is no longer used

    def forward(self, QAhidden,answer,visualF):
        Zt=self.TContext(QAhidden,answer)
        #Zt:Batch,Seqlen=1,hid_dim
        #Zt=torch.unsqueeze(Zt,1)
        #Zt:Batch,1,hid_dim
        fk=self.MlbVT(visualF,Zt)
        #visualF:Batch,K,Feature_len
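        #fk:Batch,K,hid_dim
        #pT*fij is deliberately not used here, otherwise variable-length sequences could not be handled.
        Attscore=self.actAtt(self.Toscore(fk))
        # Attscore:Batch,K,1

        Context=torch.sum(torch.mul(Attscore,visualF),dim=1)
        #Context:Batch,Feature_len (unsqueezed to Batch,1,Feature_len on the next line)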
        Context=torch.unsqueeze(Context,dim=1)
        gt=self.MlbCZ(Context,Zt)
        #gt:Batch,1,hiddim
        return gt


class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim,visual_shape,dropout=0,n_layers=1,maxlen=15):
        super(Decoder, self).__init__()
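        #output_dim: vocabulary size; emb_dim: word embedding size; hid_dim: LSTM hidden size; visual_shape=(K, feature length)
        self.out_dim=output_dim #vocabulary size
        self.hid_dim = hid_dim  #LSTM hidden dimension
        self.embedding = nn.Embedding(output_dim, emb_dim) #embedding matrix: output_dim is the vocabulary size, outputs emb_dim
        # for the rest of the network emb_dim plays the role of input_dim
        # visual_shape is 36*2048, i.e. the features of one image
        self.n_layers = n_layers #number of LSTM layers
        self.Init_glimpse=FFN_2(emb_dim,emb_dim,hid_dim,nn.Tanh()) #initializes the LSTM hidden_state
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout,batch_first=True) #captures sequence information in the decoder
        self.coAtt=CoAttLayer(emb_dim,hid_dim,visual_shape[1],visual_shape[0]) #takes the LSTM hidden state, outputs gt
        #predictor takes g_t and outputs word probabilities (softmax over dim=1, the vocabulary axis)
        #NOTE: nn.CrossEntropyLoss expects raw logits, so training on these probabilities applies softmax twice
        self.predictor=FFN(hid_dim,output_dim,nn.Softmax(dim=1))
        self.maxlen=maxlen #maximum decoding length of a sentence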
        

    def forward(self, visualF,conceptI,answer,trg,teacher_forcing_ratio=0): 
        #initialization first
        #teacher forcing is not considered for now
        # visualF:(batch,K=36,Feature_dim=2048)
        # conceptI:(batch,), one concept per sample for now; there should really be several, to be revisited
        # answer:(Batch,), each answer matches its image, but the samples in a batch are independent.

        conceptI=self.embedding(conceptI)
        answer=self.embedding(answer)
        #conceptI and answer: (batch,embedding_dim)

        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        # trg: (maxlen,batch)
        outputs = torch.zeros(trg_len, batch_size, self.out_dim)
        #outputs: (maxlen,batch,vocabulary size)
        tops=torch.zeros(trg_len,batch_size)

        # build the initial input x_0 for the batch; [BOS] could be token 0, plus [EOS] and padding, probably no unknown token
        #x_t=torch.zeros(batch_size).long()
        # a random first token is used here; sizes come from the module and trg rather than from global variables
        x_t=torch.LongTensor(np.random.randint(low=0,high=self.out_dim-1,size=(batch_size,)))
        #x_t:(Batch,)
        hidden=self.Init_glimpse(conceptI,answer)
        # hidden: (Batch,hidden_dim); Seqlen=1 here, so watch the dimension handling
        hidden=torch.unsqueeze(hidden,0)
        # hidden:(1,Batch,hidden_dim)
        cell=torch.zeros(1,batch_size,self.hid_dim)
        # cell:(1,Batch,hidden_dim)
        conceptI=torch.unsqueeze(conceptI,1)
        answer=torch.unsqueeze(answer,1)
        #conceptI and answer: (batch,1,embedding_dim)

        for t in range(trg_len):
            x_t=torch.unsqueeze(x_t,1)
            x_t=self.embedding(x_t)
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, (hidden, cell) = self.rnn(x_t, (hidden, cell))
            
            #output: (Batch,Seqlen=1,D*H_out), here (batch,1,hid_dim)
            g_t=self.coAtt(output,answer,visualF)
            #(Batch,1,hidden); the size is not forced, hidden is used everywhere for simplicity
            g_t=torch.squeeze(g_t,dim=1)

            output=self.predictor(g_t)
            #output:(Batch,vocabulary size)

            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            #top1 = (batch,)
            tops[t]=top1 #used for inference only, not during training
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            x_t = trg[t] if teacher_force else top1.long()

        return outputs,tops

def train(model, visualFeatures,I_s, answer,target,num_epochs=3000):

    optim = torch.optim.Adamax(model.parameters())
    TRG_PAD_IDX=0
    criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX) #ignore padding labels
    print(target.permute(1,0))

    for epoch in range(num_epochs):
        epoch_loss = 0

        result,tops=model(visualF=visualFeatures,conceptI=I_s,answer=answer,trg=target)
        output_dim = result.shape[-1]

        result=result.view(-1, output_dim)
        trg=target.view(-1)

        loss = criterion(result, trg)
        loss.backward()

        optim.step()
        optim.zero_grad()

        #total_loss += loss.data[0] * v.size(0) # loss is already a scalar, so loss.item() is used below
        epoch_loss += loss.item()


        if epoch % 500==0:
            print("epoch loss:{}".format(epoch_loss))
            print(tops.permute(1,0))

if __name__ == '__main__':
    wordnum=10
    embdim=100
    hiddim=40
    visual_=(5,50)
    maxlen=14

    model=Decoder(output_dim=wordnum,emb_dim=embdim,hid_dim=hiddim,visual_shape=visual_)
    batchsize=2
    visualFeatures=torch.randn([batchsize,visual_[0],visual_[1]])
    I_s=torch.LongTensor(np.random.randint(low=0,high=wordnum-1,size=(batchsize,)))
    answer=torch.LongTensor(np.random.randint(low=0,high=wordnum-1,size=(batchsize,)))

    target=torch.LongTensor(np.random.randint(low=1,high=wordnum,size=(maxlen,batchsize)))
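    #reminder: for cross-entropy the target is a matrix of class indices; for label smoothing, write a custom cross-entropy or use the label_smoothing argument of nn.CrossEntropyLoss in newer PyTorch versions
    
    train(model,visualFeatures,I_s,answer,target)
    #result=model(visualF=visualFeatures,conceptI=I_s,answer=answer,trg=target)
    #visualFeature:(Batch,K,feature_dim)
    #concept:(batch)
    #answer:(batch)
    #target:(maxlen,batchsize)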

Let's look at the result on a tiny sample:

target:tensor([[3, 5, 3, 6, 7, 5, 9, 4, 3, 8, 1, 2, 1, 2],
        [5, 5, 6, 6, 2, 6, 5, 9, 7, 4, 3, 2, 2, 5]])
After 3000 iterations:
epoch loss:1.6827417612075806
tensor([[3., 5., 5., 5., 5., 5., 9., 4., 3., 8., 1., 2., 1., 2.],
        [5., 5., 6., 6., 2., 6., 5., 9., 7., 6., 2., 2., 5., 5.]])

This fits most of the 2*14 toy sample: 22 of the 28 tokens match.

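3.5 Other operations

Transposing tensor dimensions:
the tensor.permute() method. For example, if a is a four-dimensional tensor, then
a.permute(0,3,1,2) moves the dimensions from the order (0, 1, 2, 3) to the order (0, 3, 1, 2); that is, with b = a.permute(0,3,1,2), the element $a_{i,j,k,l}$ becomes $b_{i,l,j,k}$
Repeating tensor elements:
the tensor.repeat() method. For example,
a.repeat(1,k,1) repeats the tensor k times along its second dimension: what was […,[[1,2,3,4,5]],…] becomes […,[[1,2,3,4,5],…k copies…,[1,2,3,4,5]],…]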
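Both operations can be tried on small tensors. The sketch below uses arbitrary shapes and k = 3:

```python
import torch

a = torch.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
b = a.permute(0, 3, 1, 2)        # shape (2, 5, 3, 4); b[i, l, j, k] == a[i, j, k, l]

row = torch.tensor([[[1, 2, 3, 4, 5]]])   # shape (1, 1, 5)
tiled = row.repeat(1, 3, 1)               # shape (1, 3, 5): the row is copied 3 times

print(b.shape, tiled.shape)
```

Note that permute returns a view with rearranged strides, while repeat allocates a new tensor and copies the data.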