The sequential nature of text data makes the recurrent iteration of an RNN the obvious choice. However, if we view the encoded text (Batch × sequence × embedding) as an image, then extracting information through convolution is just as natural. This is the idea behind the TextCNN algorithm.
TextCNN is an efficient convolutional algorithm for text. It captures the local structure between neighboring tokens, and the nature of convolution also makes it easy to parallelize. On text classification tasks its performance is comparable to TextRNN, which is why it is widely used.
So how can text encoding be reasonably viewed from the perspective of image encoding? There are two viewpoints:
Viewpoint 1: a strip-shaped "image" of width 1, where the embedding dimension plays the role of the image channels. In this case a 1D convolution layer can be applied directly to extract information.
Viewpoint 2: an image with a single channel, whose height and width correspond to sequence and embedding respectively. In this case a 2D convolution is needed to extract information. The sketch below illustrates both viewpoints.
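To make the two viewpoints concrete, here is a minimal shape-tracing sketch; the batch size, the sequence length of 16 and the kernel size of 3 are illustrative assumptions, not values from the models below:
import torch
import torch.nn as nn

x = torch.randn(2, 16, 100)                       # (batch, sequence, embedding), assumed dummy data
# Viewpoint 1: the embedding dimension acts as the channels of a 1D convolution.
conv1d = nn.Conv1d(in_channels=100, out_channels=8, kernel_size=3)
print(conv1d(x.permute(0, 2, 1)).shape)           # torch.Size([2, 8, 14])
# Viewpoint 2: a single-channel image; the 2D kernel spans the full embedding width.
conv2d = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=(3, 100))
print(conv2d(x.unsqueeze(1)).shape)               # torch.Size([2, 8, 14, 1])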
Regardless of which viewpoint we take, the convolution kernels in TextCNN scan along the token direction of the text (i.e. the sequence dimension). Each group of kernels has different parameters (number of kernels, window size, stride, etc.), so the resulting feature maps differ in size. On top of this, a 1D max pooling over the sequence dimension reduces each kernel's feature map to a single value, so the outputs of all kernels can be concatenated and handed to a final fully connected layer for classification.
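The pooling and merging step can be sketched on its own; the sequence length of 16 and the kernel sizes 2, 3 and 4 below are illustrative assumptions:
import torch
import torch.nn.functional as F

# Feature maps of different lengths: 16 - k + 1 for kernel sizes k = 2, 3, 4.
feats = [torch.randn(2, 8, 16 - k + 1) for k in (2, 3, 4)]          # lengths 15, 14, 13
pooled = [F.max_pool1d(f, f.size(-1)).squeeze(-1) for f in feats]   # each (2, 8): one value per kernel
merged = torch.cat(pooled, dim=-1)                                  # (2, 24), ready for the FC classifier
print(merged.shape)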
From the above, the overall TextCNN architecture is (activation functions, dropout and BN are omitted):
Embedding ——> 1D-CNN/2D-CNN ——> 1D-MaxPooling ——> Channel Merge ——> FC ——> classification result
The network architectures based on 1D and 2D convolutions are given below; the dimension transformations between layers are noted in the code comments:
- 1D-CNN example
import torch
import torch.nn as nn
import torch.nn.functional as F

Config = {"vob_size": 5000,        # vocabulary size
          "ebd_size": 100,         # word-embedding dimension
          "conv1D_out": [8, 8, 8], # output-channel list of the 1D-conv layers
          "conv1D_ker": [2, 3, 4], # kernel-size list of the 1D-conv layers
          "fc_cla": 4,             # number of output classes of the FC layer
          "dropout": 0.5           # dropout probability
          }


class Text1DCNN(nn.Module):
    def __init__(self):
        super(Text1DCNN, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=Config['vob_size'], embedding_dim=Config['ebd_size'])
        self.conv1D = nn.ModuleList([nn.Conv1d(in_channels=Config['ebd_size'], out_channels=out, kernel_size=ker)
                                     for out, ker in zip(Config['conv1D_out'], Config['conv1D_ker'])])
        self.dropout = nn.Dropout(p=Config['dropout'])
        self.fc = nn.Linear(sum(Config['conv1D_out']), Config["fc_cla"])

    def forward(self, x):                                   # x: (batch, sequence)
        x = self.embedding(x)                               # x: (batch, sequence, embed)
        x = x.permute(0, 2, 1)                              # x: (batch, embed, sequence); embed acts as in_channel so 1D convolution can be applied
        x = [F.relu(conv1D(x)) for conv1D in self.conv1D]   # [(batch, out_channel, L_out)]
        x = [F.max_pool1d(i, i.size(-1)) for i in x]        # [(batch, out_channel, 1)]; max pooling over the last dimension
        x = [torch.squeeze(i, dim=-1) for i in x]           # [(batch, out_channel)]; squeeze the pooled dimension
        x = torch.cat(x, dim=-1)                            # (batch, total_out_channel); concatenate along the channel dimension
        x = self.dropout(x)
        out = self.fc(x)
        return out
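A minimal usage sketch of Text1DCNN, assuming a dummy batch of 4 padded sequences of 20 token indices:
model = Text1DCNN()
tokens = torch.randint(0, Config['vob_size'], (4, 20))   # (batch, sequence), assumed dummy token indices
logits = model(tokens)                                    # (4, 4): one score per class
print(logits.shape)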
- 2D-CNN example
import torch
import torch.nn as nn
import torch.nn.functional as F

Config = {"vob_size": 5000,        # vocabulary size
          "ebd_size": 100,         # word-embedding dimension
          "conv1D_out": [8, 8, 8], # output-channel list of the conv layers
          "conv1D_ker": [2, 3, 4], # kernel-size (height) list of the conv layers
          "fc_cla": 4,             # number of output classes of the FC layer
          "dropout": 0.5           # dropout probability
          }


class Text2DCNN(nn.Module):
    def __init__(self):
        super(Text2DCNN, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=Config['vob_size'], embedding_dim=Config['ebd_size'])
        self.conv2D = nn.ModuleList([nn.Conv2d(in_channels=1, out_channels=out, kernel_size=(ker, Config['ebd_size']))
                                     for out, ker in zip(Config['conv1D_out'], Config['conv1D_ker'])])
        self.dropout = nn.Dropout(p=Config['dropout'])
        self.fc = nn.Linear(sum(Config['conv1D_out']), Config["fc_cla"])

    def forward(self, x):                                   # x: (batch, sequence)
        x = self.embedding(x)                               # x: (batch, sequence, embed)
        x = torch.unsqueeze(x, dim=1)                       # x: (batch, 1, sequence, embed); add in_channel dimension, embed is treated as width
        x = [F.relu(conv2D(x)) for conv2D in self.conv2D]   # [(batch, out_channel, height_out, 1)]
        x = [torch.squeeze(i, dim=-1) for i in x]           # [(batch, out_channel, height_out)]; squeeze the width dimension
        x = [F.max_pool1d(i, i.size(-1)) for i in x]        # [(batch, out_channel, 1)]; 1D max pooling over height_out
        x = [torch.squeeze(i, dim=-1) for i in x]           # [(batch, out_channel)]; squeeze the pooled dimension
        x = torch.cat(x, dim=-1)                            # (batch, total_out_channel); concatenate along the channel dimension
        x = self.dropout(x)
        out = self.fc(x)
        return out
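The 2D version can be exercised in the same way (same assumed dummy data as above):
model = Text2DCNN()
tokens = torch.randint(0, Config['vob_size'], (4, 20))   # (batch, sequence), assumed dummy token indices
logits = model(tokens)                                    # (4, 4): one score per class
print(logits.shape)
Because each 2D kernel spans the full embedding width, the two variants perform essentially the same computation; choosing between them is largely a matter of implementation preference.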