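SSD stands for Single Shot MultiBox Detector, a one-stage object detection algorithm. Unlike two-stage detectors, SSD eliminates proposal generation entirely and folds all computation into a single network. It also outputs bounding boxes from feature maps at several scales, which lets it cope with the wide range of object sizes found in detection tasks.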

Compared with YOLO v1, which is also a one-stage detector, SSD predicts bounding boxes and classes directly with convolutional layers, instead of predicting after fully connected layers as YOLO v1 does.






Figure: comparison of the SSD and YOLO v1 network architectures (from the original SSD paper).

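In summary:

  1. The core of SSD is to place a fixed number of prior boxes (also called default boxes or anchors) on feature maps of different scales, and to apply small convolutional kernels that predict category scores and box offsets for each prior box. Earlier feature maps (smaller receptive field) detect small objects; later feature maps (larger receptive field) detect large objects.
  2. For higher accuracy, predictions are made from feature maps of different scales, and the prior boxes on each feature map differ in scale and aspect ratio.
  3. The model is end-to-end, which makes training more straightforward.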

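Network model:

SSD is a feed-forward CNN. For every anchor it outputs a bounding box prediction and category scores, giving a fixed-size set of outputs; non-maximum suppression (NMS) is then applied to obtain the final detections.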
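
The NMS step can be illustrated with a minimal greedy sketch in plain Python. The function names `iou` and `nms`, the corner-format boxes and the sample data are assumptions made for this illustration; the 0.45 threshold mirrors the value passed to Detect in the network code of this article, but this is not the repository's implementation.

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Only boxes that overlap the kept box by at most iou_thresh survive.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


# Hypothetical detections for one class: two near-duplicates and one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap at all and is kept.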

Base Model

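The SSD network can be divided into three parts. The first part is a standard image classification network, truncated before the fully connected layers, which the authors call the base network. The paper uses VGG16 as the base network.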

Multi-scale feature maps for detection

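In the middle part of the network, after VGG, the authors add several convolutional layers that yield feature maps of progressively smaller size, so the network can predict at multiple scales. The feature maps used for detection in the paper are conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2.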
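
As a rough check of how the feature maps shrink, their spatial sizes can be derived with ordinary convolution arithmetic. The helper `out_size` is an assumption written for this sketch, and the kernel, stride and padding values are read off the SSD300 layer definitions in the code of this article (300x300 input).

```python
import math


def out_size(n, kernel, stride=1, padding=0, ceil_mode=False):
    """Output size of a conv or pooling layer along one spatial dimension."""
    x = (n + 2 * padding - kernel) / stride
    return (math.ceil(x) if ceil_mode else math.floor(x)) + 1


n = 300
n = out_size(n, 2, 2)                  # pool1: 300 -> 150
n = out_size(n, 2, 2)                  # pool2: 150 -> 75
n = out_size(n, 2, 2, ceil_mode=True)  # pool3 ('C', ceil_mode): 75 -> 38
conv4_3 = n
n = out_size(n, 2, 2)                  # pool4: 38 -> 19
conv7 = n                              # pool5, conv6 and conv7 keep the size
conv8_2 = out_size(conv7, 3, 2, 1)     # 3x3, stride 2, padding 1
conv9_2 = out_size(conv8_2, 3, 2, 1)   # 3x3, stride 2, padding 1
conv10_2 = out_size(conv9_2, 3)        # 3x3, no padding
conv11_2 = out_size(conv10_2, 3)       # 3x3, no padding

sizes = [conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2]
print(sizes)
```

The 3x3 convolutions with padding 1 inside VGG and the 1x1 convolutions in the extra layers do not change the spatial size, which is why they are omitted from the arithmetic.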

Convolutional predictors for detection

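In the last part of the network, convolutional layers are applied to the detection feature maps to produce the outputs. For every prior box the network outputs its own set of values: c class confidences and 4 bounding box offsets. For a feature map of size m*n*p, one 3*3*p convolutional kernel produces a single output value at each location, which serves either as the score for one category or as one shape offset. With k prior boxes per cell, each cell has (c+4)*k predictions, so (c+4)*k kernels are needed, and the whole m*n feature map yields (c+4)*k*m*n outputs.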
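
To make the (c+4)*k count concrete, the following sketch tallies the predictions for SSD300. The feature map sizes (38, 19, 10, 5, 3, 1) and the per-location box counts (4, 6, 6, 6, 4, 4) are the SSD300 settings; the variable names and c = 21 (the VOC configuration, background included) are choices made for this example.

```python
feature_map_sizes = [38, 19, 10, 5, 3, 1]  # conv4_3 ... conv11_2 (SSD300)
boxes_per_location = [4, 6, 6, 6, 4, 4]    # k for each feature map
num_classes = 21                           # c, VOC classes including background

num_priors = 0
num_filters = []
for size, k in zip(feature_map_sizes, boxes_per_location):
    # Every cell of a size x size feature map carries k prior boxes.
    num_priors += size * size * k
    # (c + 4) * k filters per feature map: k * c for conf and k * 4 for loc.
    num_filters.append((num_classes + 4) * k)

total_outputs = num_priors * (num_classes + 4)
print(num_priors, num_filters, total_outputs)
```

The total of 8732 prior boxes is the same number that appears in the comments of the loss code in this article.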

The code that builds the three parts is shown below (PyTorch version).

The VGG-based base network:


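import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
# PriorBox, L2Norm and Detect, plus the voc/coco configs, come from the
# ssd.pytorch repository (assumed here to live in its layers and data packages).
from layers import *
from data import voc, coco


def vgg(cfg, i, batch_norm=False):
    layers = []
    in_channels = i

    # For SSD300, cfg is [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
    #                     512, 512, 512] and i = 3 (RGB input).
    # 'M' is a 2x2 max pool; 'C' is the same pool with ceil_mode=True.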
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        elif v == 'C':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            if batch_norm:
                layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
            else:
                layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v
    pool5 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
    conv6 = nn.Conv2d(512, 1024, kernel_size=3, padding=6, dilation=6)
    conv7 = nn.Conv2d(1024, 1024, kernel_size=1)
    layers += [pool5, conv6,
               nn.ReLU(inplace=True), conv7, nn.ReLU(inplace=True)]
    return layers


The extra convolutional layers:


def add_extras(cfg, i, batch_norm=False):
    # Extra layers added to VGG for feature scaling

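    # For SSD300, cfg is [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256] and i = 1024.
    # 'S' marks a stride-2 conv whose output channels are the next cfg entry;
    # flag alternates the kernel size between 1 and 3.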
    layers = []
    in_channels = i
    flag = False
    for k, v in enumerate(cfg):
        if in_channels != 'S':
            if v == 'S':
                layers += [nn.Conv2d(in_channels, cfg[k + 1],
                           kernel_size=(1, 3)[flag], stride=2, padding=1)]
            else:
                layers += [nn.Conv2d(in_channels, v, kernel_size=(1, 3)[flag])]
            flag = not flag
        in_channels = v
    return layers


After the extra convolutional layers are added, the feature maps at the different scales are passed through the convolutional predictors, which output the detection results.


def multibox(vgg, extra_layers, cfg, num_classes):

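    # Inputs: the vgg layers and the extra layers built above.
    # For SSD300, cfg is [4, 6, 6, 6, 4, 4] (prior boxes per feature map location).
    loc_layers = []
    conf_layers = []
    vgg_source = [21, -2]  # vgg[21] is the conv4_3 conv layer, vgg[-2] is conv7

    # loc_layers: the conv layers that produce the localization values for
    # conv4_3, conv7, conv8_2, conv9_2, conv10_2 and conv11_2.
    # conf_layers: the conv layers that produce the confidence values for the same feature maps.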
    for k, v in enumerate(vgg_source):
        loc_layers += [nn.Conv2d(vgg[v].out_channels,
                                 cfg[k] * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(vgg[v].out_channels,
                        cfg[k] * num_classes, kernel_size=3, padding=1)]
    for k, v in enumerate(extra_layers[1::2], 2):
        loc_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                 * 4, kernel_size=3, padding=1)]
        conf_layers += [nn.Conv2d(v.out_channels, cfg[k]
                                  * num_classes, kernel_size=3, padding=1)]
    return vgg, extra_layers, (loc_layers, conf_layers)


The complete network is:


class SSD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base VGG network followed by the
    added multibox conv layers.  Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.

    Args:
        phase: (string) Can be "test" or "train"
        size: input image size
        base: VGG16 layers for input, size of either 300 or 500
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, size, base, extras, head, num_classes):
        super(SSD, self).__init__()
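        self.phase = phase  # "train" or "test"
        self.num_classes = num_classes  # number of object classes
        # Configuration: voc has 21 classes and coco has 201 in this code base,
        # so num_classes == 21 selects the voc config.
        self.cfg = (coco, voc)[num_classes == 21]
        self.priorbox = PriorBox(self.cfg)  # prior boxes are generated from the config
        # Prior boxes are constants, so no gradient is tracked for them
        # (replaces the removed Variable(..., volatile=True) idiom).
        with torch.no_grad():
            self.priors = self.priorbox.forward()
        self.size = size  # input image size; this code expects 300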
        # SSD network
        self.vgg = nn.ModuleList(base)  # the base VGG network
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm = L2Norm(512, 20)
        self.extras = nn.ModuleList(extras)  # the extra layers added in the paper

        # head holds the conv layers that output localization and confidence
        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])

        # Non-maximum suppression is implemented in layers/functions/detection.py
        if phase == 'test':
            self.softmax = nn.Softmax(dim=-1)
            self.detect = Detect(num_classes, 0, 200, 0.01, 0.45)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.

        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].

        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]

            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        sources = list()  # the six feature maps used for detection:
                          # conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2
        loc = list()
        conf = list()

        # apply vgg up to conv4_3 relu
        # (vgg[0] through vgg[22]; vgg[22] is the ReLU after conv4_3)
        for k in range(23):
            x = self.vgg[k](x)

        s = self.L2Norm(x)
        sources.append(s)

        # apply vgg up to fc7
        # i.e. the rest of VGG plus the layers appended to it: conv6 and conv7 in the paper
        for k in range(23, len(self.vgg)):
            x = self.vgg[k](x)
        # the conv7 output is stored in sources as the second detection feature map
        sources.append(x)

        # apply extra layers and cache source layer outputs
        # The Extra Feature Layers of the paper: conv8_1, conv8_2,
        # conv9_1, conv9_2, conv10_1, conv10_2, conv11_1, conv11_2
        for k, v in enumerate(self.extras):
            x = F.relu(v(x), inplace=True)
            if k % 2 == 1:
                # every second layer (conv8_2, conv9_2, conv10_2, conv11_2)
                # is appended to sources as a detection feature map
                sources.append(x)

        # apply multibox head to source layers

        # For each feature map in sources, append its localization and
        # confidence predictions to the loc and conf lists
        
        for (x, l, c) in zip(sources, self.loc, self.conf):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())

        # Flatten each prediction per image and concatenate across feature maps
        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

        
        # In test mode, NMS is required.
        """At test time, Detect is the final layer of SSD.  Decode location preds,
        apply non-maximum suppression to location predictions based on conf
        scores and threshold to a top_k number of output predictions for both
        confidence score and locations.
        """
        # Detect was initialized above as
        # self.detect = Detect(num_classes, 0, 200, 0.01, 0.45), where
        # num_classes: number of classes
        # 0: background label
        # 200: top-k; only the top-k predicted boxes are kept
        # 0.01: confidence threshold
        # 0.45: IoU threshold used by NMS

        # The forward method of Detect is:

        # def forward(self, loc_data, conf_data, prior_data):
        """
        Args:
            loc_data: (tensor) Loc preds from loc layers
                Shape: [batch,num_priors*4]
            conf_data: (tensor) Shape: Conf preds from conf layers
                Shape: [batch*num_priors,num_classes]
            prior_data: (tensor) Prior boxes and variances from priorbox layers
                Shape: [1,num_priors,4]
        """
        if self.phase == "test":
            output = self.detect(
                loc.view(loc.size(0), -1, 4),                   # loc preds
                self.softmax(conf.view(conf.size(0), -1,
                             self.num_classes)),                # conf preds
                self.priors.type(type(x.data))                  # default boxes
            )


        else:
            output = (
                loc.view(loc.size(0), -1, 4),  # reshape to [batch, num_priors, 4]
                conf.view(conf.size(0), -1, self.num_classes),  # reshape to [batch, num_priors, num_classes]
                self.priors
            )
        return output

    def load_weights(self, base_file):
        other, ext = os.path.splitext(base_file)
        if ext in ('.pkl', '.pth'):  # the original `ext == '.pkl' or '.pth'` was always true
            print('Loading weights into state dict...')
            self.load_state_dict(torch.load(base_file,
                                 map_location=lambda storage, loc: storage))
            print('Finished!')
        else:
            print('Sorry only .pth and .pkl files supported.')


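The code above builds the SSD network. Here the input image size is 300 (the SSD300 configuration from the paper), so the final network is built as follows: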


base = {
    '300': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'C', 512, 512, 512, 'M',
            512, 512, 512],
    '512': [],
}
extras = {
    '300': [256, 'S', 512, 128, 'S', 256, 128, 256, 128, 256],
    '512': [],
}
mbox = {
    '300': [4, 6, 6, 6, 4, 4],  # number of boxes per feature map location
    '512': [],
}


def build_ssd(phase, size=300, num_classes=21):
    if phase != "test" and phase != "train":
        print("ERROR: Phase: " + phase + " not recognized")
        return
    if size != 300:
        print("ERROR: You specified size " + repr(size) + ". However, " +
              "currently only SSD300 (size=300) is supported!")
        return
    base_, extras_, head_ = multibox(vgg(base[str(size)], 3),
                                     add_extras(extras[str(size)], 1024),
                                     mbox[str(size)], num_classes)
    return SSD(phase, size, base_, extras_, head_, num_classes)


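Training starts with prior box matching: deciding which prior box each ground truth box is matched to, so that the bounding box predicted from that prior box is responsible for it.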

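In practice, matching is performed after every prior box has already produced its predictions (localization prediction and category confidence). In the ssd.pytorch code, both prior box matching and hard negative mining are carried out inside MultiBoxLoss.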

The paper puts it this way: "...ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs." Only once the prior boxes have been matched can the loss function and backpropagation be applied end-to-end. In addition, "training also involves choosing the set of default boxes and scales for detection as well as the hard negative mining and data augmentation strategies."

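Taking these one at a time, start with the matching strategy, which decides which prior boxes are matched to a ground truth box for training. Each ground truth box is first matched to the default box with which it has the highest IoU (the best Jaccard overlap). Because there are very few ground truth boxes, this alone would leave positive and negative samples badly imbalanced, so SSD additionally treats any still-unmatched prior box as matched if its IoU with some ground truth box exceeds a threshold (0.5 by default). The authors argue that this simplifies learning: the network can predict high scores for several overlapping prior boxes rather than having to pick only the one with the highest overlap.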
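
The two matching rules can be sketched in plain Python as follows. The names `iou` and `match_priors`, the corner-format boxes and the sample data are assumptions made for this illustration; the repository's own `match` function works on tensors and also encodes the targets, which this sketch leaves out.

```python
def iou(a, b):
    """IoU (Jaccard overlap) of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_priors(truths, priors, threshold=0.5):
    """Return, for every prior, the index of its matched ground truth or -1."""
    overlaps = [[iou(t, p) for p in priors] for t in truths]
    matched = [-1] * len(priors)
    # Threshold rule: a prior is matched to its best ground truth if IoU > threshold.
    for j in range(len(priors)):
        best_t = max(range(len(truths)), key=lambda i: overlaps[i][j])
        if overlaps[best_t][j] > threshold:
            matched[j] = best_t
    # Best-overlap rule: every ground truth claims its best prior, whatever the IoU.
    for i in range(len(truths)):
        best_p = max(range(len(priors)), key=lambda j: overlaps[i][j])
        matched[best_p] = i
    return matched


# Hypothetical data: two ground truth boxes and four prior boxes.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
priors = [(0, 0, 10, 10), (1, 1, 11, 11), (21, 21, 35, 35), (50, 50, 60, 60)]
matched = match_priors(truths, priors)
print(matched)
```

Here the second prior is matched only through the threshold rule (IoU of about 0.68), the third only through the best-overlap rule (IoU of about 0.38, below the threshold), and the fourth stays a negative.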

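With matching done, the SSD loss function can be written down.

Let $x_{ij}^{p} \in \{1, 0\}$ be an indicator of whether the i-th prior box is matched to the j-th ground truth box with class label p. By the strategy above, every ground truth box has at least one matching prior box, so $\sum_i x_{ij}^{p} \ge 1$. The overall loss is a weighted sum of the localization loss and the confidence loss:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where N is the number of matched prior boxes, i.e. the number of positive samples.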

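The localization loss $L_{loc}$ is a Smooth L1 loss between the predicted box $l$ and the ground truth box $g$. What is regressed is still the offset of the bounding box, i.e. bounding box regression. Its concrete form is:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right)$$

Here $i \in Pos$ ranges over the positive prior boxes, with the i-th one matched to the j-th ground truth box (the indicator $x_{ij}^{p}$ enforces the matching), and $l_i^{m}$ is the location parameter (cx, cy, w, h) predicted for the i-th prior box. Note that the target inside the Smooth L1 term is $\hat{g}_j^{m}$, the offset of the ground truth box relative to the default box $d$:

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$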
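
The offset of a ground truth box relative to its default box can be sketched in plain Python. The function names `encode` and `decode` and the sample boxes are assumptions for this illustration, and the variance scaling with (0.1, 0.2) is assumed to match the `variance` entry of the configuration used by the loss code; with variances of (1, 1) the encoding reduces to the plain offsets regressed by the localization loss.

```python
import math


def encode(gt, prior, variances=(0.1, 0.2)):
    """gt is (xmin, ymin, xmax, ymax); prior is (cx, cy, w, h)."""
    g_cx = (gt[0] + gt[2]) / 2.0
    g_cy = (gt[1] + gt[3]) / 2.0
    g_w = gt[2] - gt[0]
    g_h = gt[3] - gt[1]
    d_cx, d_cy, d_w, d_h = prior
    return (
        (g_cx - d_cx) / d_w / variances[0],  # center offset, scaled by prior width
        (g_cy - d_cy) / d_h / variances[0],  # center offset, scaled by prior height
        math.log(g_w / d_w) / variances[1],  # log ratio of widths
        math.log(g_h / d_h) / variances[1],  # log ratio of heights
    )


def decode(offsets, prior, variances=(0.1, 0.2)):
    """Inverse of encode: turn predicted offsets back into a corner-format box."""
    d_cx, d_cy, d_w, d_h = prior
    cx = d_cx + offsets[0] * variances[0] * d_w
    cy = d_cy + offsets[1] * variances[0] * d_h
    w = d_w * math.exp(offsets[2] * variances[1])
    h = d_h * math.exp(offsets[3] * variances[1])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Hypothetical boxes in normalized image coordinates.
prior = (0.5, 0.5, 0.2, 0.2)
gt = (0.42, 0.38, 0.66, 0.62)
offsets = encode(gt, prior)
restored = decode(offsets, prior)
print(offsets, restored)
```

At test time the network output goes through the decoding direction, which is why the two functions are shown as a pair.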




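The confidence loss $L_{conf}$ is the softmax loss over the class confidences $c$:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$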
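
For a single prior box, the softmax loss equals the log-sum-exp of the class scores minus the score of the target class, which is the form the loss code uses when ranking negatives. The sketch below checks this identity numerically in plain Python; `log_sum_exp` here is a scalar re-implementation written for illustration, not the repository's tensor version, and the scores are made up.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def softmax_loss(logits, target):
    """Cross-entropy of one prior: -log(softmax(logits)[target])."""
    return log_sum_exp(logits) - logits[target]


logits = [2.0, 1.0, 0.1]  # hypothetical class scores, index 0 = background
denominator = sum(math.exp(v) for v in logits)
probs = [math.exp(v) / denominator for v in logits]

loss = softmax_loss(logits, 0)
print(loss, -math.log(probs[0]))  # the two values agree
```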



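Hard negative mining: there are far more prior boxes than ground truth boxes to match, so the great majority of prior boxes are negatives and the positive and negative samples are badly imbalanced. SSD therefore uses hard negative mining: negatives are ranked by confidence loss and only the highest-loss ones are kept, so that the ratio of negatives to positives is at most 3:1.

MultiBoxLoss is implemented as follows: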
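
The ranking used for hard negative mining is an argsort applied twice: the first sort orders the priors by loss, the second turns that order into a rank for every prior. A minimal plain-Python sketch for one image follows; the function name `hard_negative_mask` and the sample losses are assumptions for illustration, while the default ratio of 3 follows the 3:1 negative-to-positive ratio named in the loss docstring.

```python
def hard_negative_mask(loss, pos, negpos_ratio=3):
    """loss: per-prior confidence loss; pos: per-prior flag, True for positives."""
    n = len(loss)
    # Zero the loss of positives so that they sink to the bottom of the ranking.
    masked = [0.0 if p else l for l, p in zip(loss, pos)]
    # First argsort (descending): loss_idx[r] is the prior with the r-th largest loss.
    loss_idx = sorted(range(n), key=lambda i: masked[i], reverse=True)
    # Second argsort: idx_rank[i] is the rank of prior i in that ordering.
    idx_rank = sorted(range(n), key=lambda r: loss_idx[r])
    num_neg = min(negpos_ratio * sum(pos), n - 1)
    return [rank < num_neg for rank in idx_rank]


# Hypothetical per-prior losses for one image with a single positive (index 1).
loss = [0.2, 2.5, 0.1, 1.7, 0.9, 3.0, 0.05, 0.4]
pos = [False, True, False, False, False, False, False, False]
neg = hard_negative_mask(loss, pos)
print(neg)
```

With one positive, the three negatives with the largest loss (indices 5, 3 and 4) are selected and the remaining negatives are ignored in the confidence loss.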


# -*- coding: utf-8 -*-
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from data import coco as cfg
from ..box_utils import match, log_sum_exp


class MultiBoxLoss(nn.Module):
    """SSD Weighted Loss Function
    Compute Targets:
        1) Produce Confidence Target Indices by matching  ground truth boxes
           with (default) 'priorboxes' that have jaccard index > threshold parameter
           (default threshold: 0.5).
        2) Produce localization target by 'encoding' variance into offsets of ground
           truth boxes and their matched  'priorboxes'.
        3) Hard negative mining to filter the excessive number of negative examples
           that comes with using a large number of default bounding boxes.
           (default negative:positive ratio 3:1)
    Objective Loss:
        L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N
        Where, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Loss
        weighted by α which is set to 1 by cross val.
        Args:
            c: class confidences,
            l: predicted boxes,
            g: ground truth boxes
            N: number of matched default boxes
        See: https://arxiv.org/pdf/1512.02325.pdf for more details.
    """

    # Jaccard index is same as IOU.
    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = cfg['variance']

    def forward(self, predictions, targets):
        """Multibox Loss
        Args:
            predictions (tuple): A tuple containing loc preds, conf preds,
            and prior boxes from SSD net.
                conf shape: torch.size(batch_size,num_priors,num_classes)
                loc shape: torch.size(batch_size,num_priors,4)
                priors shape: torch.size(num_priors,4)

            targets (tensor): Ground truth boxes and labels for a batch,
                shape: [batch_size,num_objs,5] (last idx is the label).
        """
        # predictions is the output of the whole SSD network:
        # the output of the localization layers,
        # the output of the confidence layers,
        # and the prior boxes
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)
        priors = priors[:loc_data.size(1), :]
        num_priors = (priors.size(0))
        num_classes = self.num_classes

        # match priors (default boxes) and ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :-1].data
            labels = targets[idx][:, -1].data
            defaults = priors.data
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        # loc_t: [num, 8732, 4] for SSD300, the encoded localization targets
        # conf_t: [num, 8732], the class label target of every prior box (0 = background)
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)

        
        pos = conf_t > 0  # mask of positive (matched) prior boxes
        num_pos = pos.sum(dim=1, keepdim=True)

        # Localization Loss (Smooth L1)
        # Shape: [batch,num_priors,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)  # localization predictions of the positives
        loc_t = loc_t[pos_idx].view(-1, 4)     # the matching encoded targets
        # Smooth L1 loss summed over all positives
        # (reduction='sum' replaces the deprecated size_average=False)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, reduction='sum')

        # Compute max conf across batch for hard negative mining

        # Flatten to [batch*num_priors, num_classes]; the size -1 is inferred from other dimensions
        batch_conf = conf_data.view(-1, self.num_classes)
        # Per-prior softmax loss: log-sum-exp minus the score of the target class
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

        # Hard Negative Mining
        loss_c = loss_c.view(num, -1)  # reshape first so the [num, num_priors] mask lines up
        loss_c[pos] = 0  # filter out pos boxes for now
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_pos = pos.long().sum(1, keepdim=True)
        num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)
        neg = idx_rank < num_neg.expand_as(idx_rank)

        # Confidence Loss Including Positive and Negative Examples
        pos_idx = pos.unsqueeze(2).expand_as(conf_data)
        neg_idx = neg.unsqueeze(2).expand_as(conf_data)
        conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1, self.num_classes)
        targets_weighted = conf_t[(pos+neg).gt(0)]
        loss_c = F.cross_entropy(conf_p, targets_weighted, reduction='sum')

        # Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / N

        N = num_pos.data.sum()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c