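Overall Architecture [1]

The ROI stage takes the 1000 proposal boxes selected by the RPN together with the multi-level feature maps output by the FPN and runs ROI pooling on them. It classifies the object in each box, regresses the proposal box offsets (deltas) a second time, and produces new scores, refined boxes and labels. Finally, non-maximum suppression (NMS) is applied once more: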

[Figure: overall architecture of the ROI stage]


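ROI processing on top of an FPN involves a few more steps than in the traditional Faster R-CNN and is somewhat more complex.

It consists of the following steps:

  1. Box ROI Pool: based on the area of each of the 1000 proposal boxes, choose the feature map level on which ROI pooling is performed.
  2. Box Head: two fully connected layers that further process the 7x7 feature map that ROI Align produces for each bounding box.
  3. Box Predictor: takes the Box Head output and performs classification and regression of the box position offsets.
  4. Postprocess Detection: applies softmax for the final classification, combines the regressed box offsets with the proposal boxes to obtain the adjusted detection boxes, and finally applies non-maximum suppression (NMS) to filter out the valid detections (scores, boxes and labels).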

Box ROI Pool

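The model can process several images at once and detect the objects in each of them. So the first step is convert_to_roi_format, which concatenates the proposal boxes of all images into a single tensor and tags each box with the index of its image, so that ROI Align can be run over the whole batch in one pass.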
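As a rough illustration of batching the proposals of several images for one ROI Align pass, the sketch below tags each box with the index of the image it belongs to. It uses plain Python lists instead of tensors, and `to_roi_format` is a hypothetical helper written for this example, not the torchvision function.

```python
def to_roi_format(boxes_per_image):
    """Flatten per-image box lists into rows of [image_idx, x1, y1, x2, y2]."""
    rois = []
    for image_idx, boxes in enumerate(boxes_per_image):
        for x1, y1, x2, y2 in boxes:
            rois.append([float(image_idx), x1, y1, x2, y2])
    return rois


# Hypothetical batch: image 0 has one proposal, image 1 has two.
batch = [
    [[10.0, 20.0, 110.0, 140.0]],
    [[0.0, 0.0, 50.0, 60.0], [30.0, 40.0, 300.0, 400.0]],
]
rois = to_roi_format(batch)
for roi in rois:
    print(roi)
```

The leading index is what lets a pooling operator look up the right image in the batch for every box.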

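setup_scales then configures the mapper object from the first four output feature maps (the smallest one, the pool level, is not used). The mapper is a concept introduced by FPN: it computes the area of each proposal box and uses that area to decide on which feature map level ROI Align is run. See the formula in the paper: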

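$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh} / 224\right) \right\rfloor$$

where w and h are the width and height of the box and k_0 = 4 is the canonical level.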


The corresponding implementation is:

class LevelMapper(object):
    """Determine which FPN level each RoI in a set of RoIs should map to based
    on the heuristic in the FPN paper.

    Arguments:
        k_min (int)
        k_max (int)
        canonical_scale (int)
        canonical_level (int)
        eps (float)
    """

    def __init__(self, k_min, k_max, canonical_scale=224, canonical_level=4, eps=1e-6):
        # type: (int, int, int, int, float) -> None
        self.k_min = k_min
        self.k_max = k_max
        self.s0 = canonical_scale
        self.lvl0 = canonical_level
        self.eps = eps

    def __call__(self, boxlists):
        # type: (List[Tensor]) -> Tensor
        """
        Arguments:
            boxlists (list[BoxList])
        """
        # Compute level ids
        s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))

        # Eqn.(1) in FPN paper
        target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
        target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
        return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)

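For example, for a proposal box with width 100 and height 120, the formula above gives:

$$k = \left\lfloor 4 + \log_2\left(\sqrt{100 \times 120} / 224\right) \right\rfloor = \left\lfloor 4 + \log_2(109.54 / 224) \right\rfloor = \lfloor 2.97 \rfloor = 2$$
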
The table below shows how the ResNet50 layers correspond to the FPN levels; see "libtorch Study Notes (17): ResNet50 FPN and How It Is Applied to Faster-RCNN".

ResNet Layer Name   ResNet Level (k)   FPN Level   Minimum Area
conv1               1                  n/a         n/a
conv2_x             2                  0
conv3_x             3                  1
conv4_x             4                  2
conv5_x             5                  3
n/a                 n/a                pool

So this proposal box takes its features from feature map level #0 (2 - 2 = 0) for RoI Align [2].
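The level assignment can be checked by hand with a few lines of plain Python. This is a minimal sketch of the same heuristic for a single box, assuming k_min = 2 and k_max = 5 as in the table; `map_level` is a hypothetical name, and the function works on scalars rather than tensors.

```python
import math


def map_level(width, height, k_min=2, k_max=5, s0=224, lvl0=4, eps=1e-6):
    """Return the FPN feature map index (0-based) for one box."""
    s = math.sqrt(width * height)                   # square root of the box area
    k = math.floor(lvl0 + math.log2(s / s0) + eps)  # Eqn. (1) in the FPN paper
    k = min(max(k, k_min), k_max)                   # clamp to the available levels
    return k - k_min                                # shift so the first level is 0


for w, h in [(100, 120), (224, 224), (1000, 1000)]:
    print((w, h), "-> feature map", map_level(w, h))
```

Small boxes land on the high-resolution level 0, while a box at the canonical 224x224 scale lands on level 2 and very large boxes are clamped to the last level.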

Box Head

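This part consists of two fully connected layers. Its output is used by the subsequent prediction module for classification and bounding-box delta prediction: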

class TwoMLPHead(nn.Module):
    """
    Standard heads for FPN-based models

    Arguments:
        in_channels (int): number of input channels
        representation_size (int): size of the intermediate representation
    """

    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()

        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)

        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))

        return x
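To get a feel for the sizes involved, the arithmetic below works out the flattened input width and the parameter counts of the two layers. The 7x7 pooled size comes from the text above; the 256 channels and the representation size of 1024 are assumptions made for this example, and `linear_params` is a hypothetical helper.

```python
# Assumed sizes for illustration: 256 channels, 7x7 pooled map, 1024 hidden units.
channels, pooled_h, pooled_w = 256, 7, 7
representation_size = 1024

# x.flatten(start_dim=1) turns (N, C, 7, 7) into (N, C * 7 * 7).
in_features = channels * pooled_h * pooled_w


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


fc6_params = linear_params(in_features, representation_size)
fc7_params = linear_params(representation_size, representation_size)
print(in_features, fc6_params, fc7_params)
```

Under these assumptions, almost all of the head's parameters sit in fc6, because it consumes the whole flattened feature map.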

Box Predictor

This part classifies the 1000 proposal boxes and adjusts them once more to obtain more precise boxes.

class FastRCNNPredictor(nn.Module):
    """
    Standard classification + bounding box regression layers
    for Fast R-CNN.

    Arguments:
        in_channels (int): number of input channels
        num_classes (int): number of output classes (including background)
    """

    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)

        return scores, bbox_deltas
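Because bbox_pred has num_classes * 4 outputs, every proposal gets a separate set of four deltas per class. The sketch below shows one way to read a single output row, assuming the four values of each class are stored contiguously; the class count of 3 and the helper `deltas_for_class` are hypothetical.

```python
num_classes = 3  # hypothetical: background plus two object classes

# One row of bbox_deltas for a single proposal, filled with its own indices.
row = [float(i) for i in range(num_classes * 4)]


def deltas_for_class(row, class_idx):
    """Return the (dx, dy, dw, dh) slice that belongs to one class."""
    return row[class_idx * 4:(class_idx + 1) * 4]


for c in range(num_classes):
    print("class", c, deltas_for_class(row, c))
```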

Postprocess Detection

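First, the bbox deltas predicted by the Box Predictor are combined with the proposal boxes to obtain the top-left and bottom-right corner coordinates of each box. The algorithm is similar to the one used in the RPN; see Box-Coder.Decode, which describes it in detail.

    def postprocess_detections(self,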
                               class_logits,    # type: Tensor
                               box_regression,  # type: Tensor
                               proposals,       # type: List[Tensor]
                               image_shapes     # type: List[Tuple[int, int]]
                               ):
        # type: (...) -> Tuple[List[Tensor], List[Tensor], List[Tensor]]
        device = class_logits.device
        num_classes = class_logits.shape[-1]

        boxes_per_image = [boxes_in_image.shape[0] for boxes_in_image in proposals]
        pred_boxes = self.box_coder.decode(box_regression, proposals)
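The decoding step can be sketched for a single box in plain Python. The weights (10, 10, 5, 5) and the clamp on dw and dh are assumptions made for this example, `decode_box` is a hypothetical name, and the function handles one box and one set of deltas rather than a tensor.

```python
import math


def decode_box(proposal, deltas, weights=(10.0, 10.0, 5.0, 5.0),
               clip=math.log(1000.0 / 16)):
    """Apply (dx, dy, dw, dh) deltas to an (x1, y1, x2, y2) proposal."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    dx, dy = deltas[0] / weights[0], deltas[1] / weights[1]
    dw = min(deltas[2] / weights[2], clip)  # clamp to avoid huge exp() values
    dh = min(deltas[3] / weights[3], clip)

    new_cx, new_cy = dx * w + cx, dy * h + cy      # shift the centre
    new_w, new_h = math.exp(dw) * w, math.exp(dh) * h  # rescale the size
    return [new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h]


print(decode_box([0.0, 0.0, 10.0, 20.0], [0.0, 0.0, 0.0, 0.0]))
print(decode_box([0.0, 0.0, 10.0, 20.0], [10.0, 0.0, 0.0, 0.0]))
```

Zero deltas return the proposal unchanged, and a dx of 10 with weight 10 shifts the box by exactly one box width.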

Next, softmax is applied to the classification logits:

        pred_scores = F.softmax(class_logits, -1)
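For reference, softmax over the class logits of one proposal can be written in plain Python as below. This is a minimal scalar sketch with made-up logits, not the tensor implementation.

```python
import math


def softmax(logits):
    """Turn raw class logits into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for three classes
print(probs, sum(probs))
```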

Then the boxes, scores and image_shape of each image are taken in turn:

        pred_boxes_list = pred_boxes.split(boxes_per_image, 0)
        pred_scores_list = pred_scores.split(boxes_per_image, 0)

        all_boxes = []
        all_scores = []
        all_labels = []
        for boxes, scores, image_shape in zip(pred_boxes_list, pred_scores_list, image_shapes):

Then the coordinates are clipped from 800x1216 to 800x1202; for details see "torchvision Faster-RCNN ResNet-50 FPN Code Analysis (Image Transformation and Coordinates)".

            boxes = box_ops.clip_boxes_to_image(boxes, image_shape)
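Clipping can be illustrated for a single box as follows. The sketch assumes image_shape is given as (height, width); `clip_box_to_image` is a hypothetical scalar helper and the sample box is made up.

```python
def clip_box_to_image(box, image_shape):
    """Clamp an (x1, y1, x2, y2) box to an image given as (height, width)."""
    height, width = image_shape

    def clamp(value, upper):
        return min(max(value, 0.0), float(upper))

    x1, y1, x2, y2 = box
    return [clamp(x1, width), clamp(y1, height),
            clamp(x2, width), clamp(y2, height)]


# A box that sticks out of an 800x1202 image on the left, right and bottom.
clipped = clip_box_to_image([-5.0, 10.0, 1210.0, 805.0], (800, 1202))
print(clipped)
```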

Create a labels tensor that holds the label index of each detection bbox that survives the filtering:

            # create labels for each prediction
            labels = torch.arange(num_classes, device=device)
            labels = labels.view(1, -1).expand_as(scores)

Remove the background labels, scores and boxes:

            # remove predictions with the background label
            boxes = boxes[:, 1:]
            scores = scores[:, 1:]
            labels = labels[:, 1:]

Remove low-scoring detections; here score_thresh is 0.05:

            # batch everything, by making every class prediction be a separate instance
            boxes = boxes.reshape(-1, 4)
            scores = scores.reshape(-1)
            labels = labels.reshape(-1)

            # remove low scoring boxes
            inds = torch.nonzero(scores > self.score_thresh).squeeze(1)
            boxes, scores, labels = boxes[inds], scores[inds], labels[inds]
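The combined effect of dropping the background column, flattening the per-class predictions and thresholding can be sketched with nested lists. The scores below are made up, and the sketch keeps (proposal, label, score) triples instead of reshaped tensors.

```python
score_thresh = 0.05

# Rows are proposals, columns are classes; column 0 is the background class.
scores = [
    [0.90, 0.08, 0.02],
    [0.10, 0.03, 0.87],
]

candidates = []
for proposal_idx, row in enumerate(scores):
    for label, score in enumerate(row):
        if label == 0:
            continue  # skip the background class
        if score > score_thresh:
            # every (proposal, class) pair is treated as its own detection
            candidates.append((proposal_idx, label, score))

print(candidates)
```

A single proposal can therefore contribute several candidate detections, one for each foreground class whose score passes the threshold.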

Remove empty boxes:

            # remove empty boxes
            keep = box_ops.remove_small_boxes(boxes, min_size=1e-2)
            boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
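A box counts as empty here when its width or height falls below min_size. A minimal list-based sketch of that check, with the hypothetical helper `keep_large_enough` and made-up boxes:

```python
def keep_large_enough(boxes, min_size=1e-2):
    """Return indices of boxes whose width and height are both >= min_size."""
    keep = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if (x2 - x1) >= min_size and (y2 - y1) >= min_size:
            keep.append(i)
    return keep


boxes = [
    [0.0, 0.0, 10.0, 10.0],   # normal box
    [5.0, 5.0, 5.0, 9.0],     # zero width
    [1.0, 1.0, 1.005, 3.0],   # width below min_size
]
kept = keep_large_enough(boxes)
print(kept)
```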

After these steps the boxes and labels look roughly like this:

[Figure: detection boxes and labels before NMS]

Finally, non-maximum suppression (NMS [3]) removes the overlapping boxes:

            # non-maximum suppression, independently done per class
            keep = box_ops.batched_nms(boxes, scores, labels, self.nms_thresh)
            # keep only topk scoring predictions
            keep = keep[:self.detections_per_img]
            boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
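The idea behind per-class NMS followed by a top-k cut can be sketched in plain Python. This is a simple quadratic greedy version written for illustration, not the C extension torchvision uses; the function names mirror the operation but are local to this example, and the boxes, threshold and top_k values are made up. It assumes empty boxes have already been removed, so the union is never zero.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def batched_nms(boxes, scores, labels, iou_thresh, top_k):
    """Greedy NMS where only boxes with the same label suppress each other."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        suppressed = any(
            labels[i] == labels[j] and iou(boxes[i], boxes[j]) > iou_thresh
            for j in keep
        )
        if not suppressed:
            keep.append(i)
    return keep[:top_k]  # indices sorted by descending score


boxes = [[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0],
         [0.0, 0.0, 10.0, 10.0], [50.0, 50.0, 60.0, 60.0]]
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 1, 2, 1]
print(batched_nms(boxes, scores, labels, iou_thresh=0.5, top_k=10))
```

Box 1 is dropped because it overlaps the higher-scoring box 0 of the same class, while box 2 survives despite identical coordinates because it carries a different label.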

The resulting boxes and labels are:

[Figure: detection boxes and labels after NMS]

Conclusion

After ROI processing, the detected objects are already fairly accurate, and each detection also carries a score, for example:

[
	0.9996865, 0.999302, 0.9909377, 
	0.964582, 0.8458481, 0.79095364, 
	0.3160024, 0.16850659, 0.16231589, 
	0.106609166, 0.07780073, 0.07285354, 0.06343418
]

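Low-scoring detections can be filtered out further, for example by setting a threshold of 0.5: detections that score above the threshold are the final detected objects.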
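Applied to the scores listed above, a threshold of 0.5 could be implemented like this (a small sketch; `final_thresh` is a name chosen for this example):

```python
scores = [
    0.9996865, 0.999302, 0.9909377,
    0.964582, 0.8458481, 0.79095364,
    0.3160024, 0.16850659, 0.16231589,
    0.106609166, 0.07780073, 0.07285354, 0.06343418,
]
final_thresh = 0.5

# Indices of the detections that count as final results.
keep = [i for i, s in enumerate(scores) if s > final_thresh]
print(len(keep), "of", len(scores), "detections kept:", keep)
```

With this threshold the first six detections remain and the seven low-scoring ones are discarded.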


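  1. Assume the original image is 599x900. After the transform the input image is 800x1202, and after padding the size processed by the backbone network is 800x1216.
  2. In torchvision, the ROI Align algorithm is implemented as a Python C extension.
  3. In torchvision, NMS is also implemented as a Python C extension.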