Overall Architecture¹
The ROI stage takes the 1000 proposal boxes selected by the RPN, together with the multi-level feature maps output by the FPN, and performs ROI pooling on them. It classifies the object contained in each box and regresses offset (delta) values for the proposal boxes once more, producing new scores, refined boxes, and labels, and finally applies non-maximum suppression (NMS) again.
ROI processing based on FPN involves a few extra steps compared with the classic Faster R-CNN and is somewhat more complex.
It mainly consists of the following steps:
- Box ROI Pool: based on the area of each of the 1000 proposal boxes, decide which feature-map level the ROI pooling (ROI Align) should run on
- Box Head: two fully connected layers that further process the 7x7 per-box features produced by ROI Align
- Box Predictor: takes the Box Head output and performs classification as well as numeric regression of the box position offsets (deltas)
- Postprocess Detection: applies softmax for the final classification, merges the box-regression results with the proposal boxes to obtain the adjusted detection boxes, and finally runs non-maximum suppression (NMS) to filter out the valid detections (scores, boxes, and labels); see the condensed sketch after this list
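Strung together, the four steps above correspond to the inference path of torchvision's RoIHeads.forward. The following is a condensed sketch of that path (training branches and validation checks omitted), not the verbatim source:

```python
# Condensed sketch of torchvision's RoIHeads.forward at inference time
# (training branches and shape checks omitted; not the verbatim source):
def roi_heads_forward(self, features, proposals, image_shapes):
    # Step 1: multi-scale ROI Align over the FPN feature maps
    box_features = self.box_roi_pool(features, proposals, image_shapes)
    # Step 2: two fully connected layers (TwoMLPHead)
    box_features = self.box_head(box_features)
    # Step 3: classification scores + per-class box deltas
    class_logits, box_regression = self.box_predictor(box_features)
    # Step 4: softmax, box decoding, score filtering, NMS
    boxes, scores, labels = self.postprocess_detections(
        class_logits, box_regression, proposals, image_shapes)
    return boxes, scores, labels
```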
Box ROI Pool
This model can process several images at once and detect the objects in each of them. The proposal boxes of all the images are therefore first merged into a single tensor by convert_to_roi_format, with each box row prefixed by the index of the image it belongs to, so that ROI Align can process the whole batch in one call; a sketch of this helper follows.
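A simplified sketch of what convert_to_roi_format does, written as a free function for illustration (torchvision implements it as a method of MultiScaleRoIAlign):

```python
import torch
from typing import List
from torch import Tensor

def convert_to_roi_format(boxes: List[Tensor]) -> Tensor:
    # Concatenate the per-image [N_i, 4] box tensors and prepend each row
    # with the index of the image it came from, yielding a [sum(N_i), 5]
    # tensor that a single ROI Align call can consume.
    concat_boxes = torch.cat(boxes, dim=0)
    device, dtype = concat_boxes.device, concat_boxes.dtype
    ids = torch.cat(
        [torch.full((b.shape[0], 1), i, dtype=dtype, device=device)
         for i, b in enumerate(boxes)],
        dim=0,
    )
    return torch.cat([ids, concat_boxes], dim=1)
```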
setup_scales then configures the mapper object for the first four output feature maps (the smallest one, the extra pool level, is not used). The mapper is a concept introduced with FPN: it computes the area of each proposal box and uses it to decide on which feature-map level the ROI Align should run, following Eqn. (1) of the FPN paper:

k = ⌊k₀ + log₂(√(wh)/224)⌋

where w and h are the box width and height, 224 is the canonical ImageNet pretraining size, and k₀ = 4.
The corresponding implementation is:
```python
import torch
from torch import Tensor
from typing import List
from torchvision.ops.boxes import box_area


class LevelMapper(object):
    """Determine which FPN level each RoI in a set of RoIs should map to based
    on the heuristic in the FPN paper.

    Arguments:
        k_min (int)
        k_max (int)
        canonical_scale (int)
        canonical_level (int)
        eps (float)
    """

    def __init__(self, k_min, k_max, canonical_scale=224, canonical_level=4, eps=1e-6):
        # type: (int, int, int, int, float) -> None
        self.k_min = k_min
        self.k_max = k_max
        self.s0 = canonical_scale
        self.lvl0 = canonical_level
        self.eps = eps

    def __call__(self, boxlists):
        # type: (List[Tensor]) -> Tensor
        """
        Arguments:
            boxlists (list[BoxList])
        """
        # Compute level ids
        s = torch.sqrt(torch.cat([box_area(boxlist) for boxlist in boxlists]))
        # Eqn.(1) in FPN paper
        target_lvls = torch.floor(self.lvl0 + torch.log2(s / self.s0) + torch.tensor(self.eps, dtype=s.dtype))
        target_lvls = torch.clamp(target_lvls, min=self.k_min, max=self.k_max)
        return (target_lvls.to(torch.int64) - self.k_min).to(torch.int64)
```
For example, given a proposal box with width 100 and height 120, the formula yields:

k = ⌊4 + log₂(√(100 × 120)/224)⌋ = ⌊4 + log₂(109.54/224)⌋ = ⌊4 − 1.03⌋ = 2
The table below lists the correspondence between ResNet50 and FPN levels (see libtorch学习笔记(17)- ResNet50 FPN以及如何应用于Faster-RCNN); the minimum box scale √(w·h) for each level follows from Eqn. (1):
| ResNet Layer Name | ResNet Level (k) | FPN Level | Minimum √(w·h) |
| --- | --- | --- | --- |
| conv1 | 1 | n/a | n/a |
| conv2_x | 2 | 0 | n/a (smaller boxes are clamped here) |
| conv3_x | 3 | 1 | 112 |
| conv4_x | 4 | 2 | 224 |
| conv5_x | 5 | 3 | 448 |
| n/a | n/a | pool | n/a (not used by the ROI stage) |
So this proposal box fetches its features from feature map level #0 (2 − 2 = 0) for the RoI Align² step.
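A quick numeric check of this example, assuming the LevelMapper class above with k_min=2 and k_max=5 (the values used for ResNet-50 FPN):

```python
import torch

# k_min=2, k_max=5: ResNet levels 2..5 map to FPN levels 0..3.
mapper = LevelMapper(k_min=2, k_max=5)
boxes = torch.tensor([[0.0, 0.0, 100.0, 120.0]])  # (x1, y1, x2, y2): a 100x120 box
print(mapper([boxes]))  # tensor([0]) -> FPN level #0
```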
Box Head
This part consists of two fully connected layers; its output is consumed by the subsequent prediction module for classification and bounding-box delta prediction:
```python
import torch.nn as nn
import torch.nn.functional as F


class TwoMLPHead(nn.Module):
    """
    Standard heads for FPN-based models

    Arguments:
        in_channels (int): number of input channels
        representation_size (int): size of the intermediate representation
    """

    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()
        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        return x
```
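A shape check under assumed defaults (ROI Align outputs 256-channel 7x7 features per box and representation_size is 1024, as in torchvision's fasterrcnn_resnet50_fpn):

```python
import torch

head = TwoMLPHead(in_channels=256 * 7 * 7, representation_size=1024)
roi_features = torch.randn(1000, 256, 7, 7)  # features of 1000 proposal boxes
print(head(roi_features).shape)              # torch.Size([1000, 1024])
```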
Box Predictor
This part classifies each of the 1000 proposal boxes and regresses a further adjustment to obtain more accurate boxes.
```python
import torch.nn as nn


class FastRCNNPredictor(nn.Module):
    """
    Standard classification + bounding box regression layers
    for Fast R-CNN.

    Arguments:
        in_channels (int): number of input channels
        num_classes (int): number of output classes (including background)
    """

    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)
        return scores, bbox_deltas
```
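A shape check under assumed values (1024-dimensional Box Head output; torchvision's COCO-trained model uses 91 classes including background):

```python
import torch

predictor = FastRCNNPredictor(in_channels=1024, num_classes=91)
x = torch.randn(1000, 1024)             # Box Head output
scores, bbox_deltas = predictor(x)
print(scores.shape)                     # torch.Size([1000, 91])
print(bbox_deltas.shape)                # torch.Size([1000, 364]) = 91 * 4 per box
```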
Postprocess Detection
First, the bbox delta values predicted by the Box Predictor are merged with the proposal boxes to obtain the top-left and bottom-right coordinates of each decoded box. The algorithm is the same as in the RPN; see Box-Coder.Decode, which covers it in detail.
```python
def postprocess_detections(self,
                           class_logits,    # type: Tensor
                           box_regression,  # type: Tensor
                           proposals,       # type: List[Tensor]
                           image_shapes     # type: List[Tuple[int, int]]
                           ):
    # type: (...) -> Tuple[List[Tensor], List[Tensor], List[Tensor]]
    device = class_logits.device
    num_classes = class_logits.shape[-1]

    boxes_per_image = [boxes_in_image.shape[0] for boxes_in_image in proposals]
    pred_boxes = self.box_coder.decode(box_regression, proposals)
```
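For reference, this is the per-box math that decode applies: a simplified scalar sketch of torchvision's BoxCoder.decode (the delta-normalization weights are omitted here):

```python
import math

def decode_single(proposal, deltas):
    # proposal: (x1, y1, x2, y2); deltas: (dx, dy, dw, dh)
    px1, py1, px2, py2 = proposal
    dx, dy, dw, dh = deltas
    pw, ph = px2 - px1, py2 - py1                # proposal width/height
    pcx, pcy = px1 + 0.5 * pw, py1 + 0.5 * ph    # proposal center
    cx, cy = dx * pw + pcx, dy * ph + pcy        # shifted center
    w, h = pw * math.exp(dw), ph * math.exp(dh)  # rescaled width/height
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```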
Next, a softmax is applied to the classification logits:
```python
    pred_scores = F.softmax(class_logits, -1)
```
Then each image's boxes and scores are split out and iterated over together with its image_shape:
```python
    pred_boxes_list = pred_boxes.split(boxes_per_image, 0)
    pred_scores_list = pred_scores.split(boxes_per_image, 0)

    all_boxes = []
    all_scores = []
    all_labels = []
    for boxes, scores, image_shape in zip(pred_boxes_list, pred_scores_list, image_shapes):
```
Inside the loop, the coordinates are first clipped from the padded 800x1216 back to the 800x1202 input; see torchvision Faster-RCNN ResNet-50 FPN代码解析(图片转换和坐标) for details:
```python
        boxes = box_ops.clip_boxes_to_image(boxes, image_shape)
```
A labels tensor is then created to hold the label index for each detection bbox:
```python
        # create labels for each prediction
        labels = torch.arange(num_classes, device=device)
        labels = labels.view(1, -1).expand_as(scores)
```
The background labels, scores, and boxes are removed:
```python
        # remove predictions with the background label
        boxes = boxes[:, 1:]
        scores = scores[:, 1:]
        labels = labels[:, 1:]
```
Low-scoring detections are removed; here score_thresh is 0.05:
```python
        # batch everything, by making every class prediction be a separate instance
        boxes = boxes.reshape(-1, 4)
        scores = scores.reshape(-1)
        labels = labels.reshape(-1)

        # remove low scoring boxes
        inds = torch.nonzero(scores > self.score_thresh).squeeze(1)
        boxes, scores, labels = boxes[inds], scores[inds], labels[inds]
```
Empty (degenerate) boxes are removed:
```python
        # remove empty boxes
        keep = box_ops.remove_small_boxes(boxes, min_size=1e-2)
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
```
After these steps we are left with per-image boxes, scores, and labels that still contain overlapping detections.
Finally, non-maximum suppression (NMS³) prunes the overlapping boxes:
```python
        # non-maximum suppression, independently done per class
        keep = box_ops.batched_nms(boxes, scores, labels, self.nms_thresh)
        # keep only topk scoring predictions
        keep = keep[:self.detections_per_img]
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
```
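batched_nms performs NMS independently per class without looping over the classes; a sketch of the coordinate-offset trick torchvision uses for this:

```python
import torch
from torchvision.ops import nms

def batched_nms_sketch(boxes, scores, labels, iou_threshold):
    # Shift each class's boxes into a disjoint coordinate range so that a
    # single class-agnostic NMS call can never suppress across classes.
    max_coordinate = boxes.max()
    offsets = labels.to(boxes) * (max_coordinate + 1)
    boxes_for_nms = boxes + offsets[:, None]
    return nms(boxes_for_nms, scores, iou_threshold)
```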
The boxes, scores, and labels that survive this step are the final detections for the image.
Conclusion
After the ROI stage, the detected objects are already quite accurate, and each detection carries a confidence score, for example:
```
[
    0.9996865, 0.999302, 0.9909377,
    0.964582, 0.8458481, 0.79095364,
    0.3160024, 0.16850659, 0.16231589,
    0.106609166, 0.07780073, 0.07285354, 0.06343418
]
```
These detections can be filtered further by score: with a threshold of, say, 0.5, everything scoring above it is kept as a final detected object, as in the sketch below.
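A minimal sketch using the scores above (0.5 is a user-chosen threshold here, not a model default):

```python
import torch

scores = torch.tensor([0.9996865, 0.999302, 0.9909377, 0.964582,
                       0.8458481, 0.79095364, 0.3160024, 0.16850659,
                       0.16231589, 0.106609166, 0.07780073, 0.07285354,
                       0.06343418])
keep = scores > 0.5          # final user-side threshold
print(keep.sum().item())     # 6 detections survive
```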
1. Assuming the original image is 599x900, the transformed input image is 800x1202, which after padding becomes 800x1216 for processing by the backbone network.
2. The ROI Align algorithm in torchvision is implemented as a Python C extension.
3. NMS in torchvision is likewise implemented as a Python C extension.