Implementing YOLOv3 Step by Step (Part 2)

1 Introduction

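The previous post focused on the DarkNet53 network structure and the overall YOLOv3 architecture, and explained the corresponding detection heads.

This post is the second in the series. It covers the principles and implementation of YOLOv3's post-processing at prediction time.

Without further ado, let's get started. :)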

2 Understanding the output

Before going further, we need to be clear on the following points:

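  • The input to the YOLOv3 network has shape (m,416,416,3), where m is the number of images per batch. In this post m=1, so each batch processes one input image
  • YOLOv3 predicts at 3 scales, whose feature maps have sizes 13x13, 26x26 and 52x52 respectively
  • Each cell in YOLOv3 predicts 3 bounding boxes, and each bounding box can be represented as a 6-tuple $(b_x, b_y, b_w, b_h, p_c, c)$: the box center, the box width and height, the objectness confidence, and the class.
  • The COCO dataset has 80 classes, so we expand c into an 80-dimensional vector. Each bounding box is then represented by an 85-dimensional vector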
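
As a quick sanity check on these numbers, the sketch below computes the output shape at each scale and the total number of predicted boxes. The strides 32, 16 and 8 follow from the 13x13, 26x26 and 52x52 feature maps; the function name `output_shapes` is made up for illustration.

```python
def output_shapes(input_size=416, strides=(32, 16, 8), boxes_per_cell=3, num_classes=80):
    """Return the (h, w, channels) shape of each prediction scale and the total box count."""
    box_dim = 5 + num_classes            # x, y, w, h, confidence + class scores
    shapes = []
    total_boxes = 0
    for stride in strides:
        grid = input_size // stride      # 416 // 32 = 13, 416 // 16 = 26, 416 // 8 = 52
        shapes.append((grid, grid, boxes_per_cell * box_dim))
        total_boxes += grid * grid * boxes_per_cell
    return shapes, total_boxes


shapes, total = output_shapes()
print(shapes)  # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
print(total)   # 10647
```

So each head outputs 3 x 85 = 255 channels, and one 416x416 image yields 10647 candidate boxes before any filtering.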

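Let's use a figure to illustrate this further:

As shown in the figure above:

  • The input image is 416x416. Downsampling by a factor of 32 gives a 13x13 feature map. In other words, the input image is divided into a 13x13 grid, and each cell corresponds to a 32x32 region of the input image.
  • If a cell contains the center of an object's ground-truth box in the original image, that cell is responsible for predicting the object. In the figure above, the yellow box is the ground-truth box of the dog, and the red cell is responsible for predicting it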
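
To make the cell assignment concrete, here is a minimal sketch that finds the responsible cell for a ground-truth box at the 13x13 scale. The function name `responsible_cell` and the box coordinates are hypothetical and only serve as an illustration.

```python
def responsible_cell(box_xyxy, stride=32):
    """Return (col, row) of the grid cell that contains the center of a ground-truth box."""
    xmin, ymin, xmax, ymax = box_xyxy
    cx = (xmin + xmax) / 2.0
    cy = (ymin + ymax) / 2.0
    # Integer division by the stride maps pixel coordinates to grid indices
    return int(cx // stride), int(cy // stride)


# Hypothetical ground-truth box in a 416x416 image (pixel coordinates)
print(responsible_cell((100, 150, 300, 400)))  # (6, 8)
```

The box center is (200, 275), which falls into column 6 and row 8 of the 13x13 grid.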

3 Understanding the bounding box

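In the YOLOv3 architecture, 3 branches (feature maps at 3 different scales) are passed to the decode function for further parsing.

In the figure above, the black dashed box is the prior box (anchor), and the blue solid box is the predicted box.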

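  • $b_x, b_y, b_w, b_h$ are the center and the width and height of the predicted box
  • $t_x, t_y, t_w, t_h$ are the raw outputs of the network
  • $\sigma(t_x), \sigma(t_y)$ are the offsets of the predicted object center relative to the top-left corner of its cell
  • $p_w, p_h$ are the width and height of the prior box (anchor)
  • $c_x, c_y$ are the coordinates of the top-left corner of the cell that contains the object center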
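
Written out, the relationship between these quantities is the box parameterization from the YOLOv3 paper, with $(t_x, t_y, t_w, t_h)$ the raw network outputs, $(c_x, c_y)$ the top-left corner of the cell, $(p_w, p_h)$ the anchor size and $\sigma$ the sigmoid function:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```

The sigmoid keeps the center offset within (0, 1), so the predicted center stays inside its cell. In the decode function of this post, these values are in grid units and are multiplied by the stride of the scale to get pixel coordinates in the input image.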

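Based on the formulas above, we can write the corresponding decode function. The explanations are given as comments in the code, so we won't repeat them here.

The implementation is as follows: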

import torch

def decode(self, conv_layer, i=0):
    """
    Method of the model class: decode the raw output of one detection head.
    param: conv_layer  raw head output, shape (n, h, w, 255)
    param: i           index of the scale, selects self.strides[i] and self.anchors[i]
    return: tensor of shape (n, h, w, 3, 5 + num_class), boxes in input-image pixels
    """
    n, h, w, c = conv_layer.shape
    conv_output = conv_layer.reshape(n, h, w, 3, 5 + self.num_class)
    # split the output
    conv_raw_dxdy = conv_output[:, :, :, :, 0:2]  # raw offsets of the center position
    conv_raw_dwdh = conv_output[:, :, :, :, 2:4]  # raw offsets of the box width and height
    conv_raw_conf = conv_output[:, :, :, :, 4:5]  # raw confidence of the predicted box
    conv_raw_prob = conv_output[:, :, :, :, 5:]   # raw class scores of the predicted box

    # build the cell grid: 13x13, 26x26 or 52x52
    device = conv_layer.device
    yv, xv = torch.meshgrid(torch.arange(0, h, device=device),
                            torch.arange(0, w, device=device), indexing="ij")
    yv_new = yv.unsqueeze(dim=-1)
    xv_new = xv.unsqueeze(dim=-1)
    xy_grid = torch.cat([xv_new, yv_new], dim=-1)   # (x, y) index of each cell
    # reshape and repeat
    xy_grid = xy_grid.view(1, h, w, 1, 2)            # (13,13,2) --> (1,13,13,1,2)
    xy_grid = xy_grid.repeat(n, 1, 1, 3, 1).float()  # (1,13,13,1,2) --> (n,13,13,3,2)

    # compute the center position and width/height of the predicted box
    # self.anchors[i] is expected to be a (3, 2) tensor in grid units
    pred_xy = (torch.sigmoid(conv_raw_dxdy) + xy_grid) * self.strides[i]
    pred_wh = (torch.exp(conv_raw_dwdh) * self.anchors[i]) * self.strides[i]
    pred_xywh = torch.cat([pred_xy, pred_wh], dim=-1)
    # confidence and class probabilities
    pred_conf = torch.sigmoid(conv_raw_conf)
    pred_prob = torch.sigmoid(conv_raw_prob)

    return torch.cat([pred_xywh, pred_conf, pred_prob], dim=-1)
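
The decoded boxes are in (center x, center y, width, height) form, while the NMS function in the next section expects rows of (xmin, ymin, xmax, ymax, score, class). The sketch below shows one common way to bridge the two. It assumes the decoded output has been flattened to shape (N, 5 + num_class) and that the score is the confidence multiplied by the best class probability; the function name `to_nms_input` and the threshold value are illustrative, not taken from the original code.

```python
import numpy as np


def to_nms_input(pred, score_threshold=0.3):
    """Convert decoded predictions (N, 5 + num_class) into rows of
    (xmin, ymin, xmax, ymax, score, class) and drop low-scoring boxes."""
    pred = np.asarray(pred, dtype=np.float32)
    xy, wh = pred[:, 0:2], pred[:, 2:4]
    conf, prob = pred[:, 4], pred[:, 5:]

    # center/size -> corner coordinates
    corners = np.concatenate([xy - wh * 0.5, xy + wh * 0.5], axis=-1)

    # Assumption: score = objectness confidence * best class probability
    classes = np.argmax(prob, axis=-1)
    scores = conf * prob[np.arange(len(pred)), classes]

    keep = scores > score_threshold
    return np.concatenate(
        [corners[keep],
         scores[keep][:, np.newaxis],
         classes[keep][:, np.newaxis].astype(np.float32)],
        axis=-1,
    )


# Two hypothetical decoded boxes with 2 classes each: (cx, cy, w, h, conf, p0, p1)
demo = [[100, 100, 40, 20, 0.9, 0.1, 0.8],
        [50, 50, 10, 10, 0.2, 0.5, 0.5]]
print(to_nms_input(demo))
```

Only the first box survives here: its score is 0.9 x 0.8 = 0.72, while the second box scores 0.1 and falls below the threshold.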

4 NMS post-processing

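A previous post explained NMS in detail. Here is a brief summary of its core idea:

  • Select the box with the highest confidence from the candidate boxes
  • Compute the IOU between this box and each of the other boxes; if IOU > iou_threshold, remove that box
  • Repeat the steps above until no remaining box overlaps a selected box with an IOU above the threshold

The purpose of this process is to filter out boxes that overlap heavily with the current box, so that for each object only the box with the highest predicted confidence is kept. See the figure below:

The code for non-maximum suppression is as follows: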

import numpy as np

def nms(bboxes, iou_threshold, sigma=0.3, method='nms'):
    """
    :param bboxes: array of shape (N, 6), rows are (xmin, ymin, xmax, ymax, score, class)
    :param iou_threshold: IOU above which a box is suppressed (used by 'nms')
    :param sigma: decay parameter of the Gaussian penalty (used by 'soft-nms')
    :return: list of the kept boxes
    Note: soft-nms, https://arxiv.org/pdf/1704.04503.pdf
          https://github.com/bharatsingh430/soft-nms
    bboxes_iou is a helper defined outside this function; it returns the IOU
    between one box and an array of boxes.
    """
    assert method in ['nms', 'soft-nms']

    classes_in_img = list(set(bboxes[:, 5]))
    best_bboxes = []

    # NMS is applied to each class separately
    for cls in classes_in_img:
        cls_mask = (bboxes[:, 5] == cls)
        cls_bboxes = bboxes[cls_mask]
        # Step 1: continue while there are boxes left for this class
        while len(cls_bboxes) > 0:
            # Step 2: select the box with the highest score (box A) and keep it
            max_ind = np.argmax(cls_bboxes[:, 4])
            best_bbox = cls_bboxes[max_ind]
            best_bboxes.append(best_bbox)
            cls_bboxes = np.concatenate([cls_bboxes[: max_ind], cls_bboxes[max_ind + 1:]])
            # Step 3: compute the IOU between box A and all remaining boxes,
            # then suppress the boxes whose IOU is above the threshold
            iou = bboxes_iou(best_bbox[np.newaxis, :4], cls_bboxes[:, :4])
            weight = np.ones((len(iou),), dtype=np.float32)

            if method == 'nms':
                # hard NMS: zero the score of heavily overlapping boxes
                iou_mask = iou > iou_threshold
                weight[iou_mask] = 0.0

            if method == 'soft-nms':
                # soft NMS: decay the score with a Gaussian penalty instead of removing the box
                weight = np.exp(-(1.0 * iou ** 2 / sigma))

            cls_bboxes[:, 4] = cls_bboxes[:, 4] * weight
            score_mask = cls_bboxes[:, 4] > 0.
            cls_bboxes = cls_bboxes[score_mask]
    return best_bboxes
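
The function above calls a helper named `bboxes_iou` that is not listed in this post. The following is a minimal sketch of what such a helper could look like, assuming boxes in (xmin, ymin, xmax, ymax) format and relying on NumPy broadcasting. It is one possible version and may differ from the one in the repository.

```python
import numpy as np


def bboxes_iou(boxes1, boxes2):
    """IoU between boxes in (xmin, ymin, xmax, ymax) format, using NumPy broadcasting."""
    boxes1 = np.asarray(boxes1, dtype=np.float32)
    boxes2 = np.asarray(boxes2, dtype=np.float32)

    area1 = (boxes1[..., 2] - boxes1[..., 0]) * (boxes1[..., 3] - boxes1[..., 1])
    area2 = (boxes2[..., 2] - boxes2[..., 0]) * (boxes2[..., 3] - boxes2[..., 1])

    # Corners of the intersection rectangle
    left_up = np.maximum(boxes1[..., :2], boxes2[..., :2])
    right_down = np.minimum(boxes1[..., 2:], boxes2[..., 2:])

    inter_wh = np.maximum(right_down - left_up, 0.0)  # zero when the boxes do not overlap
    inter_area = inter_wh[..., 0] * inter_wh[..., 1]
    union_area = area1 + area2 - inter_area

    # Guard against division by zero for degenerate boxes
    return inter_area / np.maximum(union_area, np.finfo(np.float32).eps)


a = np.array([[0, 0, 10, 10]])
others = np.array([[0, 0, 10, 10], [5, 5, 15, 15], [20, 20, 30, 30]])
ious = bboxes_iou(a, others)
print(ious)  # approximately [1.0, 0.143, 0.0]
```

With iou_threshold=0.5, hard NMS would suppress only the first of these three boxes. Note that with method='soft-nms' scores are decayed but never reach zero, so a score threshold is still needed afterwards.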

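5 Loading the pretrained weights

To get this prediction code running, we load the officially trained YOLOv3 weights.

First download the weights file, as shown below: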

wget https://pjreddie.com/media/files/yolov3.weights

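Next, let's look at how the official YOLOv3 weights file is organized:

  • The official weights file is a binary file that stores the weights serially. The weights are stored as plain floating-point numbers, with nothing to tell us which layer they belong to.
  • The YOLOv3 weights belong to only two types of layers: BN layers and convolutional layers. The weights are stored in exactly the order in which the layers appear in the configuration file. When a BN layer appears in a convolutional block, the convolution has no bias. When a convolutional layer is not followed by a BN layer, that convolution has a bias.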
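
Before writing the loader, it can help to look at the raw file. The sketch below reads the header and counts the float32 values that follow, using only the standard library. It assumes a 20-byte header of five little-endian int32 values, which is how the loading function in this post reads it; the function name `inspect_darknet_weights` is hypothetical, and a tiny fake file stands in for the real weights.

```python
import os
import struct
import tempfile


def inspect_darknet_weights(path):
    """Read the 20-byte header and count the float32 values that follow."""
    with open(path, "rb") as f:
        # Assumption: the header is five int32 values, as read by the loader in this post
        header = struct.unpack("<5i", f.read(20))
    num_floats = (os.path.getsize(path) - 20) // 4
    return header, num_floats


# Build a tiny fake file with the same layout to exercise the function
with tempfile.NamedTemporaryFile(suffix=".weights", delete=False) as tmp:
    tmp.write(struct.pack("<5i", 0, 2, 0, 0, 0))   # header
    tmp.write(struct.pack("<3f", 0.1, 0.2, 0.3))   # three "weights"
    fake_path = tmp.name

header, num_floats = inspect_darknet_weights(fake_path)
print(header, num_floats)  # (0, 2, 0, 0, 0) 3
os.remove(fake_path)
```

Running the same function on the downloaded yolov3.weights gives the number of floats the loader has to consume, which is a handy cross-check against the parameter count of the model.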

With that understanding, let's write the loading function:

import numpy as np
import torch

def read_param_from_file(yolo_ckpt, model):
    with open(yolo_ckpt, 'rb') as wf:
        # file header: five int32 values
        major, minor, revision, seen, _ = np.fromfile(wf, dtype=np.int32, count=5)
        print("version major={} minor={} revision={} and pic_seen={}".format(major, minor, revision, seen))

        model_dict = model.state_dict()
        key_list = list(model_dict.keys())
        # a conv+BN block has 6 state_dict entries, in this order:
        # conv weight, bn weight, bn bias, bn running mean, bn running var, bn batch counter
        num = 6
        length = len(key_list) // num
        pre_index = 0
        for i in range(length + 2):
            cur_list = key_list[pre_index:pre_index + num]
            conv_name = cur_list[0]
            conv_layer = model_dict[conv_name]
            filters = conv_layer.shape[0]
            in_dim = conv_layer.shape[1]
            k_size = conv_layer.shape[2]
            conv_shape = (filters, in_dim, k_size, k_size)
            # print("i={} and list={} and conv_name={} and shape={}".format(i, cur_list, conv_name, conv_shape))
            if len(cur_list) == 6:  # with bn
                # darknet bn param order: [bias, weight, mean, variance]
                bn_bias = np.fromfile(wf, dtype=np.float32, count=filters)
                model_dict[cur_list[2]].data.copy_(torch.from_numpy(bn_bias))
                bn_weight = np.fromfile(wf, dtype=np.float32, count=filters)
                model_dict[cur_list[1]].data.copy_(torch.from_numpy(bn_weight))
                bn_mean = np.fromfile(wf, dtype=np.float32, count=filters)
                model_dict[cur_list[3]].data.copy_(torch.from_numpy(bn_mean))
                bn_variance = np.fromfile(wf, dtype=np.float32, count=filters)
                model_dict[cur_list[4]].data.copy_(torch.from_numpy(bn_variance))
                # darknet conv param: (out_dim, in_dim, height, width)
                conv_weights = np.fromfile(wf, dtype=np.float32, count=np.prod(conv_shape))
                conv_weights = conv_weights.reshape(conv_shape)
                model_dict[cur_list[0]].data.copy_(torch.from_numpy(conv_weights))
            else:  # without bn: the bias comes first, then the conv weights
                conv_bias = np.fromfile(wf, dtype=np.float32, count=filters)
                model_dict[cur_list[1]].data.copy_(torch.from_numpy(conv_bias))
                conv_weights = np.fromfile(wf, dtype=np.float32, count=np.prod(conv_shape))
                conv_weights = conv_weights.reshape(conv_shape)
                model_dict[cur_list[0]].data.copy_(torch.from_numpy(conv_weights))

            pre_index += num
            # the block after index 57, 65 and 73 is a conv without BN,
            # which has only 2 entries (weight and bias)
            if i in [57, 65, 73]:
                num = 2
            else:
                num = 6
        # every byte of the file should have been consumed
        assert len(wf.read()) == 0, 'failed to read all data'
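
The grouping of 6 keys versus 2 keys used by the loader comes from how PyTorch orders the entries of `state_dict()`. The sketch below illustrates this on two stand-in layers. The layer sizes are arbitrary, and the real model in the repository may name and nest its layers differently.

```python
import torch.nn as nn

# A conv + BN block in the style used throughout DarkNet53 (layer sizes are arbitrary here)
conv_bn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
)
# A detection-head style convolution: no BN, so it carries its own bias
head = nn.Conv2d(32, 255, kernel_size=1)

conv_bn_keys = list(conv_bn.state_dict().keys())
head_keys = list(head.state_dict().keys())
print(conv_bn_keys)
print(head_keys)
```

The first list has 6 entries (conv weight, then BN weight, bias, running mean, running variance and batch counter), and the second has 2 (weight and bias). This matches the indices cur_list[0] to cur_list[4] used in the loader.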

6 Final results

Clone the code:

git clone git@github.com:sgzqc/yolov3_pytorch.git

Run the script:

python3 LoadModel.py

The results are as follows:

Output with an IOU threshold of 0.3:

Output with an IOU threshold of 0.5:

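7 Summary

This post covered the inference-time decoding and the post-processing of the YOLOv3 network, focusing on decoding the prediction heads and on the NMS code used in post-processing. A link to the complete code is also provided.

Did you get all that?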

The complete code is in the repository cloned in section 6.