Reading Notes on YOLOv3 (PyTorch)

Last time I took notes on the TensorFlow version; this time, because I want to try pruning, I am reading the PyTorch version.

Overview of Source Directories

    ├── cfg                           // Network Definition File
    │   ├── yolov3.cfg
    │   ├── yolov3-spp.cfg
    │   ├── yolov3-tiny.cfg
    ├── data                          // Data Configuration
    │   ├── samples                   // Sample images; detect.py runs detection on these
    │   ├── coco.names                // Names of the 80 COCO detection categories
    │   ├── coco_paper.names          // Names of the original 91 COCO paper categories
    │   ├── coco2014.data             // Train/test path configuration for COCO 2014
    │   └── coco2017.data             // Train/test path configuration for COCO 2017
    ├── utils                         // Folder where the core code is located
    │   ├── __init__.py
    │   ├── adabound.py
    │   ├── datasets.py
    │   ├── google_utils.py
    │   ├── layers.py
    │   ├── parse_config.py
    │   ├── torch_utils.py
    │   └── utils.py
    ├── weights                       // Path to model
    │   ├── yolov3-spp-ultralytics.weights // Original YOLOV3 Model Format
    │   └── yolov3-spp-ultralytics.pt      // PyTorch Model Format 
    ├── detect.py      // Demo code
    ├── models.py      // Core code
    ├── test.py        // Test mAP on a dataset
    ├── train.py       // Model training
    └── tutorial.ipynb // Usage tutorial

Next, we go through the code in terms of data loading, network definition, network training, mAP testing, and so on.


The first step is to run detect.py, the demo detection example, to get familiar with the code's run-time arguments; most of them should be easy to understand from their names. One thing worth emphasizing: --source calls the camera when it is '0', and by default reads the sample images under the 'data/samples' folder.

The entire code is executed in the order of network initialization->model loading->input picture loading->forward inference->NMS post-processing->visual detection results.

Initialize Model

model = Darknet(opt.cfg, imgsz)

This defines the network structure from the network definition file ('cfg/yolov3-spp.cfg' in this post). The design is kept the same as the official Darknet, which is why the PyTorch models produced by this code can be converted to and from the official Darknet models seamlessly.

Here is a network structure diagram of 'cfg/yolov3-spp.cfg':

For the specific implementation, refer to models.py. One interesting detail: by default the code uses thop to report the model's parameter count and FLOPs at a fixed input scale of (1, 3, 480, 640); thop seems to be quite widely used in the PyTorch community.

The imgsz parameter exists mainly for ONNX export, because the ONNX input size is fixed.


dataset = LoadImages(source, img_size=imgsz)

Data preprocessing here includes padded resize and BGR2RGB; refer to utils/datasets.py. Before inference the image still needs the img /= 255.0 normalization; I don't know why the author didn't put this step inside datasets.py.
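The preprocessing steps just described can be sketched as follows. This is a simplified stand-in, not the repo's loader (which lives in utils/datasets.py); the image here is a dummy array assumed to already be padded-resized:

```python
import numpy as np
import torch

img_bgr = np.zeros((512, 512, 3), dtype=np.uint8)   # dummy padded-resized BGR image
img = img_bgr[:, :, ::-1].transpose(2, 0, 1)        # BGR -> RGB, HWC -> CHW
img = np.ascontiguousarray(img)
img = torch.from_numpy(img).float()
img /= 255.0                                        # the normalization done in detect.py
print(img.shape)  # torch.Size([3, 512, 512])
```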


Besides the usual forward inference, the code also defines an augmented-image inference, which interested readers can look into. The main thing to focus on is the inference of the YOLOLayer layer, which is basically the same as in the TensorFlow version.

YOLOv3 predicts outputs at three scales. Take a 512x512 input and the first scale (the smallest feature map) of a model trained on the COCO dataset as an example: the output of the layer before YOLOLayer has dimensions 16 x 16 x 255, where 16 x 16 is the feature map size and 255 = 3 * (4 + 1 + 80). The 3 comes from the three anchors designed for each scale; 4 is the offset of the center point xy plus the width/height offsets; 1 indicates whether a target is present; and 80 is the confidence for each specific category.

In the process, the author transforms the output dimensions of the previous layer of YOLOLayer:

# p.view(bs, 255, 16, 16) -- > (bs, 3, 16, 16, 85)  # (bs, anchors, grid, grid, xywh + classes)
p = p.view(bs, self.na, self.no, self.ny, self.nx).permute(0, 1, 3, 4, 2).contiguous()  # prediction

Then decode the predicted results as described in the paper:

io = p.clone()  # inference output
io[..., :2] = torch.sigmoid(io[..., :2]) + self.grid  # xy
io[..., 2:4] = torch.exp(io[..., 2:4]) * self.anchor_wh  # wh yolo method
io[..., :4] *= self.stride
torch.sigmoid_(io[..., 4:]) # Will change io
return io.view(bs, -1, self.no), p  # view [1, 3, 16, 16, 85] as [1, 768, 85]
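The decode above can be reproduced as a self-contained sketch. The grid, anchor_wh, and stride values here are assumptions for illustration (in models.py they are built from the cfg file); the anchors are the three large-target pairs at stride 32:

```python
import torch

bs, na, no, ny, nx = 1, 3, 85, 16, 16
stride = 32.0                                          # 512 input / 16 grid
p = torch.randn(bs, na, ny, nx, no)                    # permuted prediction

yv, xv = torch.meshgrid(torch.arange(ny), torch.arange(nx), indexing='ij')
grid = torch.stack((xv, yv), 2).view(1, 1, ny, nx, 2).float()
anchor_wh = (torch.tensor([[116., 90.], [156., 198.], [373., 326.]])
             / stride).view(1, na, 1, 1, 2)            # anchors on the grid scale

io = p.clone()
io[..., :2] = torch.sigmoid(io[..., :2]) + grid        # xy: cell offset + grid index
io[..., 2:4] = torch.exp(io[..., 2:4]) * anchor_wh     # wh: scaled anchor
io[..., :4] *= stride                                  # back to input-image pixels
torch.sigmoid_(io[..., 4:])                            # objectness + class confidences
io = io.view(bs, -1, no)                               # (1, 768, 85)
print(io.shape)
```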

For the specific implementation, refer to models.py.


Training is the highlight of this version; compared to the official Darknet there are still many improvements.


# Dataset
dataset = LoadImagesAndLabels(train_path, img_size, batch_size,
                              hyp=hyp,       # augmentation hyperparameters
                              rect=opt.rect  # rectangular training
                              )              # (other arguments omitted)

# Dataloader
batch_size = min(batch_size, len(dataset))
nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers
dataloader = torch.utils.data.DataLoader(dataset,
                                         shuffle=not opt.rect  # Shuffle=True unless rectangular training is used
                                         )                     # (other arguments omitted)

Unlike inference, which just loads and preprocesses images, here a COCO-style dataset is loaded. Take training with coco2017.data as an example: train_path stores the image paths of the training set, and the corresponding label files in txt format are loaded inside LoadImagesAndLabels.

Taking train2017/000000391895.jpg as an example, the label file format is as follows (class id followed by normalized cx, cy, w, h):

3 0.6490546875000001 0.7026527777777778 0.17570312500000002 0.59325
0 0.65128125 0.47923611111111114 0.2404375 0.8353611111111112
0 0.765 0.5468611111111111 0.05612500000000001 0.13361111111111112
1 0.7833203125 0.5577777777777778 0.047859375 0.09716666666666667

The image size (HxW) is 360x640. If the training scale is set to 512, this image needs to be scaled, keeping the aspect ratio, to 288x512 and then padded to 512x512. Of course, the corresponding labels need to be adjusted accordingly, and finally the image is converted with BGR2RGB.
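The scaling arithmetic above can be checked with a short sketch (this is not the repo's letterbox function, just the same math, using the first label row from the example above):

```python
h0, w0, s = 360, 640, 512                 # original HxW, training scale
r = s / max(h0, w0)                       # keep aspect ratio: 512 / 640 = 0.8
h, w = round(h0 * r), round(w0 * r)       # resized: 288 x 512
pad_h, pad_w = (s - h) / 2, (s - w) / 2   # pad to square: 112 px top/bottom

# first label row: class 3, normalized (cx, cy, w, h)
cx, cy, bw, bh = 0.6490546875, 0.7026527777777778, 0.175703125, 0.59325
cx_px = cx * w + pad_w                    # center x on the padded 512x512 canvas
cy_px = cy * h + pad_h                    # center y on the padded 512x512 canvas
print(h, w, pad_h)  # 288 512 112.0
```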

Data augmentation includes:



For the specific implementation, refer to utils/datasets.py.

Because crop-style data augmentation exists, this version differs from the TensorFlow version: anchor matching is not done at the data-loading level but during training.


A notable feature is that the code splits the trainable parameters into three groups: convolution-layer weights in one group, biases in one group, and all other parameters in one group.

pg0, pg1, pg2 = [], [], []  # optimizer parameter groups
for k, v in dict(model.named_parameters()).items():
    if '.bias' in k:
        pg2 += [v]  # biases
    elif 'Conv2d.weight' in k:
        pg1 += [v]  # apply weight_decay
    else:
        pg0 += [v]  # all else

In training, these three parameter groups are given different optimizer settings:

if opt.adam:
    # hyp['lr0'] *= 0.1  # reduce lr (i.e. SGD=5E-3, Adam=5E-4)
    optimizer = optim.Adam(pg0, lr=hyp['lr0'])
    # optimizer = AdaBound(pg0, lr=hyp['lr0'], final_lr=0.1)
else:
    optimizer = optim.SGD(pg0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)
optimizer.add_param_group({'params': pg1, 'weight_decay': hyp['weight_decay']})  # add pg1 with weight_decay
optimizer.add_param_group({'params': pg2})  # add pg2 (biases)

The base learning rate uses a cosine schedule:

lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.95 + 0.05  # cosine
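Evaluating the lambda at a few epochs shows the decay from 1.0 down to the 0.05 floor (epochs = 300 is assumed here for illustration). In train.py this lambda is handed to torch.optim.lr_scheduler.LambdaLR, which multiplies each group's initial lr by lf(epoch):

```python
import math

epochs = 300  # assumed value for illustration
lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.95 + 0.05

for e in (0, epochs // 2, epochs):
    print(e, round(lf(e), 4))   # 0 -> 1.0, 150 -> 0.525, 300 -> 0.05
```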

A warmup (burn-in) training phase is also designed in the code:

nb = len(dataloader)  # number of batches
n_burn = max(3 * nb, 500)
# Burn-in
if ni <= n_burn:
    xi = [0, n_burn]  # x interp
    model.gr = np.interp(ni, xi, [0.0, 1.0])  # giou loss ratio (obj_loss = 1.0 or giou)
    accumulate = max(1, np.interp(ni, xi, [1, 64 / batch_size]).round())
    for j, x in enumerate(optimizer.param_groups): # pg0 pg1 pg2
        # bias lr falls from 0.1 to lr0, all other lrs rise from 0.0 to lr0
        x['lr'] = np.interp(ni, xi, [0.1 if j == 2 else 0.0, x['initial_lr'] * lf(epoch)])
        x['weight_decay'] = np.interp(ni, xi, [0.0, hyp['weight_decay'] if j == 1 else 0.0])
        if 'momentum' in x:
            x['momentum'] = np.interp(ni, xi, [0.9, hyp['momentum']])
(The original post shows plots of the learning rate, weight decay, momentum, and accumulate curves during burn-in.)

It is worth noting that, warmup aside, the optimizer updates the model once every accumulate batches, i.e. the effective batch size of training is 64:

accumulate = max(round(64 / batch_size), 1)
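A minimal sketch of this accumulation scheme (toy model and data, not the repo's training loop): loss.backward() runs every batch, but optimizer.step() only every accumulate batches, so gradients sum up to an effective batch of 64 images:

```python
import torch

batch_size = 16
accumulate = max(round(64 / batch_size), 1)        # 4 batches per optimizer step

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

steps = 0
for i in range(8):                                 # stand-in for the dataloader
    x, y = torch.randn(batch_size, 10), torch.randn(batch_size, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                # gradients add up across batches
    if (i + 1) % accumulate == 0:
        optimizer.step()                           # effective batch = 16 * 4 = 64
        optimizer.zero_grad()
        steps += 1
print(steps)  # 2
```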


Like other versions, this code supports multi-scale training: the training scale is updated once every effective batch of bs = batch_size * accumulate images.
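A sketch of what such a scale update might look like; the scale range of 320 to 640 is an assumption here (the repo derives its range from the configured img_size), and the chosen size must be a multiple of the network stride:

```python
import random

gs = 32                                    # grid stride: input must be a multiple of 32
grid_min, grid_max = 320 // gs, 640 // gs  # assumed scale range 320..640
img_size = random.randrange(grid_min, grid_max + 1) * gs
print(img_size)                            # one of 320, 352, ..., 608, 640
```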

Training also uses a model exponential moving average (EMA) mechanism:

ema = torch_utils.ModelEMA(model)
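A simplified EMA update for intuition: after each optimizer step, each EMA parameter is blended toward the live model's parameter. The repo's torch_utils.ModelEMA is more elaborate (its decay ramps up with the number of updates); this sketch uses a fixed decay:

```python
import copy
import torch

def ema_update(ema_model, model, decay=0.999):
    """Blend each EMA parameter toward the live model's parameter."""
    with torch.no_grad():
        for e, m in zip(ema_model.parameters(), model.parameters()):
            e.mul_(decay).add_(m, alpha=1.0 - decay)

model = torch.nn.Linear(1, 1, bias=False)
torch.nn.init.constant_(model.weight, 1.0)
ema_model = copy.deepcopy(model)
torch.nn.init.constant_(ema_model.weight, 0.0)

ema_update(ema_model, model, decay=0.9)    # 0.9 * 0.0 + 0.1 * 1.0 ≈ 0.1
print(ema_model.weight.item())
```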

Besides some predefined hyperparameters, two things worth noting here are model.gr and model.class_weights, along with the predefined cls, cls_pw, obj, obj_pw: these are weights prepared for computing the loss.

model.gr = 1.0  # giou loss ratio (obj_loss = 1.0 or giou)
model.class_weights = labels_to_class_weights(dataset.labels, nc).to(device)  # attach class weights


loss, loss_items = compute_loss(pred, targets, model)

This part is defined in the utils/utils.py file.

First, anchors and GT boxes are matched; the matching method is center-point matching. There is a hidden risk: a GT may end up without any matching anchor. Other versions find the best-matching anchor for such a GT; here there is no such fallback, and the GT is simply discarded.

Match function build_targets:

def build_targets(p, targets, model):
    # Build targets for compute_loss(), input targets(image,class,x,y,w,h)
    nt = targets.shape[0]
    tcls, tbox, indices, anch = [], [], [], []
    gain = torch.ones(6, device=targets.device)  # normalized to gridspace gain
    off = torch.tensor([[1, 0], [0, 1], [-1, 0], [0, -1]], device=targets.device).float()  # overlap offsets

    style = None
    multi_gpu = type(model) in (nn.parallel.DataParallel, nn.parallel.DistributedDataParallel)
    for i, j in enumerate(model.yolo_layers): # Output at each scale
        anchors = model.module.module_list[j].anchor_vec if multi_gpu else model.module_list[j].anchor_vec
        gain[2:] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain
        na = anchors.shape[0]  # number of anchors
        at = torch.arange(na).view(na, 1).repeat(1, nt)  # anchor tensor, same as .repeat_interleave(nt)

        # Match targets to anchors
        a, t, offsets = [], targets * gain, 0
        if nt:
            # r = t[None, :, 4:6] / anchors[:, None]  # wh ratio
            # j = torch.max(r, 1. / r).max(2)[0] < model.hyp['anchor_t']  # compare
            j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n) = wh_iou(anchors(3,2), gwh(n,2))
            a, t = at[j], t.repeat(na, 1, 1)[j]  # filter

            # overlaps
            gxy = t[:, 2:4]  # grid xy
            z = torch.zeros_like(gxy)
            if style == 'rect2':
                g = 0.2  # offset
                j, k = ((gxy % 1. < g) & (gxy > 1.)).T
                a, t = torch.cat((a, a[j], a[k]), 0), torch.cat((t, t[j], t[k]), 0)
                offsets = torch.cat((z, z[j] + off[0], z[k] + off[1]), 0) * g

            elif style == 'rect4':
                g = 0.5  # offset
                j, k = ((gxy % 1. < g) & (gxy > 1.)).T
                l, m = ((gxy % 1. > (1 - g)) & (gxy < (gain[[2, 3]] - 1.))).T
                a, t = torch.cat((a, a[j], a[k], a[l], a[m]), 0), torch.cat((t, t[j], t[k], t[l], t[m]), 0)
                offsets = torch.cat((z, z[j] + off[0], z[k] + off[1], z[l] + off[2], z[m] + off[3]), 0) * g

        # Define
        b, c = t[:, :2].long().T  # image, class
        gxy = t[:, 2:4]  # grid xy
        gwh = t[:, 4:6]  # grid wh
        gij = (gxy - offsets).long()
        gi, gj = gij.T  # grid xy indices

        # Append
        indices.append((b, a, gj, gi))  # image, anchor, grid indices
        tbox.append(torch.cat((gxy - gij, gwh), 1))  # box
        anch.append(anchors[a])  # anchors
        tcls.append(c)  # class
        if c.shape[0]:  # if any targets
            assert c.max() < model.nc, 'Model accepts %g classes labeled from 0-%g, however you labelled a class %g. ' \
                                       'See https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data' % (
                                           model.nc, model.nc - 1, c.max())

    return tcls, tbox, indices, anch

The matching function returns four variables: tcls, tbox, indices, anch. Suppose the example GT above is matched against the anchors of the first scale:

The anchors are predefined as follows; the last three pairs belong to the first scale (the smallest feature map, used to detect large targets):

anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326

1. Map this scale's Anchor to this featuremap scale to get anchors

2. Map GT to this featuremap scale to get t

3. Center point pattern matching ($Anchor_{num} \times GT_{num}$)
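The anchor-GT filtering in build_targets above uses a wh-only IoU (the j = wh_iou(...) > iou_t line). The steps can be sketched self-contained; the GT size and the 0.2 threshold are assumptions for illustration:

```python
import torch

def wh_iou(wh1, wh2):
    """IoU of boxes that share a center, computed from widths and heights only."""
    wh1, wh2 = wh1[:, None], wh2[None]        # (na, 1, 2), (1, nt, 2)
    inter = torch.min(wh1, wh2).prod(2)       # overlap area with shared center
    return inter / (wh1.prod(2) + wh2.prod(2) - inter)

stride = 32.0                                  # first scale (smallest feature map)
anchors = torch.tensor([[116., 90.], [156., 198.], [373., 326.]]) / stride
gwh = torch.tensor([[4.0, 3.0]])               # an assumed GT wh on the grid scale

iou = wh_iou(anchors, gwh)                     # shape (3, 1): anchors x GTs
matched = iou > 0.2                            # stand-in for model.hyp['iou_t']
print(matched.squeeze(1).tolist())             # [True, True, False]
```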

In this example, the matched anchor indices are a = [0, 0, 1, 1, 2] and the matched GTs are in t; b and c are, respectively, the within-batch image index and the category id of each matched GT.


What gij stores are the grid-cell coordinates of the matched GT centers; gi and gj are center_x and center_y.


The four returned variables are built as follows:

tcls.append(c)  # class
tbox.append(torch.cat((gxy - gij, gwh), 1))  # box
indices.append((b, a, gj, gi))  # image, anchor, grid indices
anch.append(anchors[a])  # anchors

Here you can see that indices and anch can recover the location of each matched anchor, while indices and tbox can recover the location of each matched GT.

With the matching results we can determine which samples are positive and which are negative, and what the regression target of each positive sample is; then the classification and regression losses can be computed.

The objectness loss lobj and the classification loss lcls use BCEWithLogitsLoss; if the hyperparameter fl_gamma is non-zero, FocalLoss is used instead. In particular, the BCE losses support the two positive-sample weights cls_pw (1.0) and obj_pw (1.0). The box regression loss uses GIoU Loss. As it happens, the CIoU paper has been open-sourced on github, and you can see this code's author asking about implementation details of CIoU Loss in the issues; I expect the author will add CIoU Loss later.

In addition, the author weights the objectness target of positive samples by a giou ratio: the larger the GIoU, the closer the target label is to 1 (originally a fixed value of 1). This seems designed to give a weaker supervision signal to poorly matched GTs and reduce training difficulty.

tobj[b, a, gj, gi] = (1.0 - model.gr) + model.gr * giou.detach().clamp(0).type(tobj.dtype)  # giou ratio smooth
lobj += BCEobj(pi[..., 4], tobj)  # obj loss

Finally, the three losses are weighted by giou (3.54), obj (64.3), and cls (37.4); I'll look into how these values were chosen later.

So the final detection loss = lbox + lobj + lcls. If you have used the TensorFlow version, you will find much of the code identical; the improvements are these weights and the giou ratio.
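Putting the weighting together as plain arithmetic: only the giou/obj/cls weights below come from the post; the per-component loss values are made up for illustration:

```python
hyp = {'giou': 3.54, 'obj': 64.3, 'cls': 37.4}   # loss weights from the hyp dict
lbox, lobj, lcls = 0.05, 0.40, 0.30              # illustrative per-component losses

loss = hyp['giou'] * lbox + hyp['obj'] * lobj + hyp['cls'] * lcls
print(round(loss, 3))  # 37.117
```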

Looking back, you will find that model.class_weights is not actually used in this loss-weighting code.

That's about it for the training section; I'll fill in the remaining gaps when I have time.

Evaluate COCO

Normally this part is just a tool call, so I haven't looked at it carefully. In any case, if you want to compare the performance of models trained by different frameworks or codebases, pick one evaluation tool and prepare the data in the corresponding format.


That's all for a first pass over the code. There are still some gaps, to be filled in later. The gaps include:

1. Data augmentation

2. Basis for setting weight values

3. Test the HAMBox idea: the effect of replacing the original anchors with regressed anchors for matching

4. Replace GIoU Loss with CIoU Loss

Next, I'll take a model trained with this code and try pruning it.


Posted on Wed, 03 Jun 2020 16:55:22 -0700 by CtrlAltDel