A 10,000-Word Deep Dive: Understanding the YOLOv3 Algorithm, Its Training Process, and Hands-On Practice Through Code

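This article explains the principles and training process of the YOLOv3 algorithm from a code-first perspective, and shows how to train your own YOLOv3 model built in PyTorch in two ways: with the VisDrone2019 dataset and with a dataset you create yourself. It is a long, painstakingly compiled write-up with purely practical content.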

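Implementation roadmap

Step 1: Building a YOLOv3 object detection platform in PyTorch

Downloading the YOLOv3 model and pre-trained weights

YOLOv3 algorithm principles and implementation approach

I. Prediction

1. The YOLOv3 network architecture and its implementation

2. The Darknet-53 backbone and its outputs (obtaining the 3 initial feature layers)

3. Obtaining predictions from the initial features (the 3 final effective feature layers)

4. Decoding the predictions (decoding the results of the 3 final effective feature layers)

5. Drawing on the original image (rendering the decoded results on the original image)

II. Training

1. Parameters needed to compute the loss

2. What prediction is

3. What target is

4. How the loss is computed

5. Starting training

Step 2: Training your own YOLOv3 model with VisDrone2019

Overall folder structure of the YOLOv3 project

I. Dataset preparation

1. Training on the VisDrone dataset

2. Training on a dataset you create yourself

II. Training and results

3. Starting training

4. Training results

III. Predicting with the trained model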

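YOLOv3 algorithm principles and implementation approach

The YOLOv3 model, pre-trained weights, and dataset can be downloaded by following the author.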

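I. Prediction

1. The YOLOv3 network architecture

As shown in Figure 1:

An input image of arbitrary size is first preprocessed to 416x416x3 and fed into the YOLOv3 model. It first passes through the backbone feature extraction network, Darknet-53, which extracts three initial feature layers used for object detection. The three layers sit at different depths of the backbone: the middle layer P3, the lower-middle layer P4, and the bottom layer P5 (P3 corresponds to the third network module of Darknet counting from the top, starting at 0), as marked by the red boxes in the figure above. Their shapes are (52, 52, 256), (26, 26, 512), and (13, 13, 1024).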

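The sizes 52x52, 26x26, and 13x13 can be visualized as dividing the preprocessed 416x416 image into grids of 52x52, 26x26, and 13x13 cells. These feature maps of different resolutions are used to detect small, medium, and large objects respectively, because the prior boxes (anchors) preset on each feature map differ in size. The 13x13 feature map has the largest receptive field, so the largest prior boxes are applied there to detect large objects, as shown by the blue boxes in the figure below.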
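The relationship between the input size and the three grids can be checked with a few lines of plain Python. This is a minimal sketch: the function names grid_sizes and responsible_cell are made up for illustration, and the strides 8, 16, and 32 are simply 416 divided by 52, 26, and 13.

```python
def grid_sizes(input_size=416, strides=(8, 16, 32)):
    """Return {stride: cells_per_side} for a square input image."""
    sizes = {}
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError("input size must be divisible by every stride")
        sizes[stride] = input_size // stride
    return sizes


def responsible_cell(cx, cy, stride):
    """Return the (column, row) of the grid cell containing pixel (cx, cy)."""
    return int(cx // stride), int(cy // stride)


print(grid_sizes())                    # {8: 52, 16: 26, 32: 13}
print(responsible_cell(200, 120, 32))  # (6, 3) on the 13x13 grid
```

A small stride means many small cells (good for small objects), while stride 32 gives the coarse 13x13 grid used for large objects.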

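After the backbone Darknet-53 produces these three initial feature layers, they need some further processing to yield the three final effective feature layers of the YOLOv3 model, out0 (from P5), out1 (from P4), and out2 (from P3), which are the network's prediction results, as marked by the green boxes in Figure 1.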

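The processing in detail:

P5 first goes through five convolutions, after which the result is used in two ways. First, it passes through one 3x3 Conv2D and one 1x1 Conv2D to produce out0, the effective feature layer for the initial feature layer P5, which is used to detect large objects. Second, the result of the five convolutions goes through another Conv2D and an upsampling step (UpSampling2D) to give (batch_size, 26, 26, 256), which is concatenated with P4.

P4 is concatenated (Concat) with the upsampled P5 result to give (batch_size, 26, 26, 768). This passes through the five convolutions of Conv2D Block 256 to give (batch_size, 26, 26, 256), which is again used in two ways, as above. First, one 3x3 Conv2D and one 1x1 Conv2D produce out1, the effective feature layer for P4, used to detect medium-sized objects. Second, a Conv2D and UpSampling2D prepare it for concatenation with P3.

P3 is concatenated with the upsampled P4 result and passes through the five convolutions of Conv2D Block 128. Only one step remains: one 3x3 Conv2D and one 1x1 Conv2D produce out2, the effective feature layer for P3, used to detect small objects.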
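The channel arithmetic of the three branches can be traced without building the network. The sketch below only tracks (grid size, channels) pairs; trace_head is a hypothetical helper, and the widths 512, 256, and 128 are taken from the description above and from the make_last_layers calls in the yolo3.py listing.

```python
def trace_head(p3=(52, 256), p4=(26, 512), p5=(13, 1024)):
    """Trace (grid_size, channels) through the three detection branches."""
    shapes = {}

    # Branch 0: the five convolutions on P5 end at 512 channels.
    branch0 = (p5[0], 512)
    shapes["branch0"] = branch0

    # 1x1 conv down to 256 channels, 2x upsample, then concat with P4.
    up0 = (branch0[0] * 2, 256)
    if up0[0] != p4[0]:
        raise ValueError("upsampled P5 must match P4 spatially")
    shapes["concat_p4"] = (p4[0], up0[1] + p4[1])

    # Branch 1: five convolutions bring the concat back to 256 channels.
    branch1 = (p4[0], 256)
    shapes["branch1"] = branch1

    # 1x1 conv down to 128 channels, 2x upsample, then concat with P3.
    up1 = (branch1[0] * 2, 128)
    if up1[0] != p3[0]:
        raise ValueError("upsampled P4 must match P3 spatially")
    shapes["concat_p3"] = (p3[0], up1[1] + p3[1])

    # Branch 2: five convolutions end at 128 channels.
    shapes["branch2"] = (p3[0], 128)
    return shapes


for name, shape in trace_head().items():
    print(name, shape)
```

The two concat entries come out as (26, 768) and (52, 384), which is why the second and third branches are built with 512 + 256 and 256 + 128 input channels.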

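2. The Darknet-53 backbone

Compared with the earlier YOLOv1 and YOLOv2, YOLOv3 makes substantial improvements, mainly in the following directions:

1. The backbone is changed to Darknet-53, whose key feature is the use of residual connections (Residual). A residual convolution stage in Darknet-53 first applies a 3x3 convolution with stride 2 and saves its output as layer; it then applies a 1x1 convolution (to reduce the number of channels) and a 3x3 convolution (to increase the number of channels), and adds layer to that result to form the final output. Residual networks are easy to optimize and can gain accuracy from considerably increased depth. The residual blocks use skip connections, which alleviate the vanishing-gradient problem that comes with increasing depth in deep neural networks.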

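Residual block diagram:

The output of an earlier layer skips over several layers and is fed directly into the input of a later layer. This means part of the content of the later feature layer is contributed linearly by an earlier layer. Deep residual networks are designed to overcome the drop in learning efficiency and the stalled accuracy that come with increasing network depth.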

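2. Every convolution in Darknet-53 uses the dedicated DarknetConv2D structure: each convolution applies L2 regularization and is followed by BatchNormalization and LeakyReLU. An ordinary ReLU sets all negative values to zero, whereas Leaky ReLU gives all negative values a non-zero slope. Mathematically, with negative slope $\alpha$ (the listings in this article use nn.LeakyReLU(0.1), i.e. $\alpha = 0.1$):

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$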

The Darknet-53 implementation is shown below. For details, see darknet53.py (which defines the structure of the Darknet-53 backbone).

import math
from collections import OrderedDict

import torch
import torch.nn as nn


# Residual block
class BasicBlock(nn.Module):
    # Build a 1x1 conv that reduces channels and a 3x3 conv that restores them
    def __init__(self, inplanes, planes):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes[0], kernel_size=1,
                               stride=1, padding=0, bias=False)
        self.bn1 = nn.BatchNorm2d(planes[0])
        self.relu1 = nn.LeakyReLU(0.1)
        self.conv2 = nn.Conv2d(planes[0], planes[1], kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes[1])
        self.relu2 = nn.LeakyReLU(0.1)

    # Forward pass of the residual block
    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)
        # Skip connection: add the block input to its output
        out += residual
        return out


# Darknet-53 network structure
class DarkNet(nn.Module):
    def __init__(self, layers):
        super(DarkNet, self).__init__()
        self.inplanes = 32
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(self.inplanes)
        self.relu1 = nn.LeakyReLU(0.1)

        self.layer1 = self._make_layer([32, 64], layers[0])
        self.layer2 = self._make_layer([64, 128], layers[1])
        self.layer3 = self._make_layer([128, 256], layers[2])
        self.layer4 = self._make_layer([256, 512], layers[3])
        self.layer5 = self._make_layer([512, 1024], layers[4])

        self.layers_out_filters = [64, 128, 256, 512, 1024]

        # Weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, planes, blocks):
        layers = []
        # Downsampling: stride 2, kernel size 3
        layers.append(("ds_conv", nn.Conv2d(self.inplanes, planes[1], kernel_size=3,
                                            stride=2, padding=1, bias=False)))
        layers.append(("ds_bn", nn.BatchNorm2d(planes[1])))
        layers.append(("ds_relu", nn.LeakyReLU(0.1)))
        # Append the residual (Darknet) blocks
        self.inplanes = planes[1]
        for i in range(0, blocks):
            layers.append(("residual_{}".format(i), BasicBlock(self.inplanes, planes)))
        return nn.Sequential(OrderedDict(layers))

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)

        x = self.layer1(x)
        x = self.layer2(x)
        # The three initial feature layers P3, P4, P5
        out3 = self.layer3(x)
        out4 = self.layer4(out3)
        out5 = self.layer5(out4)

        return out3, out4, out5


def darknet53(pretrained, **kwargs):
    model = DarkNet([1, 2, 8, 8, 4])
    if pretrained:
        if isinstance(pretrained, str):
            model.load_state_dict(torch.load(pretrained))
        else:
            raise Exception("darknet request a pretrained path. got [{}]".format(pretrained))
    return model

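3. Obtaining predictions from the initial features

1. In the feature extraction stage, YOLOv3 uses Darknet-53 to extract multiple feature layers for object detection. Three initial feature layers are extracted, P3, P4, and P5, located at different depths of the Darknet-53 backbone (the middle, lower-middle, and bottom layers), with shapes (52, 52, 256), (26, 26, 512), and (13, 13, 1024).

2. Each of these three initial feature layers goes through five convolutions and related operations. Part of the result is used to output the prediction for that feature layer (out0, out1, out2), and part is upsampled (UpSampling2D) and combined with another initial feature layer.

3. The output layers (the three final effective feature layers) have shapes (13, 13, 75), (26, 26, 75), and (52, 52, 75). The last dimension is 75 because the figure is based on the VOC dataset, which has 20 classes; YOLOv3 has 3 prior boxes at each position of a feature layer, so the last dimension is 3x25.

With the COCO training set there are 80 classes, so the last dimension is 255 = 3x85 and the three feature layers have shapes (13, 13, 255), (26, 26, 255), and (52, 52, 255).

In practice, because we are using PyTorch, where the channel dimension comes before height and width by default, feeding in N images of 416x416 produces, after many layers of computation, three tensors of shape (N, 255, 13, 13), (N, 255, 26, 26), and (N, 255, 52, 52). They correspond to the positions of the 3 prior boxes at each cell of the 13x13, 26x26, and 52x52 grids.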
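The numbers 75 and 255 follow directly from the number of classes. As a quick check, here is a small sketch (head_channels and output_shapes are illustrative names, not functions from the project) that reproduces the channel count and the channels-first output shapes:

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Channels of one output layer: anchors * (4 box values + 1 confidence + classes)."""
    return anchors_per_scale * (5 + num_classes)


def output_shapes(batch, num_classes, input_size=416):
    """Channels-first shapes of out0, out1, out2 for a square input."""
    channels = head_channels(num_classes)
    return [(batch, channels, input_size // stride, input_size // stride)
            for stride in (32, 16, 8)]


print(head_channels(20))     # 75 for VOC
print(head_channels(80))     # 255 for COCO
print(output_shapes(2, 80))  # [(2, 255, 13, 13), (2, 255, 26, 26), (2, 255, 52, 52)]
```

For a custom dataset, the same formula gives the number of output channels to expect once the class count is known.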

The implementation is shown below. For details, see yolo3.py (which defines the whole YOLOv3 network model).

from collections import OrderedDict

import torch
import torch.nn as nn

from nets.darknet import darknet53


# Conv2d + BatchNorm + LeakyReLU, padded so the spatial size is preserved
def conv2d(filter_in, filter_out, kernel_size):
    pad = (kernel_size - 1) // 2 if kernel_size else 0
    return nn.Sequential(OrderedDict([
        ("conv", nn.Conv2d(filter_in, filter_out, kernel_size=kernel_size, stride=1, padding=pad, bias=False)),
        ("bn", nn.BatchNorm2d(filter_out)),
        ("relu", nn.LeakyReLU(0.1)),
    ]))


# Five alternating 1x1 / 3x3 convolutions, then a 3x3 conv and a final 1x1 prediction conv
def make_last_layers(filters_list, in_filters, out_filter):
    m = nn.ModuleList([
        conv2d(in_filters, filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        nn.Conv2d(filters_list[1], out_filter, kernel_size=1,
                  stride=1, padding=0, bias=True)
    ])
    return m


class YoloBody(nn.Module):
    def __init__(self, config):
        super(YoloBody, self).__init__()
        self.config = config
        # backbone: Darknet-53 extracts the initial features
        self.backbone = darknet53(None)

        out_filters = self.backbone.layers_out_filters
        # last_layer0 (13x13 branch)
        final_out_filter0 = len(config["yolo"]["anchors"][0]) * (5 + config["yolo"]["classes"])
        self.last_layer0 = make_last_layers([512, 1024], out_filters[-1], final_out_filter0)

        # embedding1 (26x26 branch)
        final_out_filter1 = len(config["yolo"]["anchors"][1]) * (5 + config["yolo"]["classes"])
        self.last_layer1_conv = conv2d(512, 256, 1)
        self.last_layer1_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer1 = make_last_layers([256, 512], out_filters[-2] + 256, final_out_filter1)

        # embedding2 (52x52 branch)
        final_out_filter2 = len(config["yolo"]["anchors"][2]) * (5 + config["yolo"]["classes"])
        self.last_layer2_conv = conv2d(256, 128, 1)
        self.last_layer2_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer2 = make_last_layers([128, 256], out_filters[-3] + 128, final_out_filter2)

    def forward(self, x):
        def _branch(last_layer, layer_in):
            for i, e in enumerate(last_layer):
                layer_in = e(layer_in)
                # Keep the result of the first five convolutions for the next branch
                if i == 4:
                    out_branch = layer_in
            return layer_in, out_branch

        # backbone
        x2, x1, x0 = self.backbone(x)
        # yolo branch 0
        out0, out0_branch = _branch(self.last_layer0, x0)

        # yolo branch 1
        x1_in = self.last_layer1_conv(out0_branch)
        x1_in = self.last_layer1_upsample(x1_in)
        x1_in = torch.cat([x1_in, x1], 1)
        out1, out1_branch = _branch(self.last_layer1, x1_in)

        # yolo branch 2
        x2_in = self.last_layer2_conv(out1_branch)
        x2_in = self.last_layer2_upsample(x2_in)
        x2_in = torch.cat([x2_in, x2], 1)
        out2, _ = _branch(self.last_layer2, x2_in)
        return out0, out1, out2

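4. Decoding the predictions and filtering the final boxes

From step 3 we obtain the predictions of the three final effective feature layers, with shapes (N, 255, 13, 13), (N, 255, 26, 26), and (N, 255, 52, 52), corresponding to the positions of 3 predicted boxes at each cell of the 13x13, 26x26, and 52x52 grids.

These predictions do not yet correspond to the positions of the final predicted boxes on the image; they must be decoded first. The network's predictions are used to adjust the preset prior boxes to obtain the final predicted boxes, and this adjustment of the prior boxes is what we call decoding.

In summary: decoding is the process of using the YOLOv3 predictions (the 3 effective feature layers) to adjust the prior boxes; the adjusted prior boxes are the predicted boxes.

A word on how YOLOv3 predicts: its 3 feature layers divide the whole image into 13x13, 26x26, and 52x52 grids, and each grid point is responsible for detecting one region.

Each feature layer's prediction corresponds to the positions of three predicted boxes. For the COCO dataset, we first reshape it, giving (N, 3, 85, 13, 13), (N, 3, 85, 26, 26), and (N, 3, 85, 52, 52).

The 85 is 4 + 1 + 80, representing x_offset, y_offset, w and h, the confidence, and the class scores. For the VOC dataset it is 25.

The YOLOv3 decoding procedure in detail:

In code, we first generate a grid the size of the feature layer, then rescale the prior boxes, which are preset in the 416x416 input image, to the size of the effective feature layer. Finally we take from the network's predictions the centre adjustment parameters x_offset and y_offset and the width and height adjustment parameters w and h, and use them to adjust the prior boxes on the feature layer.

Adding x_offset and y_offset to each grid point gives the centre of the adjusted prior box, that is, the centre of the predicted box. Combining the prior box with w and h gives the width and height of the adjusted prior box, that is, of the predicted box. This yields the full position of the predicted box on the feature layer, which is finally rescaled back to the 416x416 input image.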
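To make the arithmetic concrete, here is a minimal sketch that decodes a single box in plain Python. decode_box is a hypothetical helper, and the anchor (116, 90) is just an example value; the formulas (sigmoid offsets added to the cell index, exponential scaling of the anchor, multiplication by the stride) mirror what the decoding description above says.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(raw, cell, anchor_px, stride):
    """Decode one raw prediction into (cx, cy, w, h) in input-image pixels.

    raw       -- (tx, ty, tw, th) as output by the network
    cell      -- (column, row) of the grid point
    anchor_px -- (width, height) of the prior box in input-image pixels
    stride    -- input size divided by the grid size, e.g. 416 / 13 = 32
    """
    tx, ty, tw, th = raw
    column, row = cell
    # Rescale the prior box to feature-layer units.
    anchor_w = anchor_px[0] / stride
    anchor_h = anchor_px[1] / stride
    # Centre: grid point plus an offset squashed into (0, 1).
    cx = sigmoid(tx) + column
    cy = sigmoid(ty) + row
    # Size: prior box scaled by exp of the raw value.
    w = math.exp(tw) * anchor_w
    h = math.exp(th) * anchor_h
    # Back to the 416x416 input image.
    return cx * stride, cy * stride, w * stride, h * stride


# A raw prediction of all zeros leaves the prior box unchanged and centred in its cell.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 6), anchor_px=(116, 90), stride=32))
# (208.0, 208.0, 116.0, 90.0)
```

Because the offsets pass through a sigmoid, the decoded centre can never leave the cell that predicted it, which is what ties each grid point to its own region.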

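Taking the 13x13 effective feature layer as an example: the left figure visualizes the adjustment of the prior box on the effective feature layer, and the right figure shows the adjusted prior box, i.e. the actual predicted box, drawn on the original image.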

The decoding implementation is shown below. For details, see utils.py (which decodes the YOLOv3 predictions for display).

import torch
import torch.nn as nn


class DecodeBox(nn.Module):
    def __init__(self, anchors, num_classes, img_size):
        super(DecodeBox, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size

    def forward(self, input):
        batch_size = input.size(0)
        input_height = input.size(2)
        input_width = input.size(3)

        # Compute the stride
        stride_h = self.img_size[1] / input_height
        stride_w = self.img_size[0] / input_width
        # Rescale the prior boxes to feature-layer units
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h)
                          for anchor_width, anchor_height in self.anchors]

        # Reshape the prediction to (batch, anchors, height, width, attrs)
        prediction = input.view(batch_size, self.num_anchors,
                                self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        # Adjustment parameters for the prior box centre
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        # Adjustment parameters for the prior box width and height
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height

        # Confidence: whether an object is present
        conf = torch.sigmoid(prediction[..., 4])
        # Per-class confidence
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        # Generate the grid: prior box centres at the top-left corner of each cell
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
            batch_size * self.num_anchors, 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
            batch_size * self.num_anchors, 1, 1).view(y.shape).type(FloatTensor)

        # Generate the prior box widths and heights
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

        # Compute the centre, width, and height of the adjusted prior boxes
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data + grid_x
        pred_boxes[..., 1] = y.data + grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

        # Rescale the output to the 416x416 input image
        _scale = torch.Tensor([stride_w, stride_h] * 2).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) * _scale,
                            conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
        return output.data

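5. Drawing on the original image

Step 4 gives us the positions of the predicted boxes on the original image. The final predictions still need to be sorted by score and filtered with non-maximum suppression (NMS): as the right-hand figure shows, each grid point has 3 prior boxes and therefore 3 predicted boxes after adjustment, so the same object ends up with 3 predicted boxes when drawn on the original image, and we need to pick the most suitable one. In the example figure below, the 3 blue boxes are the predicted boxes, the yellow box is the ground-truth box, and the red cell is the grid cell responsible for predicting the object. We need to filter the 3 adjusted prior boxes (the predicted boxes) at the grid point that detects this object.

This part is common to essentially all object detectors. This project, however, handles it differently from other projects: it performs the filtering separately for each class.

1. For each class, take the boxes and scores whose score is greater than self.obj_threshold.

2. Apply non-maximum suppression using the box positions and scores.

For details, see yolo.py and utils.py.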
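The two filtering steps can be sketched in plain Python. This is not the project's implementation from yolo.py or utils.py; it is a minimal illustration under assumed conventions: each detection is a tuple (x1, y1, x2, y2, score, class_id), and the threshold values 0.5 and 0.3 are placeholders.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_per_class(detections, score_threshold=0.5, iou_threshold=0.3):
    """Keep the best-scoring box per object, handling each class separately."""
    kept = []
    for class_id in sorted({d[5] for d in detections}):
        # Step 1: keep this class's boxes above the score threshold, best first.
        candidates = sorted(
            (d for d in detections if d[5] == class_id and d[4] > score_threshold),
            key=lambda d: d[4], reverse=True)
        # Step 2: repeatedly keep the best box and drop boxes overlapping it too much.
        while candidates:
            best = candidates.pop(0)
            kept.append(best)
            candidates = [d for d in candidates
                          if iou_xyxy(best[:4], d[:4]) <= iou_threshold]
    return kept


boxes = [
    (0, 0, 10, 10, 0.9, 0),    # best box for an object of class 0
    (1, 1, 11, 11, 0.8, 0),    # overlaps the first box heavily, suppressed
    (50, 50, 60, 60, 0.7, 0),  # a different object of class 0, kept
    (0, 0, 10, 10, 0.6, 1),    # same place but another class, kept
    (0, 0, 10, 10, 0.2, 1),    # below the score threshold, dropped
]
for detection in nms_per_class(boxes):
    print(detection)
```

Running NMS per class means two overlapping boxes of different classes do not suppress each other, which matters when objects of different categories sit close together.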

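II. Training

1. Parameters needed to compute the loss

Computing the loss is really a comparison between the network's prediction and the target. prediction is the final output of the YOLOv3 model for an input image, that is, the 3 effective feature layers; every image ends up with 3 of them. target is the annotation data for the labelled images in your training set, i.e. the ground-truth boxes.

2. What prediction is

For the YOLOv3 model, the final network output is the three effective feature layers. Each grid point (feature point) of each layer corresponds to predicted boxes and their classes: the three layers correspond to the image divided into grids of different sizes, with the position, confidence, and class of the three prior boxes at every grid point.

The output layers have shapes (13, 13, 75), (26, 26, 75), and (52, 52, 75). The last dimension is 75 because this is based on the VOC dataset, which has 20 classes. Every feature point (grid point) of every YOLOv3 feature layer has 3 preset prior boxes, and each prior box carries 1 + 4 + 20 = 25 parameters: 1 indicates whether the prior box contains an object, 4 are the box's x, y, w, h parameters, and 20 are the class scores. Each feature point therefore corresponds to 3*25 parameters, so the last dimension is 3x25. With the COCO training set there are 80 classes, the last dimension is 255 = 3x85, and the three feature layers have shapes (13, 13, 255), (26, 26, 255), and (52, 52, 255).

Note: the YOLOv3 prediction obtained here (the 3 effective feature layers), y_prediction, has not been decoded yet. It is the output of the YoloBody class in yolo3.py; only after decoding do the effective feature layers describe boxes on the real image.

3. What target is

target describes the ground-truth boxes in a real image. The first dimension is batch_size, the second is the number of ground-truth boxes in each image, and the third holds the information for each ground-truth box, including its position and class.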
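As an illustration of what one row of target might look like, the sketch below converts a pixel-space corner box into a normalized (cx, cy, w, h, class) tuple. The normalized centre format is an assumption inferred from get_target in the loss code, which multiplies the first four values by the grid size; to_normalized_target is a hypothetical helper, not part of the project.

```python
def to_normalized_target(box_xyxy, class_id, image_w, image_h):
    """Convert (xmin, ymin, xmax, ymax) in pixels to (cx, cy, w, h, class) in [0, 1]."""
    xmin, ymin, xmax, ymax = box_xyxy
    cx = (xmin + xmax) / 2.0 / image_w
    cy = (ymin + ymax) / 2.0 / image_h
    w = (xmax - xmin) / image_w
    h = (ymax - ymin) / image_h
    return cx, cy, w, h, class_id


# One image with a single ground-truth box of class 3.
targets_for_image = [to_normalized_target((104, 52, 312, 260), 3, 416, 416)]
print(targets_for_image)  # [(0.5, 0.375, 0.5, 0.5, 3)]
```

With this layout, multiplying cx and cy by 13, 26, or 52 immediately gives the box centre in grid units for each feature layer.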

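4. How the loss is computed

Once we have pred and target, we cannot simply subtract one from the other; the following steps are needed.

Step 1: decode the YOLOv3 predictions to obtain the adjustments to the prior boxes that the network predicts.

Step 2: process the ground-truth boxes to obtain the adjustments to the prior boxes that the network should produce, i.e. the predictions it ought to make, and compare them with the predictions actually obtained. This is the get_target function in the code:

Determine where the ground-truth box lies in the image and which grid point is responsible for detecting it.

Determine which preset prior box overlaps the ground-truth box the most.

Compute what the grid point would have to predict to reproduce the ground-truth box (use the ground-truth box to work out the adjustment of the preset prior box that this grid point should predict).

Apply the above to all ground-truth boxes.

Obtain the predictions the network should make and compare them with YOLOv3's actual predictions.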
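The sub-steps of Step 2 can be sketched for a single ground-truth box in plain Python. This is a simplified illustration, not the project's get_target: encode_target and shape_iou are hypothetical names, the nine anchors are example values, and they are assumed to be ordered so that indices 0 to 2 belong to the 13x13 layer, 3 to 5 to 26x26, and 6 to 8 to 52x52, matching the index lookup in the loss code.

```python
import math

# Example anchors in input-image pixels (assumed order: largest scale first).
ANCHORS = [(116, 90), (156, 198), (373, 326),
           (30, 61), (62, 45), (59, 119),
           (10, 13), (16, 30), (33, 23)]
ANCHOR_GROUPS = {13: (0, 1, 2), 26: (3, 4, 5), 52: (6, 7, 8)}


def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same top-left corner."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    return inter / (wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter)


def encode_target(box, grid, input_size=416):
    """Encode a normalized (cx, cy, w, h) box for one feature layer.

    Returns None when the best-matching anchor belongs to another layer.
    """
    stride = input_size / grid
    scaled = [(w / stride, h / stride) for w, h in ANCHORS]
    # Ground-truth box in grid units.
    gx, gy, gw, gh = (box[0] * grid, box[1] * grid, box[2] * grid, box[3] * grid)
    # Which grid point is responsible.
    gi, gj = int(gx), int(gy)
    # Which prior box overlaps the ground truth the most.
    ious = [shape_iou((gw, gh), anchor) for anchor in scaled]
    best = max(range(len(scaled)), key=lambda index: ious[index])
    if best not in ANCHOR_GROUPS[grid]:
        return None
    # What the network should predict for that anchor at that grid point.
    return {
        "cell": (gi, gj),
        "anchor": best,
        "tx": gx - gi,
        "ty": gy - gj,
        "tw": math.log(gw / scaled[best][0]),
        "th": math.log(gh / scaled[best][1]),
    }


print(encode_target((0.5, 0.5, 0.3, 0.25), grid=13))
print(encode_target((0.5, 0.5, 0.3, 0.25), grid=52))  # None: handled by the 13x13 layer
```

The encoding is the inverse of decoding: applying the sigmoid-free centre offset and exp(tw), exp(th) to the chosen anchor reproduces the ground-truth box, which is why these values can serve as regression targets.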

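Step 3: ignore predictions at positions that were not assigned a ground-truth box but whose predicted box overlaps a ground-truth box heavily (IoU above ignore_threshold). Such a position has no object assigned to it, so its box position and class outputs carry no meaning for the loss, yet its overlap is too high for it to be penalized as background; it is therefore excluded from the no-object confidence term. This is the get_ignore function in the code.

Step 4: with the adjustments derived from the ground-truth boxes and the adjustments predicted by the network in hand, compare them to compute the loss.

Note that the procedure above is carried out for each of the 3 effective feature layers in turn, because YOLOv3 predicts on 3 effective feature layers. The sum of the losses of the 3 layers is the model's final loss, which is then used for backpropagation and gradient descent. For the code implementing this procedure, see yolo_training.py.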

from random import shuffle
import numpy as np
import torch
import torch.nn as nn
import math
import torch.nn.functional as F
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb
from PIL import Image
from utils.utils import bbox_iou


# Clamp every element of t to the range [t_min, t_max]
def clip_by_tensor(t, t_min, t_max):
    t = t.float()
    result = (t >= t_min).float() * t + (t < t_min).float() * t_min
    result = (result <= t_max).float() * result + (result > t_max).float() * t_max
    return result


def MSELoss(pred, target):
    return (pred - target) ** 2


def BCELoss(pred, target):
    epsilon = 1e-7
    # Clip to avoid log(0)
    pred = clip_by_tensor(pred, epsilon, 1.0 - epsilon)
    output = -target * torch.log(pred) - (1.0 - target) * torch.log(1.0 - pred)
    return output


class YOLOLoss(nn.Module):
    def __init__(self, anchors, num_classes, img_size):
        super(YOLOLoss, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size

        self.ignore_threshold = 0.5
        self.lambda_xy = 1.0
        self.lambda_wh = 1.0
        self.lambda_conf = 1.0
        self.lambda_cls = 1.0

    def forward(self, input, targets=None):
        # Number of images in the batch
        bs = input.size(0)
        # Height of the feature layer
        in_h = input.size(2)
        # Width of the feature layer
        in_w = input.size(3)

        # Compute the stride
        stride_h = self.img_size[1] / in_h
        stride_w = self.img_size[0] / in_w

        # Rescale the prior boxes to feature-layer units
        scaled_anchors = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in self.anchors]
        # Reshape to (batch, anchors per layer, height, width, attrs)
        prediction = input.view(bs, int(self.num_anchors / 3),
                                self.bbox_attrs, in_h, in_w).permute(0, 1, 3, 4, 2).contiguous()

        # Split the prediction into its components
        x = torch.sigmoid(prediction[..., 0])  # Center x
        y = torch.sigmoid(prediction[..., 1])  # Center y
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height
        conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        # Find which prior boxes contain an object
        mask, noobj_mask, tx, ty, tw, th, tconf, tcls, box_loss_scale_x, box_loss_scale_y = \
            self.get_target(targets, scaled_anchors,
                            in_w, in_h,
                            self.ignore_threshold)

        noobj_mask = self.get_ignore(prediction, targets, scaled_anchors, in_w, in_h, noobj_mask)

        # Move the targets to the device the predictions live on
        device = input.device
        box_loss_scale_x = (2 - box_loss_scale_x).to(device)
        box_loss_scale_y = (2 - box_loss_scale_y).to(device)
        # Small boxes get a larger weight than large boxes
        box_loss_scale = box_loss_scale_x * box_loss_scale_y
        mask, noobj_mask = mask.to(device), noobj_mask.to(device)
        tx, ty, tw, th = tx.to(device), ty.to(device), tw.to(device), th.to(device)
        tconf, tcls = tconf.to(device), tcls.to(device)

        # losses.
        loss_x = torch.sum(BCELoss(x, tx) / bs * box_loss_scale * mask)
        loss_y = torch.sum(BCELoss(y, ty) / bs * box_loss_scale * mask)
        loss_w = torch.sum(MSELoss(w, tw) / bs * 0.5 * box_loss_scale * mask)
        loss_h = torch.sum(MSELoss(h, th) / bs * 0.5 * box_loss_scale * mask)

        loss_conf = torch.sum(BCELoss(conf, mask) * mask / bs) + \
                    torch.sum(BCELoss(conf, mask) * noobj_mask / bs)

        loss_cls = torch.sum(BCELoss(pred_cls[mask == 1], tcls[mask == 1]) / bs)

        loss = loss_x * self.lambda_xy + loss_y * self.lambda_xy + \
               loss_w * self.lambda_wh + loss_h * self.lambda_wh + \
               loss_conf * self.lambda_conf + loss_cls * self.lambda_cls
        # print(loss, loss_x.item() + loss_y.item(), loss_w.item() + loss_h.item(),
        #       loss_conf.item(), loss_cls.item(),
        #       torch.sum(mask), torch.sum(noobj_mask))
        return loss, loss_x.item(), loss_y.item(), loss_w.item(), \
               loss_h.item(), loss_conf.item(), loss_cls.item()

    def get_target(self, target, anchors, in_w, in_h, ignore_threshold):
        # Number of images in the batch
        bs = len(target)
        # Indices of the prior boxes that belong to this feature layer
        anchor_index = [[0, 1, 2], [3, 4, 5], [6, 7, 8]][[13, 26, 52].index(in_w)]
        subtract_index = [0, 3, 6][[13, 26, 52].index(in_w)]
        # Create arrays filled with zeros or ones
        mask = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        noobj_mask = torch.ones(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)

        tx = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        ty = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        tw = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        th = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        tconf = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        tcls = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, self.num_classes, requires_grad=False)

        box_loss_scale_x = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        box_loss_scale_y = torch.zeros(bs, int(self.num_anchors / 3), in_h, in_w, requires_grad=False)
        for b in range(bs):
            for t in range(target[b].shape[0]):
                # Position of the ground-truth box on the feature layer
                gx = target[b][t, 0] * in_w
                gy = target[b][t, 1] * in_h
                gw = target[b][t, 2] * in_w
                gh = target[b][t, 3] * in_h

                # Which grid cell it falls into
                gi = int(gx)
                gj = int(gy)

                # Ground-truth box shape, anchored at the origin
                gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)

                # Shapes of all prior boxes, anchored at the origin
                anchor_shapes = torch.FloatTensor(np.concatenate((np.zeros((self.num_anchors, 2)),
                                                                  np.array(anchors)), 1))
                # Compute the overlap
                anch_ious = bbox_iou(gt_box, anchor_shapes)

                # Find the best matching anchor box
                best_n = np.argmax(anch_ious)
                if best_n not in anchor_index:
                    continue
                # Masks
                if (gj < in_h) and (gi < in_w):
                    best_n = best_n - subtract_index
                    # Mark which prior boxes really contain an object
                    noobj_mask[b, best_n, gj, gi] = 0
                    mask[b, best_n, gj, gi] = 1
                    # Centre adjustment parameters for the prior box
                    tx[b, best_n, gj, gi] = gx - gi
                    ty[b, best_n, gj, gi] = gy - gj
                    # Width and height adjustment parameters for the prior box
                    tw[b, best_n, gj, gi] = math.log(gw / anchors[best_n + subtract_index][0])
                    th[b, best_n, gj, gi] = math.log(gh / anchors[best_n + subtract_index][1])
                    # Normalized width and height, used to weight the box loss
                    box_loss_scale_x[b, best_n, gj, gi] = target[b][t, 2]
                    box_loss_scale_y[b, best_n, gj, gi] = target[b][t, 3]
                    # Object confidence
                    tconf[b, best_n, gj, gi] = 1
                    # Class
                    tcls[b, best_n, gj, gi, int(target[b][t, 4])] = 1
                else:
                    print('Step {0} out of bound'.format(b))
                    print('gj: {0}, height: {1} | gi: {2}, width: {3}'.format(gj, in_h, gi, in_w))
                    continue

        return mask, noobj_mask, tx, ty, tw, th, tconf, tcls, box_loss_scale_x, box_loss_scale_y

    def get_ignore(self, prediction, target, scaled_anchors, in_w, in_h, noobj_mask):
        bs = len(target)
        anchor_index = [[0, 1, 2], [3, 4, 5], [6, 7, 8]][[13, 26, 52].index(in_w)]
        scaled_anchors = np.array(scaled_anchors)[anchor_index]
        # print(scaled_anchors)
        # Adjustment parameters for the prior box centre
        x_all = torch.sigmoid(prediction[..., 0])
        y_all = torch.sigmoid(prediction[..., 1])
        # Adjustment parameters for the prior box width and height
        w_all = prediction[..., 2]  # Width
        h_all = prediction[..., 3]  # Height
        for i in range(bs):
            x = x_all[i]
            y = y_all[i]
            w = w_all[i]
            h = h_all[i]

            FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
            LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

            # Generate the grid: prior box centres at the top-left corner of each cell
            grid_x = torch.linspace(0, in_w - 1, in_w).repeat(in_h, 1).repeat(
                int(self.num_anchors / 3), 1, 1).view(x.shape).type(FloatTensor)
            grid_y = torch.linspace(0, in_h - 1, in_h).repeat(in_w, 1).t().repeat(
                int(self.num_anchors / 3), 1, 1).view(y.shape).type(FloatTensor)

            # Generate the prior box widths and heights
            anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
            anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))

            anchor_w = anchor_w.repeat(1, 1, in_h * in_w).view(w.shape)
            anchor_h = anchor_h.repeat(1, 1, in_h * in_w).view(h.shape)

            # Compute the centre, width, and height of the adjusted prior boxes
            pred_boxes = torch.FloatTensor(prediction[0][..., :4].shape)
            pred_boxes[..., 0] = x.data + grid_x
            pred_boxes[..., 1] = y.data + grid_y
            pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
            pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

            pred_boxes = pred_boxes.view(-1, 4)

            for t in range(target[i].shape[0]):
                gx = target[i][t, 0] * in_w
                gy = target[i][t, 1] * in_h
                gw = target[i][t, 2] * in_w
                gh = target[i][t, 3] * in_h
                gt_box = torch.FloatTensor(np.array([gx, gy, gw, gh])).unsqueeze(0)

                anch_ious = bbox_iou(gt_box, pred_boxes, x1y1x2y2=False)
                anch_ious = anch_ious.view(x.size())
                # Predictions that overlap a ground-truth box heavily are not treated as background
                noobj_mask[i][anch_ious > self.ignore_threshold] = 0
                # print(torch.max(anch_ious))
        return noobj_mask
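Two small pieces of the loss are easier to see on scalars than on tensors: the clipped binary cross-entropy and the box weighting factor (2 - w) * (2 - h). The sketch below re-implements both in plain Python for illustration only; clipped_bce and box_loss_scale are names chosen here, and the inputs are made-up values.

```python
import math


def clipped_bce(pred, target, epsilon=1e-7):
    """Binary cross-entropy on scalars, clipping pred away from 0 and 1 to avoid log(0)."""
    pred = min(max(pred, epsilon), 1.0 - epsilon)
    return -target * math.log(pred) - (1.0 - target) * math.log(1.0 - pred)


def box_loss_scale(norm_w, norm_h):
    """Weight for the box loss from the normalized box width and height."""
    return (2.0 - norm_w) * (2.0 - norm_h)


# A confident wrong prediction stays finite thanks to clipping.
print(clipped_bce(1.0, 0.0))
# A correct, confident prediction costs almost nothing.
print(clipped_bce(0.99, 1.0))
# Small boxes are weighted more heavily than large ones.
print(box_loss_scale(0.1, 0.1), box_loss_scale(0.9, 0.9))
```

The weighting compensates for the fact that the same absolute error matters more for a small box than for a large one, so small objects are not drowned out by large ones in the coordinate loss.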

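5. Starting training

Training proper includes loading and preprocessing the dataset (image normalization, conversion of box coordinate formats, rearranging image channels, data augmentation, and so on; see the Generator class in yolotrain.py), and loading the pre-trained weights followed by the model's forward pass, backpropagation, and gradient descent (see train.py).

To obtain the YOLOv3 pre-trained weights and the VisDrone2019 dataset, follow the author.

Training your own YOLOv3 model

Overall folder structure of the YOLOv3 project:

This article trains with the VOC format.

I. Dataset preparation

1. Downloading and training on the VisDrone2019 dataset:

After downloading, place it under the VOCdevkit folder and use the det_to_voc.py script that I put under VOCdevkit to convert the VisDrone dataset to VOC format. The generated xml files can be stored in Annotations, or in a separate xml folder of your own, as long as you set xmlfilepath correctly when running the voc2yolo3.py conversion.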
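To give a rough idea of what such a conversion involves, here is a minimal sketch of the core step: turning one annotation line into VOC-style corner coordinates. It is not the det_to_voc.py script itself. It assumes each VisDrone detection line has the comma-separated layout bbox_left, bbox_top, bbox_width, bbox_height, score, object_category, truncation, occlusion; please verify this against the dataset's own documentation. visdrone_line_to_voc_box is a hypothetical helper name.

```python
def visdrone_line_to_voc_box(line):
    """Convert one annotation line to a VOC-style box dictionary.

    Assumed field order (check against the dataset documentation):
    bbox_left, bbox_top, bbox_width, bbox_height, score, object_category, ...
    """
    # Drop empty fields so a trailing comma does not break parsing.
    fields = [field for field in line.strip().split(",") if field != ""]
    if len(fields) < 6:
        raise ValueError("expected at least 6 comma-separated fields")
    left, top, width, height = (int(value) for value in fields[:4])
    return {
        "xmin": left,
        "ymin": top,
        "xmax": left + width,   # VOC stores corners rather than width and height
        "ymax": top + height,
        "category": int(fields[5]),
    }


print(visdrone_line_to_voc_box("684,8,273,116,0,0,0,0"))
```

A full converter would additionally map the numeric category to a class name and write the boxes into one xml file per image, which is the part the article's own script takes care of.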