This article explains MTCNN, one of the classic neural networks in deep learning, in detail, analyzing it from both the theoretical and the practical side.
Catalog:
 Basic chatting
 Face recognition
 Theoretical analysis of MTCNN
 Detailed analysis of project code
1, Basic chat
1. identification:
(1) Digit recognition: the ideal case. The images are the same size and there is little interference (noise);
(2) Face recognition: the real-world case.
2. Video recognition.
24 frames per second, i.e. 24 images per second.
3. There is an upper limit on the number of people a face punch-in (attendance) device can identify.
Fewer: 50–70; more: 100–200.
4. Acquaintance identification.
Train stations, access control, etc. At present, acquaintance recognition only achieves a little over 80% accuracy.
5. Stranger recognition.
High value.
6. companies.
Kuangshi (Megvii), Shangtang (SenseTime).
7.IOU
Key points and difficulties.
8.NMS
Key points and difficulties.
9. Back-calculation from the feature map
Key points and difficulties.
2, Face recognition
(1) Face detection
Locate the faces in the image.
(2) Feature extraction
Crop out the face region, feed it into a neural network, and extract features to obtain a feature vector.
(3) Face comparison
The feature vector is compared against the face features in an existing registry, using cosine similarity.
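The cosine comparison can be sketched minimally in NumPy (the registry lookup itself is omitted, and the function name is my own):

```python
import numpy as np

def cosine_similarity(a, b):
    # similarity between two face feature vectors: 1.0 means identical direction
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# identical directions -> 1.0; orthogonal vectors -> 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
```

In practice the query vector would be compared against every registered vector and the best match above a threshold accepted.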
Note: among them, face detection is the most important.
3, Theoretical analysis of MTCNN
1. History of neural networks
2. Development of detection networks
(1) RCNN variants
RCNN –> Fast RCNN –> Faster RCNN –> YOLO (v1, v2, v3)
YOLO v2 –> YOLO9000 (can recognize 9000 classes of objects)
YOLO –> SSD
3. characteristics
4. Cascade
Decompose the task and connect the stages in series.
5. R & D Institute
Mr. Qiao Yu, Shenzhen Institute of advanced technology, Chinese Academy of Sciences
6. Losses and models
(1) Loss:
The most important part of a neural network project is the loss. Once the loss is settled, 90% of the project's problems are basically solved: the loss is the ultimate statement of the problem being solved. Valuable papers study loss design; papers that only tweak the model are worth less.
(2) Model:
Improve network accuracy.
7. Image tracking
(1) Single target tracking
There is only one target to find in an image. There are two representations for single-target tracking:

Find the four coordinate values of the target region's upper-left and lower-right corners on the image.
It is simple, easy to implement, and used by most networks. The network outputs four values (upper-left and lower-right corners), and the sample labels are made the same way.

Find the center point, width, and height of the target region on the image.
Disadvantages: the center point strongly influences the box, and computing it costs more: you must first find the upper-left and lower-right corners, then derive the center.
Advantage: when the center point lies outside the picture, part of the target may still be in frame (e.g. half a cat). Such cases need special handling in the samples, but in practice half-visible targets are rare.

(2) Multitarget tracking

Three-target tracking

Solution
Output three sets of values; each set (four coordinates) represents one target, and the three sets box the three targets.

Existing problem
All the boxes converge on a single target; see Figure 1 below. Like apples: given three apples, all three people grab the biggest one.

This problem should not be solved from the label side. Consider a street scene full of buildings, vehicles, and other background: people are identified because they have human characteristics. In the same way, for a neural network to count the people in a picture and locate them, it relies on the features people have in common. So, for multi-target tracking, the simplification is to make the network learn only whether something is a person, i.e. train it to extract human features. We therefore do not train the network on a crowd; we train it on one person at a time. In the output, a confidence value (0–1) indicates whether it is a person, a binary classification problem, and the other four values are coordinates. When there is a face, the confidence is close to 1 and the coordinate outputs are meaningful; when the confidence is close to 0, the coordinates are meaningless and the network outputs four zeros.
The three boxes all go to the same target because nothing ties each box to a distinct target. Thinking of it as taking apples: if the three people line up and each takes one apple in turn, the problem disappears. Likewise, sort the boxes: once a target has been framed, it cannot be framed again, and only the remaining targets can be chosen.

Multiple target tracking
 Solution thought
When the problem grows to 10 or more targets, solve it with a loop. First design and train a network that outputs five values: one confidence and four coordinate values. Then, in use, once one face is framed, keep looping to frame the remaining faces.
In use, start scanning from the upper-left corner, as shown in Figure 2 below. This scanning resembles convolution, but it may split a face into two parts, as shown in Figure 3; the problem is the step size. The solution is to offset from the first position, as shown in Figures 4 and 5: the step size is itself a kind of offset, and it should be fairly small. A small step means one face will be framed many times, as shown in Figure 6; that is resolved further below. So far the box is fixed, but some faces are larger, as shown in Figure 7. There are two remedies: multiple suggestion boxes and the image pyramid. Multiple suggestion boxes: scan with a group of boxes (Figure 8), each shape in three sizes, nine boxes in total (YOLO uses multiple suggestion boxes); a square boxes a face, a vertical box an electric pole, and so on. Image pyramid: keep the box fixed and scale the image, stopping when the image shrinks to the size of the box; the scaling code is a while loop.
Scan the image from left to right –> do not use too large a step size –> MTCNN uses the image pyramid to handle face size –> with a small step, one face is framed by many boxes –> use NMS to resolve this, keeping the box with the highest confidence.
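The image-pyramid while loop described above can be sketched like this (the scale factor 0.709 is a common choice in MTCNN implementations, assumed here; the stop condition is the 12-pixel minimum face size):

```python
def pyramid_scales(w, h, min_side=12, factor=0.709):
    # shrink the image until its shorter side drops below the 12x12 window
    scales = []
    scale = 1.0
    while min(w, h) * scale >= min_side:
        scales.append(scale)
        scale *= factor
    return scales

print(pyramid_scales(24, 24))  # [1.0, 0.709, ~0.503]: three pyramid levels
```

Each returned scale produces one resized image that the fixed 12x12 window then scans.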
 Reminder
A 30 x 30 face can be recognized; this depends on how the samples are made. See Figure 9.
As shown in Figure 10, PS (photoshopped) cases are excluded.
MTCNN suggests a face with a minimum frame size of 12x12 (the lower limit); as shown in Figure 11, the 12x12 face is magnified 2850 times for display.
MTCNN is more suitable for face.
Using the trained network for recognition, the confidence and the four coordinate values are the basis of recognition. The image region is fed into the network; the pyramid method scales the original image for face recognition, and faces are framed, each time cutting a part of the original image and feeding it to the network. When scaling, scale according to the shorter side (scaling by the longer side may leave the shorter side indivisible); the scaling trick is to track the shorter side. The scaled result is shown in Figure 12; the translation step size is 2, and scaling stops when the shorter side of the scaled image reaches 12.
It is easy to train and difficult to use.
Figure 9: 30x30 face recognition. Figure 10: a large frame nesting a small frame. Figure 11: a 12x12 face magnified 2850 times. Figure 12: the zoom technique.
8. IOU
Overlap algorithm.

Calculates the overlap of the two boxes.

Intersection / Union.
(1) Purpose:
Framing.
(2) Function:
Determines whether boxes form one pile, as shown in Figure 13. When the IOU is 0, they are not one pile.
Figure 13: IOU
(3) Intersection calculation:
The simple case is the left figure in Figure 14: the corner coordinates can be used directly. The difficulty is the intersection on the right side of Figure 14: first compute the coordinates of the intersection's corners, then the intersection area.
Find a general method: Figure 15, as follows:

Coordinates of the upper left corner of the intersection: the upper left corner X and Y in the two original boxes take the larger value respectively;

Coordinates of the lower right corner of the intersection: the lower right corner X and Y of the two original intersecting boxes take the smaller values respectively.
(4) Union calculation:
Add the two rectangle areas and subtract the intersection area. That is, to compute the union you must first compute the intersection (as shown in Figure 14): from the x and y of the upper-left and lower-right corners, compute each area.
(5) Usage scenario:
The P and R network results use IOU against the union, because these two networks have low accuracy; the result of keeping both the large frame and the small frame is shown in Figure 16.
(6) Algorithm implementation theory:
Compare a box with a bunch of boxes.
How to calculate the areas of a pile of boxes?
A: as shown in Figure 17, use (third column minus first column) * (fourth column minus second column). A matrix computes this quickly.
How to retrieve the column data?
A: slicing: (boxes[:,2] - boxes[:,0]) * (boxes[:,3] - boxes[:,1])
Figure 17: area calculation for a pile of boxes
(7) Code:

```python
import numpy as np

"""IOU"""
def iou(box, boxes, isMin=False):
    # box format: [x1, y1, x2, y2, c]; compare one box with a pile of boxes.
    # isMin=False divides by the union; isMin=True divides by the minimum area.
    # Area of one box: (x2 - x1) * (y2 - y1), indexing out the coordinates.
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    # The pile's format is [[x1, y1, x2, y2, c], ...]
    boxes_area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    """Calculate the intersection area"""
    # Upper-left corner of the intersection: the larger of the two boxes' x and y;
    # lower-right corner: the smaller of the two boxes' x and y.
    xx1 = np.maximum(box[0], boxes[:, 0])  # upper-left x
    yy1 = np.maximum(box[1], boxes[:, 1])  # upper-left y
    xx2 = np.minimum(box[2], boxes[:, 2])  # lower-right x
    yy2 = np.minimum(box[3], boxes[:, 3])  # lower-right y

    # Judge whether there is an intersection: when xx2 - xx1 (or yy2 - yy1)
    # is negative there is none, so clamp to 0 with the maximum function.
    w = np.maximum(0, xx2 - xx1)
    h = np.maximum(0, yy2 - yy1)
    inter = w * h  # intersection area

    if isMin:
        # Divide by the minimum area: the smaller of box_area and boxes_area.
        over = np.true_divide(inter, np.minimum(box_area, boxes_area))
    else:
        # Divide by the union: sum of the two areas minus the intersection.
        over = np.true_divide(inter, box_area + boxes_area - inter)
    return over
```
9. threshold
When a large frame nests a small frame with some overlap, the IOU is small. Figure 18: set a threshold, e.g. 0.3. When the IOU is greater than 0.3, the boxes count as one pile; when it is less than 0.3, they count as two piles.
(1) Purpose
Continue framing.
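A tiny worked example of the nested-box case: dividing by the union keeps the IOU below a 0.3 threshold, while dividing by the minimum area flags the pair as one pile (the numbers are illustrative):

```python
# a large box fully containing a small one
big_area, small_area = 10 * 10, 4 * 4
inter = small_area  # the small box lies entirely inside the big one

iou_union = inter / (big_area + small_area - inter)  # 16/100 = 0.16, below 0.3
iou_min = inter / min(big_area, small_area)          # 16/16 = 1.0, clearly one pile
print(iou_union, iou_min)
```

This is why the minimum-area variant exists: the union-based IOU would wrongly treat the nested pair as two separate piles.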
Figure 18: threshold
10. NMS
(1) Purpose:
Remove the extra boxes.
(2) Thought
See Figure 19. First, sort by confidence; then compare the maximum against the rest by IOU. The IOU between 0.98 and 0.83 is large, so 0.83, which boxes the same object, is deleted; the IOU between 0.98 and 0.81 is 0, so they are two objects and both are kept; likewise 0.67 is kept against 0.98. Next, 0.81 is compared with 0.67, and so on. The final result is shown in Figure 20.
For example: 0.98 0.83 0.81 0.75 0.67
NMS is done on each diagram. Because of the pyramid, many boxes are reserved after NMS is used.
Figure 19: NMS. Figure 20: final results.
(3) NMS algorithm code:
Sort the pile of boxes by confidence;
take out the first box (when the pile's first dimension is less than or equal to 1, retrieval is complete);
save the first box;
keep the remaining boxes;
compare by IOU.
"""NMS""" def nms(boxes,thresh=0.3,isMin=False):#All boxes, thresholds, minimum area are required (pass to IOU, because IOU is calculated in NMS) #Sort according to the confidence level from large to small. _boxes=boxes[(boxes[:,4]).argsort()] #Get a bunch of boxes sorted by confidence #The format of the box is defined as: [[X1,Y1,X2,Y2,C], [], [], [], [],...]. #Keep remaining boxes r_boxes=[] #Remove the first box. Because it's going to take many times, use the loop. (key) while _boxes.shape[0]>1:#The first frame (shape[0]) is retrieved circularly. When the dimension retrieved during the cycle is greater than 1, it indicates that there is a frame; when the dimension is less than 1, it indicates that the frame has been retrieved and the cycle is over. #Take out the first box a_box=_boxes[0] #Remove the remaining boxes b_boxes=_boxes[1:] #Keep first box r_boxes.append(a_box) #After comparing IOU, keep the smaller value of threshold index=np.where(iou(a_box,b_boxes,isMin)<thresh)#Comparing iou with threshold value: iou (abox, Bboxes, ismin) < thresh, if iou is less than threshold value, keep it. Use np.where, when less than True. _boxes=b_boxes[index] #Save results if _boxes.shape[0]>0: r_boxes.append(_boxes[0]) #Assemble as matrix return np.stack(r_boxes)
11. Activation function for the coordinate values
Softmax:
The coordinate outputs cannot use Softmax. Its range does not fit (coordinates can be greater than 1);
Softmax is exclusive: its outputs form a probability distribution summing to 1, so the outputs are coupled. The network's four coordinate values (or the center point plus width and height) have no relationship to one another.
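A quick check of the exclusivity point: Softmax outputs always sum to 1, so four values passed through it could never act as independent coordinates (a minimal sketch; the input values are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

coords = np.array([15.0, 20.0, 115.0, 140.0])  # hypothetical box values
s = softmax(coords)
print(s.sum())  # always 1.0: the four outputs are coupled
```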
12. Programming method
Matrix (parallel) is used instead of for loop (serial) to improve the calculation speed. For example, a matrix can calculate the area of all suggestion boxes at once.
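For example, the areas of all suggestion boxes in one vectorized expression instead of a Python for loop (a minimal sketch with illustrative boxes):

```python
import numpy as np

boxes = np.array([[0, 0, 2, 2],
                  [1, 1, 4, 5],
                  [0, 0, 10, 10]], dtype=np.float64)  # [x1, y1, x2, y2] per row
# one matrix operation computes every area at once
areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
print(areas)  # [  4.  12. 100.]
```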
13. Back-calculation from the feature map
(1) Basic ideas
When MTCNN does face recognition, the network itself is used as a convolution kernel to scan the image: the network's input size corresponds to the regions examined one by one for faces. Such a region is called a suggestion box. Whether a suggestion box contains a face is judged from 5 values: 1 confidence and 4 coordinate values. When the confidence is close to 1, the suggestion box contains a face.
When moving the suggestion boxes one by one, they must overlap so that no face-bearing region is missed, which requires setting a step size. With a step size, the suggestion boxes end up enormous in number (not to mention the extra boxes introduced by the pyramid). To end up with one suggestion box per face, IOU and NMS are used to remove most of the low-confidence suggestion boxes.
In that computation, IOU divides the intersection of two suggestion boxes by their union. The intersection is computed directly from the upper-left and lower-right corner coordinates of the overlapping region, as detailed above; the union is the sum of the two boxes' areas minus the intersection area. To compute an area, you must first know the box's upper-left and lower-right corners. Next, then, we use the feature map computed by the network to solve for the suggestion boxes' upper-left and lower-right corner coordinates in the original image.
(2) Calculation of intermediate convolution once

Ideal situation.
If the original image is convolved into a 2x2 feature map, with a 4x4 kernel and stride 3, how do we back-calculate positions on the original image?
Solution:
As shown in Figure 25 below, (a) is the original image and (b) is the feature map obtained after convolution, with index values marked in the image. For index (0,1), the corresponding upper-left corner on the original image is (3,0), where 3 is the stride; the lower-right corner is (7,4), where 7 is stride 3 + kernel size 4, and 4 is the kernel size.
Original image (a) ![insert picture description here](https://img-blog.csdnimg.cn/20200211204426632.png#pic_center) Index diagram (b) Figure 25: the ideal case
In the back-calculation, a position on the original picture is written as coordinates (x, y): x corresponds to the picture's w and y to its h, i.e. the picture format is WH. However, the convolution result is in NCHW format, so coordinates obtained by direct back-calculation must be converted, that is, the index must be converted to a position. Solution:
Upper left corner coordinate: index x step. For example, index (0,0) reversely solves the coordinates of the upper left corner of the original drawing as: (0,0) * 3 = (0,0); index (1,0) reversely solves the coordinates of the upper left corner of the original drawing as: (1,0) * 3 = (3,0); index (0,1) reversely solves the coordinates of the upper left corner of the original drawing as (0,3); index (1,1) reversely solves the coordinates of the upper left corner of the original drawing as (3,3).
Lower-right corner coordinates: index x stride + kernel size. For example, index (0,0) back-solves the lower-right corner on the original image as (0,0) * 3 + 4 = (4,4); index (1,0) gives (7,4); index (0,1) gives (4,7); index (1,1) gives (7,7).
Note: if there is scale, divide the two results by the scale.
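The two rules above (upper-left = index x stride, lower-right = index x stride + kernel, then divide by the scale) can be sketched as a small helper (the function name is my own):

```python
def index_to_box(ix, iy, stride=3, kernel=4, scale=1.0):
    # map a feature-map index (ix, iy) back to a box on the original image
    x1 = ix * stride / scale
    y1 = iy * stride / scale
    x2 = (ix * stride + kernel) / scale
    y2 = (iy * stride + kernel) / scale
    return x1, y1, x2, y2

print(index_to_box(1, 0))  # (3.0, 0.0, 7.0, 4.0), matching the worked example
```

With a pyramid scale of 2, the same index maps to a box half the size on the scaled image's original, i.e. every value is divided by 2.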
Original image (a) ![insert picture description here](https://img-blog.csdnimg.cn/20200211204501814.png#pic_center) Index diagram (b) Figure 26: the actual case
(3) Convolution applied multiple times
Idea:
Treat many convolutions as one convolution (e.g. two 3x3 convolutions replace one 5x5 convolution), i.e. regard the many layers of the network as one large convolution kernel.
The large kernel's size equals the network's input image size.
The large kernel's stride equals the product of the individual small kernels' strides.
(4) Application of feature-map back-calculation
Use the network's convolution output to compute the suggestion boxes' sizes and positions.
Figure 27: schematic of the back-calculation
14. Network structure
The P, R, and O networks are like HR, the technical interviewer, and the supervisor in a real interview process.
The P network processes each sample fastest: it is a small network with low precision and loose standards (it makes only a few cursory judgments: you seem OK, clear-headed, in good health, in short a plausible candidate), yet in practical use it takes the longest wall-clock time, because it processes the largest amount of data. The R network has higher precision and is slower, like the technical round, where the technology itself is hard and must be examined per person. The O network takes the longest per sample, is the largest network, and has the highest precision, like the supervisor, who must talk with the candidate at length and cannot decide in one sitting, slowly drawing the candidate in.
Figure 28: MTCNN network structure
(1) P network
Network design
Input 12x12, output 1x1; the middle can be regarded as a 12x12 convolution kernel.
 Firstly, after 3 x 3 convolution kernel, the step size is 1, and the feature map of 10 x 10 is obtained;
 After 3 x 3 Maximum pooling, step size is 2 (there is a part of overlap, more information is not lost, more information is retained), and a feature map of 5 x 5 size is obtained;
 After 3 x 3 convolution kernel step length is 1, 3 x 3 characteristic graph is obtained;
Finally, after a 3 x 3 convolution kernel, we get a 1 x 1 size feature map.
There are three layers of 3x3 convolution, with pooling after the first layer, finally producing a 1x1x32 feature map. The last layer uses full convolution instead of full connection (a fully-connected layer fuses channels but restricts the input image size: it flattens and multiplies out the C, H, W dimensions, so it only accepts one input shape. The convolutional format is NCHW; the fully-connected format is NV, so using full connection requires multiplying C, H, and W together).
The final output splits into three heads. First, convolve the 1x1x32 with kernels to get 1x1x2, the confidence (the original paper uses Softmax activation to get two confidences; it is recommended instead to output 1x1x1 and activate with Sigmoid to get one value, since confidence needs only one value. Changing the original paper here is fine: it was published early and this detail was not fully thought through). Second, convolve the 1x1x32 to get 1x1x4, the four coordinate values of the face (two corner points, four values in total). Third, a 1x1x10 output marks the five facial key points: two points for the eyes, one for the nose, and two for the mouth, as in the original paper.
Figure 29: P network structure
Network usage
Finally, 1x1x1 is the confidence and 1x1x4 the face coordinates. The two are handled separately, each with its own activation function. The data sets for training confidence and training coordinates differ: confidence is a binary classification problem trained with face and non-face data; coordinate training requires that every image contain a face, with differing coordinate labels.
 What are the four coordinate points activated with? (explained earlier)
Softmax cannot be used: it is exclusive, its outputs sum to 1, and the four coordinate values should not be coupled. Sigmoid's range (positive values only) does not fit either: when only part of a face is visible, a coordinate can fall outside the picture and become negative, and such negative values must remain usable. (Training on half a face and on a whole face are different things; generally the samples are whole faces, and negative coordinates still arise.) Tanh, ReLU, and Y=X can all be used; Y=X is best, because the result needs the exact coordinate values the network computes. Tanh's range is satisfied but it deforms the values; ReLU deforms the negative half-axis.
 How to normalize when the picture format is coordinate value?
Divide the coordinate value by the length of the longest side.
 How to normalize the image format when it is pixel value?
Divide the pixel value by 255. That is, divided by the maximum value.
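Both normalizations in one short sketch (the picture size and values are illustrative):

```python
import numpy as np

# pixel values: divide by the maximum possible value, 255
pixels = np.array([0, 128, 255], dtype=np.float64)
pixels_norm = pixels / 255.0             # now in [0, 1]

# coordinate values: divide by the longest side of the picture
w, h = 640, 480                          # hypothetical picture size
coords = np.array([320.0, 240.0])
coords_norm = coords / max(w, h)         # [0.5, 0.375]
print(pixels_norm.max(), coords_norm.tolist())
```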
(2) R network
 First, the convolution of 3 x 3 is used, and the step size is 1;
 3 x 3 pooling, step size 2;
Next, another 3 x 3 convolution, step size 1;
 3 x 3 pooling, step size 2;
Continue with a 2x2 convolution;
 Finally, there is a full connection layer.
Because the R network's input size is fixed (its input is the result of P network processing), converting the full connection into full convolution is also no problem. Compared with the P network, the R network has more weights and higher precision. Finally it outputs 1 confidence and 4 coordinate values.
Figure 30: R network structure
(3) O network
The results of R network processing are handed to the O network. The O network has four convolution layers and three pooling layers, larger than the R network. Finally it outputs one confidence and two corner points (four values).
Figure 31: O network structure
(4) Tips
P network is equivalent to 12 * 12 convolution kernel.
The P network's 12x12 input refers to the suggestion box: each scan covers a 12x12 region. Even when the input is changed from 12x12 to 14x14, the network's nature does not change, because the three middle 3x3 convolution layers are equivalent to a 12x12 kernel. The suggestion box size equals the equivalent kernel size: 12x12.

If the input picture is 13x13, the P network outputs 2x2x32 (with padding). The original 1x1 output carries 5 values; with a 13x13 input the output carries 2x2 groups of 5 values, i.e. the image is divided into four regions. In general, an input that yields an NxNx32 feature map yields NxN groups of 5 values (one NxNx1 confidence map and one NxNx4 coordinate map): the network has scanned NxN regions of the input. In other words, feed in an image of any size, and the P network scans it with its 12x12 equivalent kernel to get NxNx5 values; then check the 5 values (confidence, coordinates) of each region to decide whether it contains a face.
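Combining this sliding-kernel view with the earlier back-calculation rules, each P-net feature-map cell maps to a 12x12 suggestion box on the original image; a sketch using the stride of 2 mentioned earlier (the function name is my own):

```python
def pnet_window(ix, iy, scale, stride=2, cell=12):
    # back-calculate the 12x12 suggestion box for P-net feature-map cell
    # (ix, iy) on an image that was shrunk by `scale` in the pyramid
    x1 = ix * stride / scale
    y1 = iy * stride / scale
    x2 = (ix * stride + cell) / scale
    y2 = (iy * stride + cell) / scale
    return x1, y1, x2, y2

print(pnet_window(3, 0, scale=0.5))  # (12.0, 0.0, 36.0, 24.0)
```

At scale 0.5 the 12x12 window corresponds to a 24x24 region of the original image, which is how the pyramid lets a fixed window catch larger faces.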
The accuracy of P network is the lowest, R network is a little higher, and O network is the highest. Therefore, the size of the network input characteristic graph is gradually increasing, increasing the calculation strength and accuracy.
 How do R and O networks reverse the location of the original map?
It is the same as the P network's back-calculation. The results of the three networks' boxes are shown in Figure 36.
Figure 36: results of the three networks' boxes
Offsets replace coordinate points
The network regresses offsets instead of raw coordinate values. As shown in Figure 37, the green box is the suggestion box and the red box the actual box. Why offsets? 1) With image-pyramid zooming, raw coordinate points lose meaning after scaling, while offsets remain usable; 2) offsets are easy to normalize, and raw coordinate values are not.
How is the offset computed? The actual box's upper-left corner is expressed relative to the suggestion box's upper-left corner, and its lower-right corner relative to the suggestion box's lower-right corner (for the P network the suggestion box is the 12x12 window; for the R network it is the P network's output box; for the O network it is the R network's output box). Each corner is referenced to its own corner of the suggestion box, rather than always to the upper-left corner, because the lower-right coordinates are large: their difference from the suggestion box's upper-left corner, divided by the corresponding side length, would give a large quotient and spoil the normalization. As shown in Figure 37, the offset of point b is: offset_x = (Xa - X1) / W, offset_y = (Ya - Y1) / H. When the network is trained, its result is the offset. How is the original-image position of the actual box recovered? For point b: x = X1 + offset_x * W, y = Y1 + offset_y * H.
Figure 37 calculation of offset
Use of offset:
Training and use.

The offset code is as follows:
```python
# Calculate the coordinate offsets relative to the square crop (side_len)
offset_x1 = (x1 - x1_) / side_len
offset_y1 = (y1 - y1_) / side_len
offset_x2 = (x2 - x2_) / side_len
offset_y2 = (y2 - y2_) / side_len
```
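Inverting the offsets (crop corner + offset x side length) recovers the actual box from the square crop; a round-trip sketch with illustrative coordinates:

```python
# square crop (suggestion box) and actual face box, illustrative values
x1_, y1_, side_len = 10.0, 20.0, 16.0
x2_, y2_ = x1_ + side_len, y1_ + side_len
x1, y1, x2, y2 = 12.0, 24.0, 22.0, 34.0   # actual (ground-truth) box

# forward: offsets as in the snippet above
offset_x1 = (x1 - x1_) / side_len
offset_y1 = (y1 - y1_) / side_len
offset_x2 = (x2 - x2_) / side_len
offset_y2 = (y2 - y2_) / side_len

# inverse: crop corner + offset * side length gives back the actual box
rx1 = x1_ + offset_x1 * side_len
ry1 = y1_ + offset_y1 * side_len
rx2 = x2_ + offset_x2 * side_len
ry2 = y2_ + offset_y2 * side_len
print((rx1, ry1, rx2, ry2))  # (12.0, 24.0, 22.0, 34.0)
```

This inverse step is what turns the network's regressed offsets back into box coordinates on the original image.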
15. Network training
(1) Three networks
Each can be trained independently.
(2) Two losses:
One for confidence and one for offset.

Confidence level:
The tag uses 0 (no face) and 1 (with face), so there are two kinds of data: a group of data with face and a group of data without face. Labels: 0 and 1.

Offset:
Each image is required to contain a face, so there is always an offset. What, then, differs among the face images? The position of the face, i.e. the offset. Data: positive samples and partial samples. Partial samples have large offsets: when the network is trained with partial-face samples, it can, at recognition time, frame faces that extend beyond the image, as the figure shows.
(3) WIDER FACE and CelebA
Usage of the WIDER FACE dataset:
The faces are relatively small, with multiple faces per picture, so smaller faces can be tracked. Advantage: a network trained on WIDER FACE tracks more faces, never fewer, giving a higher recall rate. Disadvantage: the faces in the training data are small, so the recognition accuracy is lower, i.e. the probability of a wrong box is larger.
Usage of the CelebA dataset:
Advantage: a network trained on CelebA recognizes faces with high accuracy. Disadvantage: the recall rate is low, i.e. smaller faces are discarded and cannot be framed.
 The two kinds of data sets are used differently due to different situations.
This example uses the celebA dataset.
 To view the celebA dataset (positive sample):
Figure 39: without box. Figure 40: with box drawn.
This box is too large. That is, the CelebA labels are loose, and a network trained on these over-large labels also produces over-large boxes. When using the data, you can manually shrink the label boxes (shrinking them programmatically generally introduces bias), or enlarge the offsets. For high-precision results you need to buy or build a data set; good results need roughly 1 to 1.2 million face images.

```python
from PIL import Image, ImageDraw
import os

IMG_DIR = r"E:\Data\Data_AI\CelebA\Img\img_celeba.7z\img_celeba"  # data
AND_DIR = r"E:\Data\Data_AI\CelebA\Anno"  # labels

# Read an image
img = Image.open(os.path.join(IMG_DIR, "000002.jpg"))
img.show()

# Read the label and draw the labelled box on the image.
# Label text values 72 94 221 306 are x1, y1, w, h; convert to drawing
# coordinates x1, y1, x2, y2 = 72, 94, 72 + 221, 94 + 306.
imgDraw = ImageDraw.Draw(img)
imgDraw.rectangle((72, 94, 72 + 221, 94 + 306), outline="red")
img.show()
```
 To view the WIDER FACE dataset:
The label boxes are more standard. However, the false-frame rate is high: when framing a face it will sometimes also frame hair, shoes, and so on (red shoes and red hair are misjudged).
(4) Sample addition:
 Theory:
Given a known positive-sample box, compute its center point; then move the center point randomly up, down, left, or right, by at most 1/2 of the height and 1/2 of the width; then generate a square box around the translated center (the inputs of the P, R, and O networks are squares), with a side length that varies randomly within an interval around the original box's sides. In this way many boxes can be drawn; some contain more of the face and some contain less. This stack of boxes supplies the positive and negative samples. How are they distinguished? By IOU: compare each generated box against the original positive-sample box. The original paper recommends the following IOU ranges:
0-0.3: non-face (negative-sample data cannot, in general, be generated by the method above.)
0.65-1.00: face (positive sample)
0.4-0.65: partial face (partial sample)
0.3-0.4: negative sample
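These thresholds are applied by comparing each generated box with the original box. The project calls a helper `NMS.iou` for this; a minimal sketch of such a function (the name `iou` and the [x1, y1, x2, y2] box layout follow the code later in this article) might look like:

```python
import numpy as np

def iou(box, boxes):
    # Intersection over Union between one box and an array of boxes.
    # box: [x1, y1, x2, y2]; boxes: shape (N, 4)
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    # Intersection rectangle (clamped to zero when boxes do not overlap)
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2])
    yy2 = np.minimum(box[3], boxes[:, 3])
    w = np.maximum(0, xx2 - xx1)
    h = np.maximum(0, yy2 - yy1)
    inter = w * h
    return inter / (box_area + areas - inter)

# A crop identical to the ground-truth box has IOU 1
gt = np.array([[0, 0, 10, 10]])
crop = np.array([0, 0, 10, 10])
print(iou(crop, gt))  # [1.]
```

The returned array can then be compared against the thresholds above to decide whether the crop is a positive, partial, or negative sample.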
 Training sample proportion:
Negative sample: positive sample: partial sample: landmark = 3:1:1:2
 Actual:
Translate the center point of the original box randomly within a certain range; use the translated point as the center of a square (the inputs of the networks are squares). The rule for the square is: the minimum side length is 0.8 times the shorter of the original box's width and height, and the maximum side length is 1.25 times the longer of the two (these factors can be tuned). The code is as follows:
for _ in range(5):
    # Shift the face center slightly
    w_ = np.random.randint(-w * 0.2, w * 0.2)
    h_ = np.random.randint(-h * 0.2, h * 0.2)
    cx_ = cx + w_
    cy_ = cy + h_
    # Form a square around the shifted center, with slightly varying side length
    side_len = np.random.randint(int(min(w, h) * 0.8), np.ceil(1.25 * max(w, h)))  # np.ceil(): round up
    # Top-left corner of the square
    x1_ = np.maximum(cx_ - side_len / 2, 0)
    y1_ = np.maximum(cy_ - side_len / 2, 0)
    # Bottom-right corner of the square
    x2_ = x1_ + side_len
    y2_ = y1_ + side_len
    crop_box = np.array([x1_, y1_, x2_, y2_])

Manufacturing negative samples:

The first method:
The part outside the original data frame is cropped as a non face, as shown in Figure 33.
Figure 41 how non-faces are made
 The second method: add samples by the sample-adding method above and use IOU to pick out the negative ones.
 The third method:
Generate samples separately. First set a side-length range: the minimum is face_size and the maximum is half of the picture's shortest edge. The top-left coordinate ranges are x1: 0 to (picture width minus side length) and y1: 0 to (picture height minus side length). The bottom-right coordinates are x2 = x1 + side length, y2 = y1 + side length. (This method sometimes crops a partial or even a complete face; lowering the IOU threshold avoids this but also reduces the number of negative samples generated.) The schematic diagram is as follows:
Figure 42 schematic. The code is as follows:

for i in range(5):
    side_len = np.random.randint(face_size, min(img_w, img_h) / 2)
    x_ = np.random.randint(0, img_w - side_len)
    y_ = np.random.randint(0, img_h - side_len)
    crop_box = np.array([x_, y_, x_ + side_len, y_ + side_len])
    if np.max(NMS.iou(crop_box, _boxes)) < 0.3:
        face_crop = img.crop(crop_box)
        face_resize = face_crop.resize((face_size, face_size), Image.ANTIALIAS)
        negative_anno_file.write("negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0))
        negative_anno_file.flush()
        face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
        negative_count += 1
 The fourth method:
Any crawler-scraped picture that contains no faces can be used as non-face data. The background colors should be complex.
 Reminder
When using the CelebA dataset, negative samples can be made from the face images themselves, with no extra non-face data to download, which reduces the workload: use the three sample-adding methods above.
 Sample situation
12x12 positive samples, negative samples, partial samples; 24x24 positive samples, negative samples, partial samples; 48x48 positive samples, negative samples, partial samples. Three networks can be trained at the same time.
 Performance requirement
Notebook training is available (each network structure is very small).

Label status:

Label: one confidence and four offsets.

Sample: positive sample, partial sample, negative sample.

Confidence: positive sample (1), negative sample (0), partial sample (2) (an arbitrary value, given only to keep the format uniform). Note: when training confidence, only positive samples (1) and negative samples (0) are used, and the offset values are ignored; when training offsets, only positive samples (1) and partial samples (2) are used, and the confidence values are ignored. Hence the partial-sample confidence can be assigned any value; the program separates the data accordingly.

The format of the created data is:
The IOU values from the original paper cannot be used directly to make samples. As Figure 45 below shows, the negative samples produced with the paper's IOU values contain partial faces.
Figure 45 negative samples made with the paper's IOU values contain partial faces. Some partial samples are also non-standard and contain complete faces (Figure 46). Figure 46 partial samples containing complete faces. Samples made strictly per the paper are not standard, and the network trained on them is poor. Adjust the IOU values so that partial samples contain only part of a face and positive samples contain only complete faces. Negative-sample label value:
(1) Detailed explanation
First, for an incoming picture (Figure 50-(0)), build an image pyramid (the P network's input size is 12x12 and the incoming picture is generally larger, so the image is scaled down repeatedly so that the larger faces in the picture can also be boxed), yielding a stack of face boxes as in Figure 50-(1). This stack goes through the P network, producing the boxes in Figure 50-(2) (why do big boxes cover small ones? Because of the image pyramid: the more heavily an image is scaled down, the larger the box it produces on the original). NMS then removes some of the boxes at each scale, leaving a smaller pile than before (Figure 50-(3)). Next, for each remaining box, crop the corresponding region from the original picture, resize it to a 24x24 square, and pass it into the R network, which selects boxes again (Figure 50-(4)); apply NMS to those results, leaving a pile of boxes (Figure 50-(5)). Then crop the regions the R network recognized from the original picture, resize them to 48x48 squares, and pass them into the O network, which selects boxes one more time (Figure 50-(6)), yielding the final box, which is drawn on the picture without further cropping (Figure 50-(7)).
(0) (1) (2) (3) (4) (5) (6) (7) Figure 50 use of the network
(2) Tips
 P network accuracy is low:
Before the P network is used, pictures larger than 12x12 are scaled down, which lowers the pixel resolution and therefore the network's recognition accuracy.
 R high network accuracy:
The R network takes the region framed by the P network, crops it from the original image at full resolution, and selects boxes on that region. The resolution is higher, so recognition accuracy improves.
 O highest network accuracy:
Similarly, the O network improves on the R network's accuracy. The O network sees the most data (48x48 input pictures), so it trains to the highest accuracy.
 Question 1: after the pyramid (Figure 50-(1)), how do the boxes in Figure 50-(2) correspond to each scaled picture?
Answer: they are not processed together. In the program, the picture is first passed into the P network and NMS leaves some boxes (stored as [[]]); then the picture is scaled by a certain factor (such as 0.7) and passed into the P network again, giving another pile of boxes (stored as [[], []]); this is repeated, and finally all the boxes are drawn on the original picture (Figure 50-(2)). The R network then computes on the boxes produced by the P network, and so on.
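The repeated-scaling loop in this answer can be sketched as follows (the 0.7 factor comes from the example above; stopping once the shorter side falls below the P network's 12-pixel input is an assumption):

```python
def pyramid_scales(w, h, factor=0.7, min_side=12):
    # Collect the scales at which the image would be fed to the P network.
    scales = []
    scale = 1.0
    while min(w, h) * scale >= min_side:
        scales.append(scale)
        scale *= factor  # shrink for the next pyramid level
    return scales

print(pyramid_scales(100, 60))
```

Each scale produces its own pile of P-network boxes, which are all mapped back onto the original picture.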
 Question 2: how to calculate confidence and offset in groups?
Answer: from a group of values [X1, Y1, X2, Y2, C], only C is taken when computing the confidence, and only X1, Y1, X2, Y2 are taken when computing the offset.
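For example, with the groups stored as rows of a tensor (a hypothetical layout matching the answer), the two parts are just column slices:

```python
import torch

# Each row is one group: [x1, y1, x2, y2, c]
rows = torch.tensor([[10., 20., 50., 60., 1.],
                     [15., 25., 55., 65., 0.]])
offsets = rows[:, 0:4]  # x1, y1, x2, y2 of every row
conf = rows[:, 4]       # c of every row
print(offsets.shape, conf.shape)
```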
17. Advantages and disadvantages of MTCNN
(1) Advantages
Universal tracking
(2) Disadvantages
High false-alarm rate: it easily recognizes things that are not faces as faces, mainly because the network structure is shallow. Its purpose is to filter out non-faces quickly and leave the rest to the later networks.
4, Detailed analysis of project code
: observe the sample data – design the loss according to the network – organize the data – design the network – train the network – verify
 Observation sample data: sample data determines the final result;
 Design loss: the loss is designed, that is, the general design of the project is completed. (core and difficulty)
 Organize data: generally, the data provided can not meet their own needs. For example, 12x12, 24x24 and 48x48 in MTCNN contain positive samples, negative samples and some samples;
 Design Network: design network structure.
 Training network: use the sample data to train the network parameters, so that the parameters are optimal.
 Verification: test whether the network can achieve the expected results.
Note: the first three steps are most important.
1. Organize data
(1) Sample storage path:
Figure 51 storage form of samples in files
(2) Creating a file with open in "w" mode: if the file exists, it is overwritten; if not, an empty file is created.
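A minimal illustration of the "w" behavior (the file name and label line here are arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "anno_demo.txt")
f = open(path, "w")  # creates the file, or truncates it if it already exists
f.write("positive/0.jpg 1 0.1 0.2 0.1 0.2\n")
f.close()
print(open(path).read())
```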
Project process: make samples – write the network – make the dataset –
(3) Full code

Firstly, the attribute of onetime design dimension is defined;

Next, declare the picture storage path. If the path does not exist, create it;

Next, the label storage path is declared;

Then, count three kinds of samples. The image storage name is stored according to the count to ensure no repetition;

Next, read in the tag file. Traverse each line without reading the first two lines;

Next, read the contents of each line. Read out the picture name;

Next, the picture is read according to the picture name and the picture path;

Next, make data;

######Get the width and height of the picture.

Get the coordinates of the upper left corner of the suggestion box.

Gets the width and height of the suggestion box.

Get the coordinates of the lower right corner of the suggestion box.

The five landmark points are ignored.

Filter fields. Exclude boxes that are too small. (exclude the nonstandard boxes in the sample. If the sample frame is less than 40, the learned face is very nonstandard, and the trained network frame error rate is very high, resulting in low accuracy.)

Store four coordinate points that meet the requirements.

Calculate the coordinates of face center points.

Number of randomly generated samples.

The offset value of the random center point.

Generates a new center point based on the offset value.

Make a square box and offset the box. The center point is the randomly generated center point.

Calculate the coordinate offset value. Calculate the offset between the generated box and the actual box of the data sample.

Matting and scaling (based on 12x12,24x24,48x48 size).

Is the sample positive, negative or partial?
 Pass the generated box into IOU to calculate IOU value.
 Positive sample: write label (confidence level is 1); save picture.
 Part of the sample: write the label (confidence level is 2); save the picture.
 Negative sample: write label (confidence level is 0); save picture. (in this way, there are few negative samples, or even no negative samples)

Negative samples are generated separately.
First set a side-length range: the minimum is face_size and the maximum is half of the picture's shortest edge. The top-left coordinate ranges are x1: 0 to (picture width minus side length) and y1: 0 to (picture height minus side length). The bottom-right coordinates are x2 = x1 + side length, y2 = y1 + side length. (This method sometimes crops a partial or even a complete face; lowering the IOU threshold avoids this but also reduces the number of negative samples generated.)

Store samples.

Close manufacturing.
import os
from PIL import Image
import numpy as np
from MTCNN import NMS
import traceback

anno_src = r"E:\Data\Data_AI\CelebA\Anno\list_bbox_celeba.txt"  # labels
img_dir = r"E:\Data\Data_AI\CelebA\Img\img_celeba.7z\img_celeba"  # pictures
save_path = r"E:\project_folder\project_AI\MTCNN\celeba1"  # where the sorted data is stored

for face_size in [12, 24, 48]:
    print("gen %i image" % face_size)
    # Sample image storage paths
    positive_image_dir = os.path.join(save_path, str(face_size), "positive")
    negative_image_dir = os.path.join(save_path, str(face_size), "negative")
    part_image_dir = os.path.join(save_path, str(face_size), "part")
    # Create the three folders if they do not exist.
    for dir_path in [positive_image_dir, negative_image_dir, part_image_dir]:
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
    # Sample label storage paths
    positive_anno_filename = os.path.join(save_path, str(face_size), "positive.txt")
    negative_anno_filename = os.path.join(save_path, str(face_size), "negative.txt")
    part_anno_filename = os.path.join(save_path, str(face_size), "part.txt")
    # Count the three kinds of samples separately, so picture names never repeat.
    positive_count = 0
    negative_count = 0
    part_count = 0
    try:
        # Create the text files in "w" mode
        positive_anno_file = open(positive_anno_filename, "w")
        negative_anno_file = open(negative_anno_filename, "w")
        part_anno_file = open(part_anno_filename, "w")
        """Get sample information"""
        # Open the label file; skip the first two lines
        for i, line in enumerate(open(anno_src)):
            if i < 2:
                continue
            try:
                """Read the picture"""
                strs = line.strip().split()
                image_filename = strs[0].strip()  # picture name; strip() guards against surrounding spaces
                print(image_filename)
                image_file = os.path.join(img_dir, image_filename)
                """Create data"""
                with Image.open(image_file) as img:  # open the picture
                    img_w, img_h = img.size  # width and height of the picture
                    x1 = float(strs[1].strip())
                    y1 = float(strs[2].strip())
                    w = float(strs[3].strip())
                    h = float(strs[4].strip())
                    x2 = float(x1 + w)
                    y2 = float(y1 + h)
                    # 5 key points (not needed for now)
                    px1 = 0  # float(strs[5].strip())
                    py1 = 0  # float(strs[6].strip())
                    px2 = 0  # float(strs[7].strip())
                    py2 = 0  # float(strs[8].strip())
                    px3 = 0  # float(strs[9].strip())
                    py3 = 0  # float(strs[10].strip())
                    px4 = 0  # float(strs[11].strip())
                    py4 = 0  # float(strs[12].strip())
                    px5 = 0  # float(strs[13].strip())
                    py5 = 0  # float(strs[14].strip())
                    # Filter fields (exclude non-standard boxes: if a sample box is smaller
                    # than 40, the learned face is very non-standard and the trained
                    # network's frame-error rate is high, lowering accuracy.)
                    if max(w, h) < 40 or x1 < 0 or y1 < 0 or w < 0 or h < 0:
                        continue
                    boxes = [[x1, y1, x2, y2]]  # coordinates that meet the requirements
                    # Face center point
                    cx = x1 + w / 2
                    cy = y1 + h / 2
                    # Multiply the number of positive and partial samples
                    for _ in range(5):
                        # Shift the face center slightly
                        w_ = np.random.randint(-w * 0.2, w * 0.2)
                        h_ = np.random.randint(-h * 0.2, h * 0.2)
                        cx_ = cx + w_
                        cy_ = cy + h_
                        # Form a square around the shifted center
                        side_len = np.random.randint(int(min(w, h) * 0.8), np.ceil(1.25 * max(w, h)))  # np.ceil(): round up
                        # Top-left corner of the square
                        x1_ = np.maximum(cx_ - side_len / 2, 0)
                        y1_ = np.maximum(cy_ - side_len / 2, 0)
                        # Bottom-right corner of the square
                        x2_ = x1_ + side_len
                        y2_ = y1_ + side_len
                        crop_box = np.array([x1_, y1_, x2_, y2_])
                        # Offsets between the generated box and the label box
                        offset_x1 = (x1 - x1_) / side_len
                        offset_y1 = (y1 - y1_) / side_len
                        offset_x2 = (x2 - x2_) / side_len
                        offset_y2 = (y2 - y2_) / side_len
                        # Five key points (not considered for now)
                        offset_px1 = 0  # (px1 - x1_) / side_len
                        offset_py1 = 0  # (py1 - y1_) / side_len
                        offset_px2 = 0
                        offset_py2 = 0
                        offset_px3 = 0
                        offset_py3 = 0
                        offset_px4 = 0
                        offset_py4 = 0
                        offset_px5 = 0
                        offset_py5 = 0
                        # Crop and zoom the picture
                        face_crop = img.crop(crop_box)  # crop: matting
                        face_resize = face_crop.resize((face_size, face_size))  # zoom to 12x12, 24x24 or 48x48
                        # Positive, negative or partial sample?
                        iou = NMS.iou(crop_box, np.array(boxes))[0]  # IOU value
                        if iou > 0.65:  # positive sample
                            positive_anno_file.write(
                                "positive/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                    positive_count, 1, offset_x1, offset_y1, offset_x2, offset_y2,
                                    offset_px1, offset_py1, offset_px2, offset_py2, offset_px3,
                                    offset_py3, offset_px4, offset_py4, offset_px5, offset_py5))
                            positive_anno_file.flush()
                            face_resize.save(os.path.join(positive_image_dir, "{0}.jpg".format(positive_count)))
                            positive_count += 1
                        elif iou > 0.4:  # partial sample
                            part_anno_file.write(
                                "part/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                    part_count, 2, offset_x1, offset_y1, offset_x2, offset_y2,
                                    offset_px1, offset_py1, offset_px2, offset_py2, offset_px3,
                                    offset_py3, offset_px4, offset_py4, offset_px5, offset_py5))
                            part_anno_file.flush()
                            face_resize.save(os.path.join(part_image_dir, "{0}.jpg".format(part_count)))
                            part_count += 1
                        elif iou < 0.3:  # negative sample (few, or even none, arise this way)
                            negative_anno_file.write(
                                "negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0))
                            negative_anno_file.flush()
                            face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
                            negative_count += 1
                    # Generate negative samples separately (a partial face may occasionally be cropped)
                    _boxes = np.array(boxes)
                    for i in range(5):
                        side_len = np.random.randint(face_size, min(img_w, img_h) / 2)  # from face_size up to half of the shortest edge
                        x_ = np.random.randint(0, img_w - side_len)
                        y_ = np.random.randint(0, img_h - side_len)
                        crop_box = np.array([x_, y_, x_ + side_len, y_ + side_len])
                        if np.max(NMS.iou(crop_box, _boxes)) < 0.3:
                            face_crop = img.crop(crop_box)
                            face_resize = face_crop.resize((face_size, face_size), Image.ANTIALIAS)
                            negative_anno_file.write(
                                "negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0))
                            negative_anno_file.flush()
                            face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
                            negative_count += 1
            except Exception as e:
                traceback.print_exc()
    finally:
        positive_anno_file.close()
        negative_anno_file.close()
        part_anno_file.close()
2. Network structure
import torch
import torch.nn as nn
import torch.nn.functional as F

class PNet(nn.Module):
    def __init__(self):
        super(PNet, self).__init__()
        self.pre_layer = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3, stride=1),   # conv1
            nn.PReLU(),                                  # prelu1
            nn.MaxPool2d(kernel_size=3, stride=2),       # pool1
            nn.Conv2d(10, 16, kernel_size=3, stride=1),  # conv2
            nn.PReLU(),                                  # prelu2
            nn.Conv2d(16, 32, kernel_size=3, stride=1),  # conv3
            nn.PReLU()                                   # prelu3
        )
        self.conv4_1 = nn.Conv2d(32, 1, kernel_size=1, stride=1)  # one confidence
        self.conv4_2 = nn.Conv2d(32, 4, kernel_size=1, stride=1)  # four offsets

    def forward(self, x):
        x = self.pre_layer(x)
        cond = F.sigmoid(self.conv4_1(x))  # activated confidence
        offset = self.conv4_2(x)           # unactivated offset
        return cond, offset

class RNet(nn.Module):
    def __init__(self):
        super(RNet, self).__init__()
        self.pre_layer = nn.Sequential(
            nn.Conv2d(3, 28, kernel_size=3, stride=1),   # conv1
            nn.PReLU(),                                  # prelu1
            nn.MaxPool2d(kernel_size=3, stride=2),       # pool1
            nn.Conv2d(28, 48, kernel_size=3, stride=1),  # conv2
            nn.PReLU(),                                  # prelu2
            nn.MaxPool2d(kernel_size=3, stride=2),       # pool2
            nn.Conv2d(48, 64, kernel_size=2, stride=1),  # conv3
            nn.PReLU()                                   # prelu3
        )
        self.conv4 = nn.Linear(64 * 2 * 2, 128)  # conv4
        self.prelu4 = nn.PReLU()                 # prelu4
        """Confidence and offset are computed directly on top of the linear layer.
        Doing this fully convolutionally would require converting back from the
        linear layers, which is troublesome."""
        # detection
        self.conv5_1 = nn.Linear(128, 1)  # one confidence
        # bounding box regression
        self.conv5_2 = nn.Linear(128, 4)  # four offsets

    def forward(self, x):
        x = self.pre_layer(x)
        x = x.view(x.size(0), -1)  # flatten
        x = self.conv4(x)
        x = self.prelu4(x)
        # detection
        label = F.sigmoid(self.conv5_1(x))
        # bounding box regression
        offset = self.conv5_2(x)
        return label, offset

class ONet(nn.Module):
    def __init__(self):
        super(ONet, self).__init__()
        self.pre_layer = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1),    # conv1
            nn.PReLU(),                                   # prelu1
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool1
            nn.Conv2d(32, 64, kernel_size=3, stride=1),   # conv2
            nn.PReLU(),                                   # prelu2
            nn.MaxPool2d(kernel_size=3, stride=2),        # pool2
            nn.Conv2d(64, 64, kernel_size=3, stride=1),   # conv3
            nn.PReLU(),                                   # prelu3
            nn.MaxPool2d(kernel_size=2, stride=2),        # pool3
            nn.Conv2d(64, 128, kernel_size=2, stride=1),  # conv4
            nn.PReLU()                                    # prelu4
        )
        self.conv5 = nn.Linear(128 * 2 * 2, 256)  # conv5
        self.prelu5 = nn.PReLU()                  # prelu5
        # detection
        self.conv6_1 = nn.Linear(256, 1)
        # bounding box regression
        self.conv6_2 = nn.Linear(256, 4)

    def forward(self, x):
        x = self.pre_layer(x)
        x = x.view(x.size(0), -1)  # flatten
        x = self.conv5(x)
        x = self.prelu5(x)
        # detection
        label = F.sigmoid(self.conv6_1(x))
        # bounding box regression
        offset = self.conv6_2(x)
        return label, offset
3. data set

Inherit Dataset;

Rewrite three methods: add the data set to the list; read and load the positive samples, negative samples and partial samples in the tag into the list; rewrite the len method.

__getitem__:
 Get the picture, confidence and offset from the dataset. Treat the picture as X, and the confidence and offset as Y. (Dataset sample format: [P, C, X1, Y1, X2, Y2].)
 Take out the data, get the picture path, take out the picture.
 Take out the confidence and turn it into Tensor.
 The offset is the same as above.
 Normalize the picture.
 And return the picture, confidence and offset.

Picture changing axis
NHWC –> NCHW
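This axis change can be done with `Tensor.permute`, e.g. (a minimal sketch with a random image-shaped tensor):

```python
import torch

batch = torch.rand(8, 12, 12, 3)       # NHWC: a batch of 12x12 RGB images
batch_chw = batch.permute(0, 3, 1, 2)  # NCHW, as PyTorch conv layers expect
print(batch.shape, batch_chw.shape)
```

For a single image, the HWC -> CHW variant is `img.permute(2, 0, 1)`, as in the commented-out line of the dataset code below.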
from torch.utils.data import Dataset
import os
import numpy as np
import torch
from PIL import Image

class FaceDataset(Dataset):
    def __init__(self, path):
        self.path = path
        self.dataset = []
        self.dataset.extend(open(os.path.join(path, "positive.txt")).readlines())
        self.dataset.extend(open(os.path.join(path, "negative.txt")).readlines())
        self.dataset.extend(open(os.path.join(path, "part.txt")).readlines())

    def __getitem__(self, index):
        strs = self.dataset[index].strip().split(" ")
        img_path = os.path.join(self.path, strs[0])
        cond = torch.Tensor([int(strs[1])])
        offset = torch.Tensor([float(strs[2]), float(strs[3]), float(strs[4]), float(strs[5])])
        img_data = torch.Tensor(np.array(Image.open(img_path)) / 255. - 0.5)  # normalize
        # print(img_data.shape)
        # a = img_data.permute(2, 0, 1)  # HWC -> CHW
        # print(a.shape)
        return img_data, cond, offset

    def __len__(self):
        return len(self.dataset)

if __name__ == '__main__':
    dataset = FaceDataset(r"D:\celeba4\12")
    print(dataset[0])
4. Training network

Three networks train at the same time:

The output is the same;

The training process is the same; (load the data and get the result)

Different data sets, different networks and the same results (loss of confidence and offset);

Write a module (Trainer) to train three networks at the same time. It mainly passes in two parameters (training data set, network) and saves the final result (parameters to be saved by the network).

Detailed analysis of trainer:
 Incoming network, save path, dataset, GPU
 Initialize the above four parameters
 The confidence loss is computed with the cross-entropy function.
 The offset loss is computed with the mean-square-error function.
 Use the Adam() optimizer to optimize the parameters passed in.
 If you have previously saved a model, continue training.
 Load data.
 Read picture, confidence, offset.
 First, pass the picture into the network to get back confidence and offset. Reshape the confidence. (Reason 1: the P network's output shape is NCHW (N,1,1,1) while the R and O networks output NV (N,1); the outputs must be unified. NV (N,1) is chosen over NCHW because the labels have NV structure. Reason 2: when the P network receives a large picture, the result is N,1,A,A, which must be reshaped into NV structure (NxAxA, 1), e.g. N,1,2,2 –> (Nx4, 1).)
 The offset is deformed.
 Calculate confidence loss: exclude the partial-sample labels. Take from the labels a mask of confidence less than 2 and use it to select the label data with confidence 0 and 1; apply the same mask to the network's output confidences (the batch fed to the network contains pictures of every confidence type). Compute the loss between the label confidences and the network output confidences.
 Calculate offset loss: exclude the negative-sample labels. Take from the labels a mask of confidence greater than 0 and use it to select the label data with confidence 1 and 2; apply the same mask to the network's output offsets. Compute the loss between the label offsets and the network output offsets.
 Calculate the sum of confidence and offset losses.
 Back propagation.
 Optimization loss.

Done.
(1) Two methods of saving and loading network
 Method 1: network parameters
Since version 0.4, model parameters carry shape requirements: the shapes are recorded when the parameters are saved.
Preservation:
torch.save(model.state_dict(), PATH)
When saving a model for inference, only the trained model's learned parameters need to be saved. A common PyTorch convention is to save models with a .pt or .pth file extension.
Load:
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()
Be careful:
a. model.eval() must be called before running inference, to set the dropout and batch-normalization layers to evaluation mode. Failing to do so yields inconsistent inference results.
b. The load_state_dict() function accepts a dictionary object, not the path where the object is saved. This means you must deserialize the saved state_dict before passing it to load_state_dict().
 Method 2: network model (recommended)
Preservation:
torch.save(model, PATH)
Load:
# The model class must be defined somewhere
model = torch.load(PATH)
model.eval()
(2) Then train
 Use network parameters
if os.path.exists(self.save_path): net.load_state_dict(torch.load(self.save_path))
 Using the network model
if os.path.exists(self.save_path): torch.load(self.save_path)
(3) Output result shape transformation

Confidence shape transformation:

The output confidence of the last layer of P network is in the form of NCHW, and the essence is N111.
Note: N: batch; the first 1: one confidence value (channel is 1, since the network's final feature map has 32 channels at 1x1 before the 1-channel confidence convolution); the second 1: picture H (a 12x12 input yields a 1x1 output); the third 1: picture W (likewise 1x1).

The output confidence of the last layer of R network is NV because of the linear layer, which is N1 in essence

The output confidence of the last layer of the Onetwork is in the form of NV, which is N1 in essence

The confidence label itself is a number. When the batch picture is input, the label shape changes to NV
The main task is transforming the P network's NCHW output into (N, 1) structure. When the P network's input image is large, the output feature map is, say, 2x2, i.e. N,1,2,2; this must be reshaped into (Nx4, 1). For example, one picture produces a 2x2 confidence map of shape 1x1x2x2, which is reshaped into NV structure 4x1, i.e. [[1], [1], [1], [1]]. The program handles large feature maps this way: with a 12x12 input the output confidence is [[1]]; with an input larger than 12x12 it is [[1], [2], [3], ...], and the program can examine each confidence in turn.
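A quick sketch of that reshape on a P-network-style output (the confidence values here are made up):

```python
import torch

out = torch.tensor([[[[0.9, 0.8],
                      [0.1, 0.2]]]])  # NCHW: N=1, C=1, H=2, W=2
conf = out.view(-1, 1)                # NV structure: 4 rows, 1 column
print(conf.shape)  # torch.Size([4, 1])
```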
 Offset shape transform:
Ditto.
output_offset = _output_offset.view(-1, 4)
 10 key shape transformations
Ditto.
output_landmark = _output_landmark.view(-1, 10)
(4) Calculate losses by category
 Take positive and negative samples from the label
That is, partial samples are excluded. The figure below shows the text data from which positive and negative samples are taken: keep the sample data with confidence 0 and 1, exclude the sample data with confidence 2.
Figure 53 label form
 Practice method 1:

import numpy as np
a = np.array([8, 2, 7, 5, 1, 4])
print(a < 5)     # Boolean mask: less than 5
print(a[a < 5])  # values less than 5
Print results:
[False True False False True True] [2 1 4]
 Practice method 2:
import numpy as np
a = np.array([8, 2, 7, 5, 1, 4])
print(np.where(a < 5))     # indices of values less than 5
print(a[np.where(a < 5)])  # values less than 5
Print results:
(array([1, 4, 5], dtype=int64),) [2 1 4]
 Practice method 3:
import torch
a = torch.Tensor([1, 2, 3, 4, 5])
print(a < 4)           # Boolean output (older PyTorch prints 1/0 for True/False)
print(torch.lt(a, 4))  # lt: less than; gt: greater than; eq: equal; le: less or equal; ge: greater or equal
# The following two methods are equivalent
print(a[a < 4])
print(torch.masked_select(a, a < 4))
Print results:
tensor([ 1, 1, 1, 0, 0]) tensor([ 1, 1, 1, 0, 0]) tensor([1., 2., 3.]) tensor([1., 2., 3.])
Code: (practice method 3)
category_mask = torch.lt(category_, 2)                    # exclude partial samples: mask of confidence < 2
category = torch.masked_select(category_, category_mask)  # label data with confidence 0 or 1
 Take positive and negative samples from network results
 Final code
category_mask = torch.lt(category_, 2)                                 # exclude partial samples: mask of confidence < 2
category = torch.masked_select(category_, category_mask)               # label data with confidence 0 or 1
output_category = torch.masked_select(output_category, category_mask)  # matching network outputs
cls_loss = self.cls_loss_fn(output_category, category)
(5) Calculate the loss of offset
 Practice removing 2D array offsets
import torch
import numpy as np
a = torch.Tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
b = torch.Tensor([1, 2, 3, 4, 5])
# A 1-D mask selects rows of the 2-D tensor
print(a[b > 3])
Print results:
tensor([[ 7., 8.], [ 9., 10.]])
 Final code
offset_mask = torch.gt(category_, 0)  # negative samples do not take part
offset = offset_[offset_mask]
output_offset = _output_offset[offset_mask]
offset_loss = self.offset_loss_fn(output_offset, offset)
(6) Print loss
NumPy does not support CUDA, so a GPU tensor cannot be converted to a NumPy array directly. Convert from CUDA to CPU, take .data (the loss value), and then convert to NumPy.
print(" loss:", loss.cpu().data.numpy(),
      " cls_loss:", cls_loss.cpu().data.numpy(),
      " offset_loss:", offset_loss.cpu().data.numpy())
(7) Save model
torch.save(self.net.state_dict(), self.save_path)
print("save success")  # report success after every save
(8) Training network code
import os

import torch
from torch import nn
from torch.utils.data import DataLoader
import torch.optim as optim

from MTCNN.simpling import FaceDataset


class Trainer:

    def __init__(self, net, save_path, dataset_path, isCuda=True):
        self.net = net
        self.save_path = save_path
        self.dataset_path = dataset_path
        self.isCuda = isCuda
        if self.isCuda:
            self.net.cuda()

        self.cls_loss_fn = nn.BCELoss()     # binary cross entropy: confidence loss
        self.offset_loss_fn = nn.MSELoss()  # mean square error: offset loss
        self.optimizer = optim.Adam(self.net.parameters())  # optimizer

        # load the saved model if one exists, and continue training from it
        if os.path.exists(self.save_path):
            net.load_state_dict(torch.load(self.save_path))

    def train(self):
        faceDataset = FaceDataset(self.dataset_path)
        dataloader = DataLoader(faceDataset, batch_size=512, shuffle=True, num_workers=4)  # read the data into memory
        while True:
            for i, (img_data_, category_, offset_) in enumerate(dataloader):  # image, confidence, offset
                if self.isCuda:
                    img_data_ = img_data_.cuda()
                    category_ = category_.cuda()
                    offset_ = offset_.cuda()

                _output_category, _output_offset = self.net(img_data_)  # feed the images, get confidence and offset
                output_category = _output_category.view(-1, 1)  # reshape the confidence. P network outputs NCHW; R and O networks output NV
                # output_offset = _output_offset.view(-1, 4)    # reshape the offset
                # output_landmark = _output_landmark.view(-1, 10)  # reshape the landmarks (not considered for now)

                # classification (confidence) loss
                category_mask = torch.lt(category_, 2)  # exclude partial samples: mask of labels with confidence less than 2
                category = torch.masked_select(category_, category_mask)  # labels with confidence 0 and 1
                output_category = torch.masked_select(output_category, category_mask)  # matching network outputs
                cls_loss = self.cls_loss_fn(output_category, category)

                # offset loss
                offset_mask = torch.gt(category_, 0)  # negative samples do not take part
                offset = offset_[offset_mask]
                output_offset = _output_offset[offset_mask]
                offset_loss = self.offset_loss_fn(output_offset, offset)

                loss = cls_loss + offset_loss
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                print(" loss:", loss.cpu().data.numpy(),
                      " cls_loss:", cls_loss.cpu().data.numpy(),
                      " offset_loss:", offset_loss.cpu().data.numpy())

                torch.save(self.net.state_dict(), self.save_path)
                print("save success")
(9) Precautions

Once the results meet the requirement, training can simply be stopped, because the parameters are saved as you go.

On a GTX 1050 or 1060, about 48 to 72 hours of training works well.

Training for much more than 72 hours leads to overfitting (over-learning): things that are not faces start being treated as faces.

When the loss drops to about 0.2 it decreases very slowly; do not turn training off at that point.

The P network's loss can drop to about 0.02.

The dataset contains many photographs of faces, and the network will accept a face in a photo as a face.
5. Training the three networks separately (they can train at the same time)
 P network

import nets
import train

if __name__ == '__main__':
    net = nets.PNet()
    # pass in the network, the path where parameters are saved, and the dataset location
    trainer = train.Trainer(net, './param/pnet.pt', r"C:\celeba4\12")
    trainer.train()

 R network

import nets
import train

if __name__ == '__main__':
    net = nets.RNet()
    trainer = train.Trainer(net, './param/rnet.pt', r"C:\celeba4\24")
    trainer.train()

 O network

import nets
import train

if __name__ == '__main__':
    net = nets.ONet()
    trainer = train.Trainer(net, './param/onet.pt', r"C:\celeba4\48")
    trainer.train()
6. Use of network
(1) Initialization
 Import the three networks' weights

def __init__(self, pnet_param="./param/pnet.pt", rnet_param="./param/rnet.pt",
             onet_param="./param/onet.pt", isCuda=True):  # read in the three networks' weight files

 Instantiate the three networks

# instantiate the three networks
self.pnet = nets.PNet()
self.rnet = nets.RNet()
self.onet = nets.ONet()

 Use CUDA or not

self.isCuda = isCuda
if self.isCuda:
    self.pnet.cuda()
    self.rnet.cuda()
    self.onet.cuda()

 Load the parameters into the networks

self.pnet.load_state_dict(torch.load(pnet_param))
self.rnet.load_state_dict(torch.load(rnet_param))
self.onet.load_state_dict(torch.load(onet_param))

batch normalization

Batch normalization behaves differently in training and in inference. During training it normalizes with the mean and variance of the current batch of images; at inference time only a single image is passed in, so those statistics differ. Calling eval() makes the network use the statistics accumulated during training rather than the statistics of the single input image.

The following switches the trained networks to evaluation mode. (The networks in this example do not use batch normalization; you can add it yourself.)

self.pnet.eval()
self.rnet.eval()
self.onet.eval()
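To see why eval() matters, here is a minimal standalone sketch (not part of the project code, and these particular MTCNN networks do not contain batch normalization): in train() mode BatchNorm normalizes with the current batch's statistics, while in eval() mode it uses the running statistics accumulated during training, so a single image at inference time is normalized consistently.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(1)

# a "training" batch with mean roughly 5 and std roughly 3
batch = torch.randn(8, 1) * 3 + 5

bn.train()
_ = bn(batch)  # normalizes with the batch statistics and updates bn.running_mean / bn.running_var

x = torch.tensor([[5.0]])  # a single sample at inference time
bn.eval()
y = bn(x)  # normalized with the accumulated running statistics, not with x itself

print(bn.running_mean.item())  # has moved toward the batch mean (default momentum 0.1)
print(y.shape)
```

In train mode a single-sample batch would even raise an error for BatchNorm1d, because a variance cannot be estimated from one sample; eval mode avoids this entirely.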

Picture to Tensor

ToTensor():

Converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0], if the PIL Image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the numpy.ndarray has dtype = np.uint8.

self.__image_transform = transforms.Compose([
    transforms.ToTensor()
])
 Final code

def __init__(self, pnet_param="./param/pnet.pt", rnet_param="./param/rnet.pt",
             onet_param="./param/onet.pt", isCuda=True):  # read in the three networks' weights
    self.isCuda = isCuda

    # instantiate the three networks
    self.pnet = nets.PNet()
    self.rnet = nets.RNet()
    self.onet = nets.ONet()

    if self.isCuda:
        self.pnet.cuda()
        self.rnet.cuda()
        self.onet.cuda()

    # load the parameters into the networks
    self.pnet.load_state_dict(torch.load(pnet_param))
    self.rnet.load_state_dict(torch.load(rnet_param))
    self.onet.load_state_dict(torch.load(onet_param))

    self.pnet.eval()
    self.rnet.eval()
    self.onet.eval()

    self.__image_transform = transforms.Compose([
        transforms.ToTensor()
    ])
(2) P network

Analysis

Pass in a picture and collect a pile of boxes (boxes = [] receives them). Each box has the format [x1, y1, x2, y2, c], the same format the IOU uses.

From the picture's width and height, take the minimum side length, which controls how the image pyramid is built (scaling stops once the minimum side length shrinks to 12).
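The pyramid logic just described can be sketched as a standalone helper (pyramid_scales is a hypothetical name; the project inlines this loop in __pnet_detect): keep multiplying the scale by 0.7 until the shorter side is no longer greater than 12, the P network's input size.

```python
def pyramid_scales(w, h, factor=0.7, min_side=12):
    """Return the list of scales at which the P network will scan a w x h image."""
    scales = []
    scale = 1.0
    # process the image at each scale whose shorter side is still above 12 pixels
    while min(int(w * scale), int(h * scale)) > min_side:
        scales.append(scale)
        scale *= factor
    return scales

print(pyramid_scales(100, 80))  # scales 1, 0.7, 0.49, ... until the 80-pixel side drops to 12
```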

Turn the original image into a Tensor, move it to CUDA, and raise the dimension: a single incoming picture has no batch dimension, so one dimension must be added to keep the layout consistent. The shape becomes 1CHW.
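A minimal sketch of raising the dimension with unsqueeze_, as the detector does before feeding the P network:

```python
import torch

img = torch.zeros(3, 12, 12)  # a single image: C, H, W
img.unsqueeze_(0)             # in-place: add a leading batch dimension -> N, C, H, W
print(img.shape)              # torch.Size([1, 3, 12, 12])
```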

Pass the image data into the P network to get the confidence and the offset, both in NCHW format.

 Confidence: take N and C. Example shape: 1x1x2x2.

_cls[0][0].cpu().data  # _cls[0][0]: take N and C

 Offset: _offest[0] takes the batch element. Keep the positions whose confidence is greater than 0.6 by taking out their indices. (Results with confidence above 0.6 count as faces. The threshold here is deliberately low and the raw results are poor, the reason being that it is better to keep a wrong box than to let a face slip through.)

_offest[0].cpu().data  # _offest[0]: take the batch element
idxs = torch.nonzero(torch.gt(cls, 0.6))
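A small standalone example (with made-up confidence values) of what that line produces: torch.gt gives a Boolean mask, and torch.nonzero returns the (row, column) index of every position that passed the 0.6 threshold.

```python
import torch

cls = torch.tensor([[0.9, 0.3],
                    [0.1, 0.7]])  # fake 2x2 confidence map

idxs = torch.nonzero(torch.gt(cls, 0.6))
print(idxs)  # indices (0, 0) and (1, 1) passed the threshold
```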
 Reverse operation of the feature map

Find the regions these kept results correspond to on the original image. Needed: the index (two values), the offset, the confidence, and the scale.

for idx in idxs:
    boxes.append(self.__box(idx, offest, cls[idx[0], idx[1]], scale))  # cls[idx[0], idx[1]]: confidence

Map back to the top-left and bottom-right corners on the original image:

Top-left corner on the original image: (index * stride) / scale
Bottom-right corner on the original image: (index * stride + kernel size) / scale

_x1 = (start_index[1] * stride) / scale
_y1 = (start_index[0] * stride) / scale
_x2 = (start_index[1] * stride + side_len) / scale
_y2 = (start_index[0] * stride + side_len) / scale
Compute the box's coordinate points from the offset:

Offset formula: (inner x - outer x) / outer box side, so inner x = outer x + side * offset

x1 = _x1 + ow * _offset[0]
y1 = _y1 + oh * _offset[1]
x2 = _x2 + ow * _offset[2]
y2 = _y2 + oh * _offset[3]
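A worked numeric example with assumed values (feature-map index (3, 5), scale 0.7, offsets (0.1, 0.1, -0.1, -0.1)) ties the two formulas together:

```python
# assumed values, not taken from a real network run
stride, side_len, scale = 2, 12, 0.7
row, col = 3, 5
offset = [0.1, 0.1, -0.1, -0.1]

# top-left corner: (index * stride) / scale; bottom-right adds the 12x12 kernel
_x1 = (col * stride) / scale              # 10 / 0.7
_y1 = (row * stride) / scale              # 6  / 0.7
_x2 = (col * stride + side_len) / scale   # 22 / 0.7
_y2 = (row * stride + side_len) / scale   # 18 / 0.7
ow, oh = _x2 - _x1, _y2 - _y1             # both 12 / 0.7

# apply the offsets to get the regressed box
x1 = _x1 + ow * offset[0]
y1 = _y1 + oh * offset[1]
x2 = _x2 + ow * offset[2]
y2 = _y2 + oh * offset[3]
print([round(v, 2) for v in (x1, y1, x2, y2)])
```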
Total code:

def __box(self, start_index, offset, cls, scale, stride=2, side_len=12):
    # index, offset, confidence, scale, stride (fixed value), kernel size (12)
    # top-left and bottom-right corners on the original image
    _x1 = (start_index[1] * stride) / scale
    _y1 = (start_index[0] * stride) / scale
    _x2 = (start_index[1] * stride + side_len) / scale
    _y2 = (start_index[0] * stride + side_len) / scale

    ow = _x2 - _x1
    oh = _y2 - _y1

    _offset = offset[:, start_index[0], start_index[1]]
    x1 = _x1 + ow * _offset[0]
    y1 = _y1 + oh * _offset[1]
    x2 = _x2 + ow * _offset[2]
    y2 = _y2 + oh * _offset[3]

    return [x1, y1, x2, y2, cls]  # the P network's final result; same shape as used by the IOU
 Network tuning

A low confidence threshold together with a high NMS threshold causes a problem: many boxes survive the P network, so many crops are passed into the R network, the computation grows, and the whole network runs slowly.

 P network confidence threshold

idxs = torch.nonzero(torch.gt(cls, 0.6))

 P network NMS threshold

return utils.nms(np.array(boxes), 0.5)
 Final code

def __pnet_detect(self, image):  # incoming image
    boxes = []  # receives the results (a pile of boxes)
    img = image
    w, h = img.size  # picture width and height
    min_side_len = min(w, h)  # minimum side length, used to build the pyramid
    scale = 1  # initial scale

    while min_side_len > 12:
        img_data = self.__image_transform(img)
        if self.isCuda:
            img_data = img_data.cuda()
        img_data.unsqueeze_(0)  # add the batch dimension

        _cls, _offest = self.pnet(img_data)
        cls, offest = _cls[0][0].cpu().data, _offest[0].cpu().data
        idxs = torch.nonzero(torch.gt(cls, 0.6))
        for idx in idxs:
            boxes.append(self.__box(idx, offest, cls[idx[0], idx[1]], scale))

        scale *= 0.7
        _w = int(w * scale)
        _h = int(h * scale)
        img = img.resize((_w, _h))
        min_side_len = min(_w, _h)

    return utils.nms(np.array(boxes), 0.5)

# restore the regression to the original image
def __box(self, start_index, offset, cls, scale, stride=2, side_len=12):
    _x1 = (start_index[1] * stride) / scale
    _y1 = (start_index[0] * stride) / scale
    _x2 = (start_index[1] * stride + side_len) / scale
    _y2 = (start_index[0] * stride + side_len) / scale

    ow = _x2 - _x1
    oh = _y2 - _y1

    _offset = offset[:, start_index[0], start_index[1]]
    x1 = _x1 + ow * _offset[0]
    y1 = _y1 + oh * _offset[1]
    x2 = _x2 + ow * _offset[2]
    y2 = _y2 + oh * _offset[3]

    return [x1, y1, x2, y2, cls]
(3) R network

Analysis

Define an empty list to hold the cropped-out data.

Pass in the boxes from the P network.

The P network's output boxes may be rectangles or squares. First turn each rectangle into a square, and fill the extra area with the original image's background (filling with white would degrade the network's accuracy).

def convert_to_square(bbox):
    square_bbox = bbox.copy()
    if bbox.shape[0] == 0:
        return np.array([])

    h = bbox[:, 3] - bbox[:, 1]
    w = bbox[:, 2] - bbox[:, 0]
    max_side = np.maximum(h, w)
    square_bbox[:, 0] = bbox[:, 0] + w * 0.5 - max_side * 0.5
    square_bbox[:, 1] = bbox[:, 1] + h * 0.5 - max_side * 0.5
    square_bbox[:, 2] = square_bbox[:, 0] + max_side
    square_bbox[:, 3] = square_bbox[:, 1] + max_side
    return square_bbox
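A quick standalone check of that conversion (the box values are made up, and the function is restated so the snippet runs on its own): a 10x20 rectangle centred at (15, 20) becomes a 20x20 square around the same centre.

```python
import numpy as np

def convert_to_square(bbox):
    # grow each box to a square whose side is the longer of (w, h), keeping the centre
    square_bbox = bbox.copy()
    if bbox.shape[0] == 0:
        return np.array([])
    h = bbox[:, 3] - bbox[:, 1]
    w = bbox[:, 2] - bbox[:, 0]
    max_side = np.maximum(h, w)
    square_bbox[:, 0] = bbox[:, 0] + w * 0.5 - max_side * 0.5
    square_bbox[:, 1] = bbox[:, 1] + h * 0.5 - max_side * 0.5
    square_bbox[:, 2] = square_bbox[:, 0] + max_side
    square_bbox[:, 3] = square_bbox[:, 1] + max_side
    return square_bbox

box = np.array([[10.0, 10.0, 20.0, 30.0]])  # width 10, height 20
print(convert_to_square(box))  # [[ 5. 10. 25. 30.]]: a 20x20 square, same centre
```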
 Take out the P network boxes according to the R network's confidence

# the R network keeps results with confidence greater than 0.6
idxs, _ = np.where(cls > 0.6)
for idx in idxs:
    _box = _pnet_boxes[idx]  # fetch the box
 Final code

def __rnet_detect(self, image, pnet_boxes):
    _img_dataset = []  # holds the cropped-out data
    _pnet_boxes = utils.convert_to_square(pnet_boxes)  # boxes from the P network

    # get the four coordinates of each square and crop it out
    for _box in _pnet_boxes:
        _x1 = int(_box[0])
        _y1 = int(_box[1])
        _x2 = int(_box[2])
        _y2 = int(_box[3])

        img = image.crop((_x1, _y1, _x2, _y2))  # crop
        img = img.resize((24, 24))
        img_data = self.__image_transform(img)
        _img_dataset.append(img_data)

    img_dataset = torch.stack(_img_dataset)
    if self.isCuda:
        img_dataset = img_dataset.cuda()

    _cls, _offset = self.rnet(img_dataset)
    cls = _cls.cpu().data.numpy()
    offset = _offset.cpu().data.numpy()

    boxes = []
    idxs, _ = np.where(cls > 0.6)
    for idx in idxs:
        _box = _pnet_boxes[idx]
        _x1 = int(_box[0])
        _y1 = int(_box[1])
        _x2 = int(_box[2])
        _y2 = int(_box[3])

        ow = _x2 - _x1
        oh = _y2 - _y1

        x1 = _x1 + ow * offset[idx][0]
        y1 = _y1 + oh * offset[idx][1]
        x2 = _x2 + ow * offset[idx][2]
        y2 = _y2 + oh * offset[idx][3]
        boxes.append([x1, y1, x2, y2, cls[idx][0]])

    return utils.nms(np.array(boxes), 0.5)
(4) O network
Same as the R network, except that the crops are resized to 48x48, the confidence threshold rises to 0.97, and the final NMS uses isMin=True (intersection divided by the minimum area).
(5) Use network code
 detect

import time

import numpy as np
import torch
from PIL import Image
from PIL import ImageDraw
from torchvision import transforms

from MTCNN import nets
from MTCNN.tool import utils


class Detector:

    def __init__(self, pnet_param="./param/pnet.pt", rnet_param="./param/rnet.pt",
                 onet_param="./param/onet.pt", isCuda=True):  # read in the three networks' weights
        self.isCuda = isCuda

        # instantiate the three networks
        self.pnet = nets.PNet()
        self.rnet = nets.RNet()
        self.onet = nets.ONet()

        if self.isCuda:
            self.pnet.cuda()
            self.rnet.cuda()
            self.onet.cuda()

        # load the parameters into the networks
        self.pnet.load_state_dict(torch.load(pnet_param))
        self.rnet.load_state_dict(torch.load(rnet_param))
        self.onet.load_state_dict(torch.load(onet_param))

        self.pnet.eval()
        self.rnet.eval()
        self.onet.eval()

        self.__image_transform = transforms.Compose([
            transforms.ToTensor()
        ])

    def detect(self, image):
        start_time = time.time()
        pnet_boxes = self.__pnet_detect(image)
        # when the P network finds no face, return an empty array
        if pnet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_pnet = end_time - start_time
        # return pnet_boxes

        start_time = time.time()
        rnet_boxes = self.__rnet_detect(image, pnet_boxes)
        # print(rnet_boxes)
        if rnet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_rnet = end_time - start_time

        start_time = time.time()
        onet_boxes = self.__onet_detect(image, rnet_boxes)
        if onet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_onet = end_time - start_time

        t_sum = t_pnet + t_rnet + t_onet
        print("total:{0} pnet:{1} rnet:{2} onet:{3}".format(t_sum, t_pnet, t_rnet, t_onet))
        return onet_boxes

    def __rnet_detect(self, image, pnet_boxes):
        _img_dataset = []  # holds the cropped-out data
        _pnet_boxes = utils.convert_to_square(pnet_boxes)  # boxes from the P network

        # get the four coordinates of each square
        for _box in _pnet_boxes:
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            img = image.crop((_x1, _y1, _x2, _y2))  # crop
            img = img.resize((24, 24))  # resize to 24x24
            img_data = self.__image_transform(img)
            _img_dataset.append(img_data)  # append to the list

        img_dataset = torch.stack(_img_dataset)  # assemble into a batch
        if self.isCuda:
            img_dataset = img_dataset.cuda()

        _cls, _offset = self.rnet(img_dataset)
        cls = _cls.cpu().data.numpy()
        offset = _offset.cpu().data.numpy()

        boxes = []
        # the R network keeps results with confidence greater than 0.6
        idxs, _ = np.where(cls > 0.6)
        # apply the four offsets
        for idx in idxs:
            _box = _pnet_boxes[idx]
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            ow = _x2 - _x1
            oh = _y2 - _y1

            x1 = _x1 + ow * offset[idx][0]
            y1 = _y1 + oh * offset[idx][1]
            x2 = _x2 + ow * offset[idx][2]
            y2 = _y2 + oh * offset[idx][3]
            boxes.append([x1, y1, x2, y2, cls[idx][0]])

        return utils.nms(np.array(boxes), 0.5)

    def __onet_detect(self, image, rnet_boxes):
        _img_dataset = []
        _rnet_boxes = utils.convert_to_square(rnet_boxes)
        for _box in _rnet_boxes:
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            img = image.crop((_x1, _y1, _x2, _y2))
            img = img.resize((48, 48))
            img_data = self.__image_transform(img)
            _img_dataset.append(img_data)

        img_dataset = torch.stack(_img_dataset)
        if self.isCuda:
            img_dataset = img_dataset.cuda()

        _cls, _offset = self.onet(img_dataset)
        cls = _cls.cpu().data.numpy()
        offset = _offset.cpu().data.numpy()

        boxes = []
        idxs, _ = np.where(cls > 0.97)
        for idx in idxs:
            _box = _rnet_boxes[idx]
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            ow = _x2 - _x1
            oh = _y2 - _y1

            x1 = _x1 + ow * offset[idx][0]
            y1 = _y1 + oh * offset[idx][1]
            x2 = _x2 + ow * offset[idx][2]
            y2 = _y2 + oh * offset[idx][3]
            boxes.append([x1, y1, x2, y2, cls[idx][0]])

        return utils.nms(np.array(boxes), 0.7, isMin=True)  # divide by the minimum area

    def __pnet_detect(self, image):
        boxes = []
        img = image
        w, h = img.size
        min_side_len = min(w, h)
        scale = 1

        while min_side_len > 12:
            img_data = self.__image_transform(img)
            if self.isCuda:
                img_data = img_data.cuda()
            img_data.unsqueeze_(0)

            _cls, _offest = self.pnet(img_data)
            cls, offest = _cls[0][0].cpu().data, _offest[0].cpu().data
            idxs = torch.nonzero(torch.gt(cls, 0.6))
            for idx in idxs:
                boxes.append(self.__box(idx, offest, cls[idx[0], idx[1]], scale))

            # start scaling
            scale *= 0.7
            _w = int(w * scale)
            _h = int(h * scale)
            img = img.resize((_w, _h))  # zoom
            min_side_len = min(_w, _h)  # minimum side length

        return utils.nms(np.array(boxes), 0.5)  # threshold 0.5: keep boxes whose IOU is less than 0.5

    # restore the regression to the original image
    def __box(self, start_index, offset, cls, scale, stride=2, side_len=12):
        _x1 = (start_index[1] * stride) / scale
        _y1 = (start_index[0] * stride) / scale
        _x2 = (start_index[1] * stride + side_len) / scale
        _y2 = (start_index[0] * stride + side_len) / scale

        ow = _x2 - _x1
        oh = _y2 - _y1

        _offset = offset[:, start_index[0], start_index[1]]
        x1 = _x1 + ow * _offset[0]
        y1 = _y1 + oh * _offset[1]
        x2 = _x2 + ow * _offset[2]
        y2 = _y2 + oh * _offset[3]

        return [x1, y1, x2, y2, cls]


if __name__ == '__main__':
    image_file = r"D:\\20180222172119.jpg"
    detector = Detector()

    with Image.open(image_file) as im:
        boxes = detector.detect(im)
        print(im.size)
        imDraw = ImageDraw.Draw(im)
        for box in boxes:
            x1 = int(box[0])
            y1 = int(box[1])
            x2 = int(box[2])
            y2 = int(box[3])
            print(box[4])
            imDraw.rectangle((x1, y1, x2, y2), outline='red')
        im.show()
7.NMS&IOU
import numpy as np


def iou(box, boxes, isMin=False):
    # areas of the single box and of the array of boxes
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    # intersection rectangle
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2])
    yy2 = np.minimum(box[3], boxes[:, 3])

    w = np.maximum(0, xx2 - xx1)
    h = np.maximum(0, yy2 - yy1)
    inter = w * h

    if isMin:
        # intersection divided by the minimum area
        ovr = np.true_divide(inter, np.minimum(box_area, area))
    else:
        # intersection over union
        ovr = np.true_divide(inter, (box_area + area - inter))
    return ovr


def nms(boxes, thresh=0.3, isMin=False):
    if boxes.shape[0] == 0:
        return np.array([])

    # sort by confidence, highest first
    _boxes = boxes[(-boxes[:, 4]).argsort()]
    r_boxes = []

    while _boxes.shape[0] > 1:
        a_box = _boxes[0]
        b_boxes = _boxes[1:]
        r_boxes.append(a_box)

        # keep only the boxes that overlap the current box less than the threshold
        index = np.where(iou(a_box, b_boxes, isMin) < thresh)
        _boxes = b_boxes[index]

    if _boxes.shape[0] > 0:
        r_boxes.append(_boxes[0])

    return np.stack(r_boxes)


def convert_to_square(bbox):
    square_bbox = bbox.copy()
    if bbox.shape[0] == 0:
        return np.array([])

    h = bbox[:, 3] - bbox[:, 1]
    w = bbox[:, 2] - bbox[:, 0]
    max_side = np.maximum(h, w)
    square_bbox[:, 0] = bbox[:, 0] + w * 0.5 - max_side * 0.5
    square_bbox[:, 1] = bbox[:, 1] + h * 0.5 - max_side * 0.5
    square_bbox[:, 2] = square_bbox[:, 0] + max_side
    square_bbox[:, 3] = square_bbox[:, 1] + max_side
    return square_bbox


def prewhiten(x):
    mean = np.mean(x)
    std = np.std(x)
    std_adj = np.maximum(std, 1.0 / np.sqrt(x.size))
    y = np.multiply(np.subtract(x, mean), 1 / std_adj)
    return y


if __name__ == '__main__':
    # a = np.array([1, 1, 11, 11])
    # bs = np.array([[1, 1, 10, 10], [11, 11, 20, 20]])
    # print(iou(a, bs))
    bs = np.array([[1, 1, 10, 10, 40], [1, 1, 9, 9, 10], [9, 8, 13, 20, 15], [6, 11, 18, 17, 13]])
    # print(bs[:, 3].argsort())
    print(nms(bs))