"MTCNN" more than 20000 words textbook style explanation

This article explains MTCNN, one of the classic neural networks in deep learning, in detail. With more than 20,000 words, it analyses MTCNN from both the theoretical and the practical side, in a textbook-like fashion. If you haven't read about it yet, start your journey here!

Catalog:

  • Basics
  • Face recognition
  • Theoretical analysis of MTCNN
  • Detailed analysis of project code

1, Basics

1. Recognition:

(1) Digit recognition: the ideal case. The images are the same size and there is little interference (noise);

(2) Face recognition: the real-world case.

2. Video recognition.

Video runs at 24 frames per second, i.e. 24 images per second.

3. There is an upper limit on how many people a clock-in (attendance) system can identify.

Fewer: 50-70; more: 100-200.

4. Acquaintance recognition.

Used at train stations, access control, and so on. At present, acquaintance recognition only reaches an accuracy of a little over 80%.

5. Stranger recognition.

Of high commercial value.

6. Companies.

Megvii (Kuangshi Technology), SenseTime (Shangtang Technology).

7. IOU

A key and difficult point.

8. NMS

A key and difficult point.

9. Inverse calculation of the feature map

A key and difficult point.

2, Face recognition

(1) Face detection

Locate the faces in the picture.

(2) Feature extraction

Crop out the face region, feed it into the neural network, extract features, and obtain a feature vector.

(3) Face comparison

The feature vector is compared against the face features already stored in the registry, using cosine similarity.

Note: of the three steps, face detection is the most important.

3, Theoretical analysis of MTCNN

1. History of neural networks

2. History of object detection networks

(1) RCNN variants

  • RCNN –> Fast RCNN –> Faster RCNN –> YOLO (v1, v2, v3)

Of these, YOLO v3 is the most recent.

  • YOLO v2 –> YOLO9000 (can recognize 9000 classes of objects)

  • YOLO –> SSD

3. Characteristics

4. Cascade

  • Decomposition

  • Connection in series

5. Research group

Qiao Yu's group, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

6. Losses and models

(1) Loss:

The most important part of a neural network is the loss. Once the loss is worked out, roughly 90% of the problems in a neural-network project are solved; the loss is the ultimate statement of the goal. Valuable papers focus on designing the loss; a paper that only tweaks the model is of limited value.

(2) Model:

Improve network accuracy.

7. Image tracking

(1) Single target tracking

There is only one target to find in the image. There are two methods for single-target tracking:

  • Find the four coordinate values of the upper-left and lower-right corners of the target region in the image.

    • Simple, easy to implement, and used by most networks.

    • Output four values: upper-left corner and lower-right corner. The sample labels are formatted the same way.

  • Find the center point, width, and height of the target region in the image.

    • Disadvantages: the center point strongly influences the box, and computing the center point, width, and height costs more: you first need the upper-left and lower-right corners, and then compute the center point from them.

    • Edge case: if the target is only partly inside the picture (for example, half a cat), its center point may lie outside the picture. Such cases need special handling in the samples, but half-visible targets are rare in practice.

(2) Multitarget tracking

  • Three target tracking

    • Solution

      Three sets of values: each set (four coordinate values) represents one target. Use the three sets of values to box the three targets.

    • Existing problems

      All three boxes may select the same target. See Figure 1 below. It is like apples: given three people and three apples, everyone reaches for the biggest one.

Figure 1 the three boxes may all frame the same target

  • Problem solving

This problem should not be solved at the label level. Think of a street scene: amid buildings, vehicles, and other background, people are recognized because they have human characteristics, and all people share those characteristics. In the same way, to let a neural network find how many people are in a picture and where they are, the network only needs to learn those shared characteristics, i.e. to extract the features of a person. So the simplification in multi-target tracking is to let the network learn whether something is a person or not: do not train the network on a group of people, train it on a single person, so that it does one thing only, extracting a person's features. Among the output values, a confidence (0-1) indicates whether the region is a person, which is a binary classification problem; the other four values are coordinates. When there is a face, the confidence is close to 1 and the coordinate outputs are meaningful; when the confidence is close to 0, the coordinate outputs are meaningless, and four zeros are output.

The three boxes all go to the same target because they have no relation to each other. Think of it as picking apples: if three people are to take three apples, let them line up and each take one apple in turn, and the problem is solved. Similarly, sort the three boxes: once a box has framed a target, that target cannot be framed again next time, and only the remaining targets can be framed.

  • Multiple target tracking

    • Solution thought

    When the problem scales to 10 or more targets, it can be solved with the idea of a loop. First, design and train a network that outputs five values: one confidence and four coordinate values. Then, when the network is used and a face has been framed, the loop continues and frames the remaining faces.

    During use, start from the upper-left corner, as shown in Figure 2 below. This scanning is similar to convolution, but it is likely to split a face into two parts, as shown in Figure 3. The problem is the step size; the solution is to offset the first result, as shown in Figures 4 and 5. In fact the step size is itself a kind of offset, and it should be kept fairly small. A face will then be framed many times, as shown in Figure 6; how to resolve this is covered below. At this point the box we use is a fixed box, and the problem is that some faces are relatively large, as shown in Figure 7. There are two solutions: multiple suggestion boxes and the image pyramid. Multiple suggestion boxes: scan with many boxes. Prepare a group of boxes (Figure 8), each in three sizes, nine boxes in total (YOLO uses multiple suggestion boxes); use a square box to frame a face, a vertical box for a utility pole, and so on. Image pyramid: keep the box fixed and scale the image, and stop scaling when the image has shrunk to the size of the box. The scaling code is implemented with a while loop.

Figure 2 face recognition (frame selection) using trained network

Figure 3 divide a face into two parts

Figure 4 no frame for the first time

Figure 5 offset the first result

Figure 6 a face framed many times

Figure 7 disadvantages of fixed frame

Figure 8 multiple suggestion boxes

  • Overall process

Scan the image from left to right –> do not use too large a step size –> MTCNN uses an image pyramid to handle faces of different sizes –> when the step size is small, one face is framed by many boxes –> use NMS to solve this, keeping the box with the highest confidence.

  • Reminder

A 30x30 face can be recognized; this depends on how the samples are made. As shown in Figure 9.

As shown in Figure 10, the PS (edited-image) case is excluded.

MTCNN suggests a minimum face size of 12x12 (the lower limit); as shown in Figure 11, a 12x12 face magnified 2850 times.

MTCNN is better suited to faces.

Using the trained network for recognition, the confidence and the four coordinate values are the basis of recognition. The image region is fed into the network; the original image is scaled with the pyramid method for face recognition, and the faces are then framed. Each time, a patch is cut from the original image and passed to the network. When scaling, scale according to the shortest side (scaling according to the longest side does not divide evenly). The result of scaling is shown in Figure 12, and the box is translated with a step size of 2; scaling stops when the shortest side of the image reaches 12. A sketch of this pyramid loop follows below.
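
The pyramid loop described above can be sketched roughly as follows. This is a minimal illustration assuming PIL; the scale factor 0.7 follows the value mentioned later in this article, and the generator form is my own choice, not the project's code.

from PIL import Image

def image_pyramid(img, scale_factor=0.7, min_side=12):
    """Yield (scale, resized image) pairs until the shortest side falls below min_side."""
    w, h = img.size
    scale = 1.0
    while min(w, h) >= min_side:              # stop once the shortest side reaches the 12x12 input size
        yield scale, img.resize((w, h))
        scale *= scale_factor                 # shrink by a fixed ratio each round
        w, h = int(img.size[0] * scale), int(img.size[1] * scale)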

It is easy to train and difficult to use.

Figure 9 recognizing a 30x30 face

Fig. 10 a big frame containing a small frame

Figure 11 a 12x12 face magnified 2850 times

Figure 12 zoom technique

8. IOU
  • Overlap algorithm.

  • Calculates the overlap of the two boxes.

  • Intersection / Union.

(1) Purpose:

Framing.

(2) Function:

Determines whether two boxes belong to the same pile, as shown in Figure 13. When the IOU is 0, the two boxes are not one pile.

Fig. 13 IOU

(3) Intersection calculation:

The simple case is shown on the left of Figure 14: the corner coordinates can be used directly. The difficulty lies in the intersection on the right of Figure 14: first compute the coordinates of the intersection rectangle, then compute its area.

The general method (Figure 15) is as follows:

  • Upper-left corner of the intersection: take the larger of the two boxes' upper-left X values and the larger of the two upper-left Y values;

  • Lower-right corner of the intersection: take the smaller of the two boxes' lower-right X values and the smaller of the two lower-right Y values.

Figure 14 intersection calculation

Figure 15 general method of intersection calculation

(4) Union calculation:

Add the two rectangle areas and subtract the intersection area. In other words, to compute the union you must first compute the intersection (as in Figure 14). The area of a box is (lower-right x minus upper-left x) times (lower-right y minus upper-left y).

(5) Usage scenario:

  • The P and R networks use IOU (intersection / union). Because the accuracy of these two networks is low, both large and small frames are kept, with the result shown in Figure 16.

Figure 16 a large frame kept together with a small frame

  • The O network does not use intersection / union; it uses intersection / minimum area instead, so that a small box entirely inside a large box gives a ratio of 1 and can be removed.

(6) Algorithm implementation theory:

Compare a box with a bunch of boxes.

  • How are the areas of a pile of boxes calculated?

A: as shown in Figure 17, use (the third column minus the first column) * (the fourth column minus the second column). A matrix computes them all at once, quickly.

  • How is a column of data retrieved?

A: by slicing: (boxes[:,2]-boxes[:,0]) * (boxes[:,3]-boxes[:,1])

Figure 17 area calculation of a pile of boxes

(7) Code:
import numpy as np
"""IOU"""
def iou(box,boxes,isMin=False):#box format: [X1,Y1,X2,Y2,C]. Compare one box with a pile of boxes. isMin=False: divide by the union; isMin=True: divide by the minimum area.
    #Calculate the area of each box: (X2-X1) * (Y2-Y1)
    box_area=(box[2]-box[0])*(box[3]-box[1])
    boxes_area=(boxes[:,2]-boxes[:,0])*(boxes[:,3]-boxes[:,1])#The pile of boxes has the format [[X1,Y1,X2,Y2,C], [...], ...]

    """Calculate intersection area"""
    xx1=np.maximum(box[0],boxes[:,0])#Upper-left X of the intersection: the larger of the two boxes' upper-left X values
    yy1=np.maximum(box[1],boxes[:,1])#Same idea: upper-left Y
    xx2 = np.minimum(box[2], boxes[:, 2])  # Same idea: lower-right X (the smaller value)
    yy2 = np.minimum(box[3], boxes[:, 3])  # Same idea: lower-right Y

    #Judge whether there is an intersection
    w=np.maximum(0,xx2-xx1)#If xx2-xx1 is negative there is no intersection, so clamp it to 0 with np.maximum
    h=np.maximum(0,yy2-yy1)#Same principle

    #Calculate the intersection area
    inter=w*h

    if isMin:#If isMin is True, divide by the minimum area
        over=np.true_divide(inter,np.minimum(box_area,boxes_area))#The minimum area is the smaller of box_area and boxes_area
    else:#Otherwise, divide by the union area
        over = np.true_divide(inter, (box_area+boxes_area-inter))#Add the two areas and subtract the intersection area

    return over
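
A quick check of the iou() helper above, with hand-computed values (an illustration, not part of the original project code):

box = np.array([0, 0, 10, 10, 0.9])              # [X1, Y1, X2, Y2, C]
boxes = np.array([[5, 5, 15, 15, 0.8],           # overlaps: intersection 25, union 175
                  [20, 20, 30, 30, 0.7]])        # no overlap: IOU 0
print(iou(box, boxes))                           # approximately [0.1429, 0.]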

9. Threshold

When a large frame nests a small frame and they only partly overlap, the IOU is small. As in Figure 18, set a threshold, for example 0.3: when the IOU is greater than 0.3 the boxes are treated as one pile; when it is less than 0.3 they are treated as two piles.

(1) Purpose

Continue framing.

Fig. 18 threshold

10. NMS

(1) Purpose:

Remove the extra boxes.

(2) Thought

See Figure 19. First, sort by confidence. Then take the largest value, 0.98, and compare its IOU with each of the rest: the IOU of 0.98 and 0.83 is large, so they frame the same object and 0.83 is deleted; the IOU of 0.98 and 0.81 is 0, so they are two different objects and 0.81 is kept; similarly 0.67 is kept. Next, 0.81 is taken as the largest remaining value and compared with 0.67, and so on. The final result is shown in Figure 20.

For example: 0.98 0.83 0.81 0.75 0.67

NMS is done on each image. Because of the pyramid, many boxes still remain after NMS.

Fig. 19 NMS

Figure 20 final result

(3) NMS algorithm code:

Sort the pile of boxes by confidence;

take out the first box; when the number of boxes remaining is less than or equal to 1, retrieval is finished;

keep the first box that was taken out;

also keep the remaining boxes;

compare their IOU with the first box against the threshold.

"""NMS"""
def nms(boxes,thresh=0.3,isMin=False):#All boxes, thresholds, minimum area are required (pass to IOU, because IOU is calculated in NMS)
    #Sort according to the confidence level from large to small.
    _boxes=boxes[(-boxes[:,4]).argsort()] #Get a bunch of boxes sorted by confidence      #The format of the box is defined as: [[X1,Y1,X2,Y2,C], [], [], [], [],...].

    #Keep remaining boxes
    r_boxes=[]
    #Remove the first box. Because it's going to take many times, use the loop. (key)
    while _boxes.shape[0]>1:#The first frame (shape[0]) is retrieved circularly. When the dimension retrieved during the cycle is greater than 1, it indicates that there is a frame; when the dimension is less than 1, it indicates that the frame has been retrieved and the cycle is over.
        #Take out the first box
        a_box=_boxes[0]
        #Remove the remaining boxes
        b_boxes=_boxes[1:]
        #Keep first box
        r_boxes.append(a_box)

        #After comparing IOU, keep the smaller value of threshold
        index=np.where(iou(a_box,b_boxes,isMin)<thresh)#Comparing iou with threshold value: iou (a-box, B-boxes, ismin) < thresh, if iou is less than threshold value, keep it. Use np.where, when less than True.
        _boxes=b_boxes[index]
        #Save results
    if _boxes.shape[0]>0:
        r_boxes.append(_boxes[0])

    #Assemble as matrix
    return np.stack(r_boxes)
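
A minimal sanity check of nms(): two boxes on the same object and one box far away (an illustration, not part of the original project code):

test_boxes = np.array([[0, 0, 10, 10, 0.98],
                       [1, 1, 11, 11, 0.83],     # heavily overlaps the first box, IOU > 0.3
                       [50, 50, 60, 60, 0.81]])  # a separate object, IOU = 0
print(nms(test_boxes, thresh=0.3))
# Expected: the 0.98 box and the 0.81 box survive; the 0.83 box is suppressed.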

11. Choice of activation function for the coordinate values

  • Softmax:
    • Cannot be used for the coordinate outputs: its range does not meet the requirement (coordinate values can be greater than 1);
    • Softmax is exclusive: its output is a probability distribution whose sum is 1, so the outputs are coupled. The network's coordinate output (the center point and the width and height, or the two corners) consists of four values with no such relationship.

Figure 21 Softmax activation function image

  • ReLU:
    • The range is not satisfied either. When the tracked object is close to the edge of the picture and most of it lies outside, only part of it remains visible; its center point then lies outside the picture and is negative, but ReLU has no negative values.

Figure 22 ReLU activation function image

  • Sigmoid:
    • Similarly, Sigmoid has no negative values. Its range is 0-1, which would suit normalized network outputs, but the result can be negative, so it cannot be used.

Figure 23 Sigmoid activation function image

  • Tanh:
    • Tanh does have negative values, but the activation deforms the values; it can barely be used and is not necessary.

Figure 24 Tanh activation function image

  • It is best to use Y = X, i.e. no activation.

12. Programming method

Matrices (parallel computation) are used instead of for loops (serial) to improve speed. For example, one matrix expression can compute the areas of all suggestion boxes at once; a small illustration follows below.
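
For instance, the areas of a pile of boxes in the format [X1, Y1, X2, Y2] can be computed serially or in one vectorized expression (a small illustration, not project code):

import numpy as np

boxes = np.array([[0, 0, 10, 10],
                  [2, 3, 8, 9],
                  [1, 1, 4, 7]])

areas_loop = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]        # one box at a time (serial)
areas_vec = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])   # all boxes at once (parallel)
print(areas_loop, areas_vec)   # [100, 36, 18] [100 36 18]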

13. Inverse calculation of the feature map

(1) Basic ideas

When MTCNN is used for face recognition, the network acts like a convolution kernel scanning the image: the network's input size corresponds to searching the image region by region for areas that contain a face. Such a region is called a suggestion box. Whether a suggestion box contains a face is judged from 5 values: 1 confidence and 4 coordinate values; when the confidence is close to 1, the suggestion box contains a face. When moving the suggestion boxes one by one, the boxes must overlap so that no face-containing area is missed, which requires setting a step size. With a step size, the number of suggestion boxes ends up very large (not to mention the additional boxes produced by the pyramid). To end up with one suggestion box per face, IOU and NMS are used to remove most of the low-confidence suggestion boxes. In the IOU computation, the intersection of two suggestion boxes is divided by their union; computing the union requires the areas of the two suggestion boxes (the intersection is computed directly from the coordinates of its upper-left and lower-right corners, as detailed above; the union is the sum of the two areas minus the intersection area). To compute an area, you must first know the coordinates of the box's upper-left and lower-right corners. So next we use the feature map computed by the network to solve for the coordinates of the upper-left and lower-right corners of each suggestion box on the original image.

(2) Back-calculation through a single convolution

  • The ideal case.

    If convolving the original image gives a 2x2 feature map, with a 4x4 convolution kernel and a step size of 3, how are the positions on the original image calculated in reverse?

Solution:

As shown in Figure 25 below, (a) is the original image and (b) is the feature map obtained after convolution, with index values marked on it. For the index value (0,1), the upper-left corner coordinate on the original image is (3,0), where 3 is the step size; the lower-right corner coordinate is (7,4), where 7 is the step size 3 plus the convolution kernel size 4, and 4 is the convolution kernel size.

Figure 25 the ideal case: (a) original image, (b) index diagram

  • In the reverse calculation, a position on the original picture is expressed as coordinates (x, y): x corresponds to the picture's W and y to its H, i.e. the picture format is WH. However, the convolution result is in NCHW format, so the coordinates obtained by direct reverse calculation must be converted, i.e. the index must be converted to a position.

Solution:

Upper-left corner coordinate: index x step size. For example, index (0,0) maps back to the upper-left corner (0,0) * 3 = (0,0); index (1,0) maps to (1,0) * 3 = (3,0); index (0,1) maps to (0,3); index (1,1) maps to (3,3).

Lower-right corner coordinate: index x step size + convolution kernel size. For example, index (0,0) maps back to the lower-right corner (0,0) * 3 + 4 = (4,4); index (1,0) maps to (1,0) * 3 + 4 = (7,4); index (0,1) maps to (4,7); index (1,1) maps to (7,7).

Note: if the image was scaled, divide both results by the scale factor.
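
The back-calculation for this 2x2 example (step size 3, kernel 4x4) can be written as a small helper. This is a sketch assuming the index layout described above, not project code:

import numpy as np

def index_to_box(idx_x, idx_y, stride=3, kernel=4, scale=1.0):
    # Upper-left corner = index * step size; lower-right corner = index * step size + kernel size.
    # If the image was scaled, divide the result by the scale.
    x1, y1 = idx_x * stride, idx_y * stride
    x2, y2 = x1 + kernel, y1 + kernel
    return np.array([x1, y1, x2, y2]) / scale

print(index_to_box(0, 0))  # [0. 0. 4. 4.]
print(index_to_box(1, 1))  # [3. 3. 7. 7.]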

Figure 26 the actual case: (a) original image, (b) index diagram

(3) Back-calculation through multiple convolutions
  • Idea

  • Many convolutions are treated as one convolution (for example, two 3x3 convolutions replace one 5x5 convolution), i.e. many layers of the neural network are regarded as one large convolution kernel.

  • The size of the large convolution kernel equals the network's input image size.

  • The step size of the large convolution kernel equals the product of the individual small kernels' step sizes.

(4) Application of the reverse calculation of the feature map

Use the network's convolution output to calculate the position and size of each suggestion box on the original image.

Figure 27 schematic diagram of the reverse calculation

14. Network structure

The P-R-O networks are analogous to HR, technical interviewer, and supervisor in a real interview process.

The P network has the shortest processing time per sample, i.e. it is a small network with low precision and loose standards (it only makes a few cursory judgments: you seem all right, sound of mind, in good health, in short a plausible candidate), yet in practice the P network spends the longest total time because it processes the largest amount of data. The R network has higher precision (like the technical round, where the questions themselves are harder) and is slower per sample. The O network takes the longest per sample, is the largest network, and has the highest precision (think of the supervisor, who needs a long talk with the candidate, paints a big picture, and slowly wins you over; it cannot be decided in a moment).

Figure 28 MTCNN network structure

(1) P network
  • Network design

The input is 12x12 and the output is 1x1; the middle of the network can be regarded as a 12x12 convolution kernel.

  • First, a 3x3 convolution kernel with step size 1 gives a 10x10 feature map;
  • then 3x3 max pooling with step size 2 (the windows partly overlap, so less information is lost) gives a 5x5 feature map;
  • then a 3x3 convolution kernel with step size 1 gives a 3x3 feature map;
  • finally, another 3x3 convolution kernel gives a 1x1 feature map.

There are three 3x3 convolution layers, with pooling after the first, and the final result is a 1x1x32 feature map. The last layer uses full convolution instead of a fully connected layer (a fully connected layer fuses channels but limits the image size: it multiplies W and H together and receives data in NV format, while the convolutional network's format is NCHW, so a fully connected layer would require flattening C, H, and W).

The final result is output as three heads. First, a 1x1 convolution over the 1x1x32 feature map gives 1x1x2, i.e. the confidence (the original paper uses Softmax activation to get two confidence values; here it is recommended to output 1x1x1 instead and activate with Sigmoid to get a single value, since the confidence only needs one value; this changes the original paper, which was published earlier and whose design is not fully settled). Second, another 1x1 convolution over the 1x1x32 feature map gives 1x1x4, the four coordinate values of the face (two corner points, four values in total). Third, a 1x1x10 output marks the five facial keypoints: in the original paper, two points for the eyes, one for the nose, and two for the mouth.
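
To make the structure concrete, here is a minimal PyTorch sketch of a P network along these lines. It assumes PyTorch, uses the common MTCNN channel counts (10, 16, 32) and PReLU activations (assumptions, not stated in this article), and follows the single-value Sigmoid confidence suggested above rather than the paper's Softmax; it is an illustration, not the project's code.

import torch
import torch.nn as nn

class PNet(nn.Module):
    """Sketch of a P network: fully convolutional, a 12x12 input gives 1x1 output maps."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3, stride=1),               # 12x12 -> 10x10
            nn.PReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),   # 10x10 -> 5x5
            nn.Conv2d(10, 16, kernel_size=3, stride=1),              # 5x5 -> 3x3
            nn.PReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=1),              # 3x3 -> 1x1x32
            nn.PReLU(),
        )
        self.conf = nn.Conv2d(32, 1, kernel_size=1)                  # confidence head: 1x1x1
        self.offset = nn.Conv2d(32, 4, kernel_size=1)                # offset head: 1x1x4

    def forward(self, x):
        x = self.backbone(x)
        return torch.sigmoid(self.conf(x)), self.offset(x)

# A 12x12 input gives 1x1 maps; a larger input gives NxN maps, one group of 5 values per region.
net = PNet()
conf, offset = net(torch.randn(1, 3, 12, 12))
print(conf.shape, offset.shape)   # torch.Size([1, 1, 1, 1]) torch.Size([1, 4, 1, 1])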

Figure 29 P network structure

  • Network usage
  • The final 1x1x1 output is the confidence value and the 1x1x4 output is the face coordinate values. The two are handled separately, each with its own activation function. The datasets used to train them differ: training the confidence is a binary classification problem and uses face and non-face data; training the coordinate points requires every sample to contain a face, with the coordinate values differing from sample to sample.
  • What are the four coordinate values activated with? (explained earlier)

They cannot be activated with Softmax, which is exclusive: the four coordinate values should not be coupled, and Softmax's outputs sum to 1. Sigmoid's range (positive values only) is not satisfied either: when only part of a face is visible, some coordinate values lie outside the picture and become negative. Negative values could still be used, but training on half a face is not the same task as training on a whole face; in general the samples are whole faces, and it is then that coordinates can go negative. Tanh, ReLU, and Y=X can all be used, and Y=X is best, because the result needs concrete coordinate values and the network's raw outputs can be used directly. Tanh's range is satisfied but it distorts the values; ReLU distorts the negative half-axis.

  • How are coordinate values normalized?

Divide the coordinate values by the length of the image's longest side.

  • How are pixel values normalized?

Divide the pixel values by 255, i.e. by the maximum value.

(2) R network

  • First, a 3x3 convolution with step size 1;
  • 3x3 pooling with step size 2;
  • next, another 3x3 convolution with step size 1;
  • 3x3 pooling with step size 2;
  • then a 2x2 convolution;
  • finally, a fully connected layer.

Because the R network's input size is fixed and its input is the result of the P network's processing, the fully connected layer can equally well be converted to full convolution. Compared with the P network, the R network has more weights and higher precision. Finally it outputs 1 confidence and 4 coordinate values.

Figure 30 R network structure

(3) O network

The result of the R network's processing is handed to the O network. The O network has four convolution layers and three pooling layers and is larger than the R network. Finally it outputs one confidence and two coordinate points (four values).

Figure 31 O network structure

(4) Tips
  • The P network is equivalent to a 12x12 convolution kernel.

    The P network's 12x12 input refers to the suggestion box: each scan covers a 12x12 area. If the input is changed from 12x12 to 14x14, the nature of the network does not change, because the three 3x3 convolution layers in the middle are together equivalent to a 12x12 convolution kernel; the suggestion box size equals this equivalent kernel size, i.e. 12x12.

  • If the input picture is 13x13, the P network's output is 2x2x32 (with padding). The original 1x1x5 result has five values; with a 13x13 input the output has 2x2 groups of 5 values, i.e. the image has been divided into four regions. In general, if an input picture produces an NxNx32 feature map, you obtain NxN groups of 5 values (one NxNx1 confidence map and one NxNx4 coordinate map): the input picture has been scanned NxN times. In other words, for an input image of any size, the P network scans it with its 12x12 equivalent kernel to produce NxNx5 values; each region's 5 values (confidence and coordinates) are then checked to decide whether the region contains a face.

Figure 32 P network output: a 2x2x5 result

  • When the P network scans for faces with its 12x12 suggestion box, a suggestion box that contains a face frames that face region, as shown in Figure 34. In practice there are many boxes (the number depends on the step size), and the framed content varies: part of a face, a whole face, square or rectangular. The P network's results (the image regions at the boxes, as shown in Figure 35) are handed to the R network as its input data. The P network's boxes differ in size, but in the original paper the R network only accepts 24x24 input, so the P network's output regions must be converted to 24x24. A plain resize cannot be used, because it distorts the image; scaling proportionally (H and W scaled by the same ratio) gives a result smaller than 24x24, which is then padded (first padding method: keep the face centered and pad both sides; second padding method: first create a 24x24 white image, compress the framed region's longest side to 24, and place it on the white image).
  • The whole process from the P network to the R network: the P network produces a pile of boxes, i.e. a pile of data; the corresponding regions are cut out of the original image, reshaped, and fed into the R network.
  • The whole process from the R network to the O network: the R network's output has the same form as the P network's (one confidence and the offsets); the high-confidence results are kept, the corresponding regions are cut out of the original image, scaled to 48x48, and fed into the O network.

Figure 33 P network results mapped back onto the original image

Figure 34 the framed face regions

Figure 35 P network processing results

  • Why are the input sizes of the three networks different?

The P network's accuracy is the lowest, the R network's a little higher, and the O network's the highest. Accordingly, the input feature map sizes grow step by step, increasing both the amount of computation and the accuracy.

  • How do the R and O networks map their results back onto the original image?

In the same way as the P network's back-calculation. Figure 36 shows the boxes produced by the three networks.

Figure 36 the boxes produced by the three networks

  • Offsets replace coordinate points.

The network uses offsets instead of coordinate values. As shown in Figure 37, the green box is the suggestion box and the red box is the actual box. Why offsets? First, when the image pyramid scales the image, looking for absolute coordinate points has little meaning, while offsets remain meaningful: after the image is scaled, the coordinate point is no longer usable but the offset still is. Second, offsets are easy to normalize, while raw coordinate values are not. How is the offset calculated? The upper-left corner of the actual box is measured relative to the upper-left corner of the suggestion box, and the lower-right corner of the actual box relative to the lower-right corner of the suggestion box. (For the P network's results the suggestion box is the 12x12 box; for the R network's results the suggestion box is the P network's actual box; for the O network's results the suggestion box is the R network's actual box. The lower-right corner is not also referenced to the suggestion box's upper-left corner, because the lower-right coordinate values are large: the difference against the upper-left corner would be large, and dividing it by the suggestion box's side length would give a large quotient, defeating the normalization.) As shown in Figure 37, for a point b: the x offset is (Xa - X1) / W and the y offset is (Ya - Y1) / H. When the network is trained well, its result is the offset. How is the position on the original image recovered? For point b, X = X1 + offset_x * W and Y = Y1 + offset_y * H.

Figure 37 calculation of offset
  • Use of offset:

    Training and use.

  • The offset code is as follows:

 # Calculate the offset value of the coordinates
offset_x1 = (x1 - x1_) / side_len
offset_y1 = (y1 - y1_) / side_len
offset_x2 = (x2 - x2_) / side_len
offset_y2 = (y2 - y2_) / side_len
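
During use, these formulas are inverted to recover the actual box from the suggestion box and the predicted offsets (a sketch using the same variable names as above):

 # Inverse of the offset formulas: recover the actual box corners
x1 = x1_ + offset_x1 * side_len
y1 = y1_ + offset_y1 * side_len
x2 = x2_ + offset_x2 * side_len
y2 = y2_ + offset_y2 * side_len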

15. Network training

(1) Three networks

Each of the three networks can be trained separately.

(2) Two losses:

One for confidence and one for offset.

  • Confidence level:

    The tag uses 0 (no face) and 1 (with face), so there are two kinds of data: a group of data with face and a group of data without face. Labels: 0 and 1.

  • Offset:

    Every image used for offset training must contain a face, so that an offset exists. What differs between these face images, then? The position of the face, i.e. the offset. Data: positive samples and part samples. Part samples have large offsets. If the network is trained on part-sample faces, then during recognition it will frame faces that extend partly outside the box, as the figure shows.

Fig. 38 schematic diagram of part samples and positive samples (the inner red rectangle is the positive-sample box, the red circle is the whole face, and the green box is a part-sample box)

(3) Training datasets:

WIDER FACE and CelebA

  • Usage of the WIDER FACE dataset:

The faces are relatively small, and a picture contains multiple faces, so smaller faces can be tracked. Advantage: a network trained on WIDER FACE tracks more faces rather than fewer, i.e. it has a higher recall rate. Disadvantage: because the faces in the training set are small, recognition accuracy is low, i.e. the probability of a wrong box is higher.

  • Usage of the CelebA dataset:

Advantage: a network trained on CelebA recognizes faces with high accuracy. Disadvantage: the recall rate is low, i.e. smaller faces are discarded and cannot be framed.

  • The two datasets suit different situations.

This example uses the CelebA dataset.

  • Viewing the CelebA dataset (positive samples):
from PIL import Image,ImageDraw
import os

IMG_DIR = r"E:\Data\Data_AI\CelebA\Img\img_celeba.7z\img_celeba"#data
AND_DIR = r"E:\Data\Data_AI\CelebA\Anno"#Label

#Image Reading
img=Image.open(os.path.join(IMG_DIR,"000002.jpg"))
img.show()

#Read the label and draw the label position on the image
imgDraw=ImageDraw.Draw(img)
imgDraw.rectangle((72,94 ,72+221 ,94+306),outline="red")#Label text values 72 94 221 306 are X1, Y1, W, H. Converted to drawing coordinates 72, 94, 72+221, 94+306, i.e. X1, Y1, X2, Y2
img.show()

Figure 39 unframed

Figure 40 with the frame drawn

This frame is too large: the CelebA labels are loose, and a network trained on data with over-large labels also produces over-large boxes. When using the data, you can manually shrink the label boxes (shrinking them with a program generally introduces deviation) or enlarge the offsets. To get high-precision results you need to buy or build datasets; good results require roughly 1 to 1.2 million face images.
  • Viewing the WIDER FACE dataset:

The label boxes are more standard. However, the false-positive rate is higher (it will frame hair, shoes, and similar things as faces, e.g. red shoes or red hair misjudged as a face).

(4) Sample addition:

  • Theory:

Given a known positive-sample box, first compute the center point of the box; then shift the center point randomly up, down, left, and right, by at most half the height and half the width; then generate a square box around the shifted center point (because the inputs of the P, R, and O networks are squares), with a side length chosen randomly within a range around the original box's sides (in the code below, between 0.8 times the shorter side and 1.25 times the longer side). In this way a great many boxes can be drawn; some contain more of the face and some contain less. This pile of boxes provides the positive, part, and negative samples. How are they distinguished? With IOU: compute the IOU of each generated box against the original positive-sample box. The original paper recommends the following IOU values:

0-0.3: non-face (note that non-face data cannot be generated with the method above.)

0.65-1.00: face (positive sample)

0.4-0.65: some faces (some samples)

0.3-0.4: negative sample

  • Training sample proportion:

Negative sample: positive sample: partial sample: landmark = 3:1:1:2

  • Actual:

Randomly shift the center point of the original box within a certain range; take the shifted point as the center of a square. The square's side length ranges from 0.8 times the shorter of the original box's width and height to 1.25 times the longer of the two (if this proves too large, it can be adjusted). The code is as follows:

                    for _ in range(5):
                        #Make the center of the face slightly deviate
                        w_=np.random.randint(-w*0.2,w*0.2)
                        h_=np.random.randint(-h*0.2,h*0.2)
                        cx_=cx+w_
                        cy_=cy+h_

                        #Let the face form a square, and let the coordinates slightly deviate
                        side_len=np.random.randint(int(min(w,h)*0.8),np.ceil(1.25*max(w,h)))#np.ceil(): round up
                        #Coordinate point of the upper left corner of the square (clamped to 0)
                        x1_=max(cx_-side_len/2,0)
                        y1_=max(cy_-side_len/2,0)
                        # Coordinate point of the lower right corner of the square
                        x2_=x1_+side_len
                        y2_=y1_+side_len

                        crop_box=np.array([x1_,y1_,x2_,y2_])
  • Manufacturing negative samples:

  • The first method:

The parts outside the original label box are cropped out as non-faces, as shown in Figure 41.

Figure 41 how non-face (negative) samples are made

  • The second method:

Augment samples with the sample-augmentation method above and use IOU to pick out the negative ones.

  • The third method:

Generate negatives separately. First set a range value: the minimum is face_size and the maximum is half of the picture's shortest side. The upper-left coordinates are drawn from x1: 0 to (picture width minus the range value) and y1: 0 to (picture height minus the range value). The lower-right coordinates are x2 = x1 + range value and y2 = y1 + range value. (This method sometimes captures part of a face or even a whole face; the IOU threshold can be lowered, but then fewer negatives are generated.) The schematic diagram is as follows:

Figure 42 schematic diagram

The code is as follows:
                    for i in range(5):
                        side_len = np.random.randint(face_size, min(img_w, img_h) / 2)
                        x_ = np.random.randint(0, img_w - side_len)
                        y_ = np.random.randint(0, img_h - side_len)
                        crop_box = np.array([x_, y_, x_ + side_len, y_ + side_len])

                        if np.max(NMS.iou(crop_box, _boxes)) < 0.3:
                            face_crop = img.crop(crop_box)
                            face_resize = face_crop.resize((face_size, face_size), Image.ANTIALIAS)

                            negative_anno_file.write("negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0))
                            negative_anno_file.flush()
                            face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
                            negative_count += 1
  • The fourth method:

Any crawled picture that contains no face can be used as non-face data; the backgrounds should be complex.

  • Reminder

With the CelebA dataset, negative samples can be made without downloading extra non-face image data, which reduces the workload; the three augmentation methods above suffice.

  • Sample situation

12x12 positive, negative, and part samples; 24x24 positive, negative, and part samples; 48x48 positive, negative, and part samples. The three networks can be trained at the same time.

  • Performance requirement

A laptop is enough for training (each network structure is very small).

  • Label status:

  • Label: one confidence and four offsets.

  • Sample: positive sample, partial sample, negative sample.

  • Confidence values: positive sample (1), negative sample (0), part sample (2) (an arbitrary value, given only to keep the format consistent). Note: when training the confidence, only the confidence of positive samples (1) and negative samples (0) is used, not the offsets; when training the offsets, only the offsets of positive samples (1) and part samples (2) are used, not the confidence. Accordingly, the part samples' confidence and the negative samples' offsets can be assigned arbitrarily. The data are separated in the program; a masking sketch follows below.
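
A hedged PyTorch sketch of how these confidence values can separate the two losses during training (the function name, the choice of BCE and MSE losses, and the tensor shapes are assumptions for illustration, not the project's training code):

import torch
import torch.nn as nn

def mtcnn_loss(pred_conf, pred_offset, label_conf, label_offset):
    # pred_conf: [N], pred_offset: [N, 4], label_conf: [N] with values 0/1/2, label_offset: [N, 4]
    conf_mask = label_conf < 2      # confidence loss: only negatives (0) and positives (1)
    offset_mask = label_conf > 0    # offset loss: only positives (1) and part samples (2)
    conf_loss = nn.functional.binary_cross_entropy(pred_conf[conf_mask], label_conf[conf_mask].float())
    offset_loss = nn.functional.mse_loss(pred_offset[offset_mask], label_offset[offset_mask])
    return conf_loss + offset_loss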

  • The format of the created data is:

Figure 43 three network dataset folders

Figure 44 the same sample types exist in each network's dataset

  • Changes to the original paper:

The IOU values from the original paper cannot be used directly to make samples. As shown in Figure 45 below, the negative samples produced with the paper's IOU values contain parts of faces.

Figure 45 negative samples made with the paper's IOU values contain parts of faces

Some part samples are also not up to standard and contain complete faces, see Figure 46.

Figure 46 some part samples contain a complete face

Samples made strictly according to the original paper are not standard enough, and the resulting trained network performs poorly. Adjust the IOU values so that part samples contain only part of a face and positive samples contain only complete faces.
  • Negative sample tag value:

Figure 47 negative sample label values (one confidence, four offsets, 10 keypoint values)

  • Part sample label values:

Figure 48 part sample label values (one confidence, four offsets, 10 keypoint values)

  • Positive sample label values:

Figure 49 positive sample label values (one confidence, four offsets, 10 keypoint values)

16. Use of the network

(1) Detailed explanation

First, for an incoming picture (Figure 50-(0)), build an image pyramid (the P network's input size is 12x12 and the incoming picture is generally larger than 12x12, so pyramid processing of the image lets the larger faces in the picture be framed as well) and obtain a pile of face boxes, as in Figure 50-(1). Then pass these into the P network and obtain the pile of boxes shown in Figure 50-(2) (why do big boxes cover small ones? Because of the image pyramid: the more heavily the image is scaled down, the larger the corresponding box on the original image). Next, NMS removes some of the boxes; NMS is applied to the boxes on each picture, and a pile of boxes remains, though fewer than before (Figure 50-(3)). Then, according to these boxes, find the corresponding regions on the original picture, cut them out, resize each to a 24x24 square, and pass them into the R network, which performs another round of box selection (Figure 50-(4)); apply NMS to its results again, leaving a pile of boxes (Figure 50-(5)). Then cut out the regions recognized by the R network, resize them to 48x48 squares, and pass them into the O network, which selects boxes once more (Figure 50-(6)) and yields the final box, which is drawn directly without cropping (Figure 50-(7)). A hedged sketch of this whole cascade follows after the figure.
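
The whole cascade can be summarized in a hedged pseudo-code sketch. The helpers run_pnet, run_rnet, run_onet, and crop_and_resize are hypothetical placeholders, the NMS thresholds are illustrative, and image_pyramid refers to the sketch given earlier; this is not the project's detection code.

def detect(image):
    # 1. P network: scan every level of the image pyramid, then NMS over the collected boxes.
    boxes = []
    for scale, scaled_img in image_pyramid(image):
        boxes.extend(run_pnet(scaled_img, scale))                      # hypothetical P-network pass
    boxes = nms(np.array(boxes), thresh=0.5)

    # 2. R network: cut each P box from the original image, resize to 24x24, then NMS again.
    crops_24 = [crop_and_resize(image, box, 24) for box in boxes]      # hypothetical helper
    boxes = nms(run_rnet(crops_24, boxes), thresh=0.5)                 # hypothetical R-network pass

    # 3. O network: cut each R box, resize to 48x48, then a final NMS.
    crops_48 = [crop_and_resize(image, box, 48) for box in boxes]
    return nms(run_onet(crops_48, boxes), thresh=0.5)                  # hypothetical O-network pass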

Figure 50 use of the network (panels (0)-(7) correspond to the stages described above)

(2) Tips
  • The P network's accuracy is low:

Before the P network is used, pictures larger than 12x12 are scaled down, which lowers the picture resolution and, accordingly, the network's recognition accuracy.

  • The R network's accuracy is higher:

The R network takes the region framed by the P network, enlarges it, and selects the box again on the original-image region; the resolution is higher, so recognition accuracy improves.

  • The O network's accuracy is the highest:

Likewise, accuracy improves again over the R network. The O network handles the most data per sample (48x48 input pictures), which gives it the highest accuracy.

  • Question 1: how does the pyramid (Figure 50-(1)) lead to Figure 50-(2), and how do the boxes correspond to each picture?

A: they are not processed all at once. In the program, the picture is first passed into the P network and NMS leaves some boxes (stored as [[...]]); then the picture is scaled by a certain factor (such as 0.7) and passed into the P network again, giving another pile of boxes (stored as [[...], [...]]); this is repeated, and finally all the boxes are drawn on the original picture (Figure 50-(2)). The data then go into the R network, which works on the boxes produced by the P network; and so on.

  • Question 2: how are the confidence and the offsets computed in groups?

Answer: in a group of values [X1, Y1, X2, Y2, C], only C is taken out when computing the confidence, and only X1, Y1, X2, Y2 are taken out when computing the offsets.
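
For example, with NumPy slicing (an illustration, not project code):

group = np.array([[10, 20, 50, 60, 0.98],
                  [30, 15, 70, 55, 0.83]])   # a pile of [X1, Y1, X2, Y2, C] values
coords = group[:, 0:4]   # X1, Y1, X2, Y2 - used for the coordinates/offsets
conf = group[:, 4]       # C - used for the confidence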

17. Advantages and disadvantages of MTCNN

(1) Advantages

Universal tracking

(2) Disadvantages

High false-alarm rate: things that are not human faces are easily recognized as faces, mainly because the network structure is shallow. Its purpose is to filter out non-faces quickly; other networks are then applied.

4, Detailed analysis of project code

Workflow: observe the sample data – design the loss for the network – organize the data – design the network – train the network – verify

  • Observation sample data: sample data determines the final result;
  • Design loss: the loss is designed, that is, the general design of the project is completed. (core and difficulty)
  • Organize data: generally the provided data cannot meet the needs directly; for example, MTCNN needs 12x12, 24x24, and 48x48 sets, each containing positive, negative, and part samples;
  • Design network: design the network structure.
  • Train the network: use the sample data to train the network parameters until they are optimal.
  • Verification: test whether the network can achieve the expected results.

Note: the first three steps are most important.

1. Organize data

(1) Sample storage path:

Figure 51 how the samples are stored in files

(2) Creating the label text files:

The "w" mode of open(): if the file already exists, it is overwritten; if not, an empty file is created.

Project process: manufacturing sample - write network - manufacturing data set –

(3) Full code

  • First, define the sample sizes to be generated in one pass (12, 24, 48);

  • Next, declare the picture storage path. If the path does not exist, create it;

  • Next, the label storage path is declared;

  • Then, count three kinds of samples. The image storage name is stored according to the count to ensure no repetition;

  • Next, read in the label file. Traverse it line by line, skipping the first two lines;

  • Next, read the contents of each line. Read out the picture name;

  • Next, the picture is read according to the picture name and the picture path;

  • Next, make data;

  • ######Get the width and height of the picture.

  • Get the coordinates of the upper left corner of the suggestion box.

  • Gets the width and height of the suggestion box.

  • Get the coordinates of the lower right corner of the suggestion box.

  • The five keypoints are ignored.

  • Filter the fields: exclude boxes that are too small (exclude nonstandard boxes from the sample; if the label box is smaller than 40, the learned faces are very nonstandard, the trained network's box error rate is high, and accuracy drops).

  • Store four coordinate points that meet the requirements.

  • Calculate the coordinates of face center points.

  • Number of randomly generated samples.

  • The offset value of the random center point.

  • Generates a new center point based on the offset value.

  • Make a square box, offset from the original: its center is the randomly shifted center point.

  • Calculate the coordinate offsets: the offsets between the generated box and the actual label box of the sample.

  • Crop and scale (to 12x12, 24x24, or 48x48).

  • Is the sample positive, negative, or part?

    • Pass the generated box to IOU and compute the IOU value.
    • Positive sample: write the label (confidence 1); save the picture.
    • Part sample: write the label (confidence 2); save the picture.
    • Negative sample: write the label (confidence 0); save the picture. (This way there are few or even no negative samples.)
  • Negative samples are generated separately.

    First set a range value: the minimum is face_size and the maximum is half of the picture's shortest side. The upper-left coordinates are drawn from x1: 0 to (picture width minus the range value) and y1: 0 to (picture height minus the range value). The lower-right coordinates are x2 = x1 + range value and y2 = y1 + range value. (This method sometimes captures part of a face or even a whole face; the IOU threshold can be lowered, but then fewer negatives are generated.)

  • Store samples.

  • Close the files.

import os
from PIL import Image
import numpy as np
from MTCNN import NMS
import traceback


anno_src=r"E:\Data\Data_AI\CelebA\Anno\list_bbox_celeba.txt"#Label
img_dir=r"E:\Data\Data_AI\CelebA\Img\img_celeba.7z\img_celeba"#picture

save_path=r"E:\project_folder\project_AI\MTCNN\celeba1"#Storage of sorted data

for face_size in [12,24,48]:

    print("gen %i image" % face_size)

    #Sample image storage path
    positive_image_dir=os.path.join(save_path,str(face_size),"positive")
    negative_image_dir=os.path.join(save_path,str(face_size),"negative")
    part_image_dir=os.path.join(save_path,str(face_size),"part")

    #Determine whether the three folders exist. If not, create them.
    for dir_path in [positive_image_dir,negative_image_dir,part_image_dir]:
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)

    #Sample label storage path
    positive_anno_filename=os.path.join(save_path,str(face_size),"positive.txt")
    negative_anno_filename=os.path.join(save_path,str(face_size),"negative.txt")
    part_anno_filename=os.path.join(save_path,str(face_size),"part.txt")

    #Count three kinds of samples respectively, purpose: write the picture name with the number of non repetition.
    positive_count=0
    negative_count=0
    part_count=0

    try:
        # Creating text file in w mode with open permission
        positive_anno_file=open(positive_anno_filename,"w")
        negative_anno_file=open(negative_anno_filename,"w")
        part_anno_file=open(part_anno_filename,"w")

        """Get sample information"""
        #Open label
        for i ,line in enumerate(open(anno_src)):
            if i<2:
                continue
            try:
                """Read pictures"""
                #Split the line into whitespace-separated fields
                # strs=line.strip().split("")
                # strs=list(filter(bool,strs))
                strs = line.strip().split()
                image_filename=strs[0].strip()#Read the picture name. strip(): prevent spaces before and after
                print(image_filename)
                image_file=os.path.join(img_dir,image_filename)

                """Create data"""
                with Image.open(image_file) as img:#Open the picture (using the full path joined above).
                    img_w,img_h=img.size#Get the width and height of the picture
                    x1=float(strs[1].strip())
                    y1=float(strs[2].strip())
                    w=float(strs[3].strip())
                    h=float(strs[4].strip())
                    x2=float(x1+w)
                    y2=float(y1+h)

                    #5 key points (not required temporarily)
                    px1=0#float(strs[5].strip())
                    py1=0#float(strs[6].strip())
                    px2=0#float(strs[7].strip())
                    py2=0#float(strs[8].strip())
                    px3=0#float(strs[9].strip())
                    py3=0#float(strs[10].strip())
                    px4=0#float(strs[11].strip())
                    py4=0#float(strs[12].strip())
                    px5=0#float(strs[13].strip())
                    py5=0#float(strs[14].strip())

                    #Filter fields (exclude non-standard boxes from the sample. If the sample frame is less than 40, the learned face is very nonstandard, and the trained network frame error rate is very high, resulting in low accuracy.)
                    if max(w,h)<40 or x1<0 or y1<0 or w<0 or h<0:
                        continue

                    boxes=[[x1,y1,x2,y2]]#Store coordinate points that meet the requirements

                    #Calculate the location of face center point
                    cx=x1+w/2
                    cy=y1+h/2

                    #Double the number of positive and partial samples
                    for _ in range(5):
                        #Make the center of the face slightly deviate
                        w_=np.random.randint(-w*0.2,w*0.2)
                        h_=np.random.randint(-h*0.2,h*0.2)
                        cx_=cx+w_
                        cy_=cy+h_

                        #Let the face form a square, and let the coordinates slightly deviate
                        side_len=np.random.randint(int(min(w,h)*0.8),np.ceil(1.25*max(w,h)))#np.ceil(): round up
                        #Coordinate point of the upper left corner of the square
                        x1_=max(cx_-side_len/2,0)
                        y1_=max(cy_-side_len/2,0)
                        # Coordinate point of the lower right corner of the square
                        x2_=x1_+side_len
                        y2_=y1_+side_len

                        crop_box=np.array([x1_,y1_,x2_,y2_])

                        #Calculate the offset value of the coordinates
                        offset_x1=(x1-x1_)/side_len
                        offset_y1=(y1-y1_)/side_len
                        offset_x2=(x2-x2_)/side_len
                        offset_y2=(y2-y2_)/side_len

                        #Five key points (not considered temporarily)
                        offset_px1=0    # (px1-x1)/side_len
                        offset_py1 = 0  # (py1-y1)/side_len
                        offset_px2 = 0  # (px2-x2)/side_len
                        offset_py2 = 0  # (py2 -y2 )/side_len
                        offset_px3 = 0  # (px3-x3)/side_len
                        offset_py3 = 0  # (py3-y3)/side_len
                        offset_px4 = 0  # (px4-x4)/side_len
                        offset_py4 = 0  # (py4-y4)/side_len
                        offset_px5 = 0  # (px5 -x5 )/side_len
                        offset_py5 = 0  # (py5-y5)/side_len

                        #Crop and zoom the picture
                        face_crop=img.crop(crop_box)#crop: matting
                        face_resize=face_crop.resize((face_size,face_size))#Zoom to 12 * 12, 24 * 24, 48 * 48

                        #Judge whether the sample is positive, negative or partial
                        iou=NMS.iou(crop_box,np.array(boxes))[0]#Calculate IOU value
                        if iou > 0.65: #Positive samples
                            #Write the annotation line (label 1 plus offsets)
                            positive_anno_file.write(
                                "positive/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                    positive_count, 1, offset_x1, offset_y1, offset_x2, offset_y2, offset_px1, offset_py1, offset_px2, offset_py2, offset_px3,
                                    offset_py3, offset_px4, offset_py4, offset_px5, offset_py5
                                )
                            )
                            positive_anno_file.flush()
                            #Save the cropped image
                            face_resize.save(os.path.join(positive_image_dir,"{0}.jpg".format(positive_count)))
                            positive_count+=1
                        elif iou > 0.4: #Part samples
                            #Write the annotation line (label 2 plus offsets)
                            part_anno_file.write(
                                "part/{0}.jpg {1} {2} {3} {4} {5} {6} {7} {8} {9} {10} {11} {12} {13} {14} {15}\n".format(
                                    part_count, 2, offset_x1, offset_y1, offset_x2,
                                    offset_y2, offset_px1, offset_py1, offset_px2, offset_py2, offset_px3,
                                    offset_py3, offset_px4, offset_py4, offset_px5, offset_py5)
                            )
                            part_anno_file.flush()
                            #Save the cropped image
                            face_resize.save(os.path.join(part_image_dir,"{0}.jpg".format(part_count)))
                            part_count+=1
                        elif iou < 0.3: #Negative samples (the jittered crops rarely yield negatives)
                            #Write the annotation line (label 0, no offsets)
                            negative_anno_file.write(
                                "negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count, 0)
                            )
                            negative_anno_file.flush()
                            #Save the cropped image
                            face_resize.save(os.path.join(negative_image_dir,"{0}.jpg".format(negative_count)))
                            negative_count+=1

                    #Generate negative samples separately with random crops (a crop may still cover part of a face, so its IoU is checked below)
                    _boxes=np.array(boxes)

                    for i in range(5):
                        side_len=np.random.randint(face_size,min(img_w,img_h)/2)#Minimum: face_size; maximum: half of the picture's shortest side
                        x_=np.random.randint(0,img_w-side_len)
                        y_=np.random.randint(0,img_h-side_len)
                        crop_box=np.array([x_,y_,x_+side_len,y_+side_len])

                        if np.max(NMS.iou(crop_box,_boxes))<0.3:#Keep the crop only if it barely overlaps every face box (negative sample)
                            face_crop=img.crop(crop_box)
                            face_resize=face_crop.resize((face_size,face_size),Image.ANTIALIAS)

                            negative_anno_file.write("negative/{0}.jpg {1} 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n".format(negative_count,0))
                            negative_anno_file.flush()
                            face_resize.save(os.path.join(negative_image_dir, "{0}.jpg".format(negative_count)))
                            negative_count += 1
            except Exception as e:
                traceback.print_exc()


    finally:
        positive_anno_file.close()
        negative_anno_file.close()
        part_anno_file.close()
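To make the offset normalization used above concrete, here is a small standalone sketch with made-up numbers (not taken from the dataset): each offset is the gap between a ground-truth corner and the corresponding crop corner, divided by the crop's side length. Because the offsets are relative to the side length, they stay valid after the crop is resized to 12x12, 24x24 or 48x48.

# Hypothetical ground-truth box (x1, y1, x2, y2) and a square crop with its top-left at (50, 50) and side 100
x1, y1, x2, y2 = 55.0, 60.0, 145.0, 150.0
x1_, y1_, side_len = 50.0, 50.0, 100.0
x2_, y2_ = x1_ + side_len, y1_ + side_len

offset_x1 = (x1 - x1_) / side_len   # 0.05
offset_y1 = (y1 - y1_) / side_len   # 0.10
offset_x2 = (x2 - x2_) / side_len   # -0.05
offset_y2 = (y2 - y2_) / side_len   # 0.00
print(offset_x1, offset_y1, offset_x2, offset_y2)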

2. Network structure

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets,transforms

class PNet(nn.Module):

    def __init__(self):
        super(PNet,self).__init__()

        self.pre_layer=nn.Sequential(
            nn.Conv2d(3,10,kernel_size=3,stride=1),#conv1
            nn.PReLU(),#PReLU1
            nn.MaxPool2d(kernel_size=3,stride=2,ceil_mode=True),#pool1 (ceil_mode so a 12x12 input ends up as a 1x1 output map)
            nn.Conv2d(10,16,kernel_size=3,stride=1),#conv2
            nn.PReLU(),#PReLU2
            nn.Conv2d(16,32,kernel_size=3,stride=1),#conv3
            nn.PReLU()#PReLU3
        )

        self.conv4_1=nn.Conv2d(32,1,kernel_size=1,stride=1)#A confidence
        self.conv4_2=nn.Conv2d(32,4,kernel_size=1,stride=1)#Four offsets

    def forward(self,x):
        x=self.pre_layer(x)
        cond=F.sigmoid(self.conv4_1(x))#Activation confidence
        offset=self.conv4_2(x)#Inactive offset
        return cond,offset

class RNet(nn.Module):
    def __init__(self):
        super(RNet,self).__init__()
        self.pre_layer=nn.Sequential(
            nn.Conv2d(3,28,kernel_size=3,stride=1),#conv1
            nn.PReLU(),#prelu1
            nn.MaxPool2d(kernel_size=3,stride=2),#pool1
            nn.Conv2d(28,48,kernel_size=3,stride=1),#conv2
            nn.PReLU(),#prelu2
            nn.MaxPool2d(kernel_size=3,stride=2),#pool2
            nn.Conv2d(48,64,kernel_size=2,stride=1),#conv3
            nn.PReLU()#prelu3
        )
        self.conv4=nn.Linear(64*2*2,128)# conv4
        self.prelu4=nn.PReLU()# prelu4
        """On the basis of full linearity, the reliability and offset are done directly. If we want to do it with full convolution, we need to turn the linear convolution back, which is troublesome"""
        # detection
        self.conv5_1=nn.Linear(128,1)#A confidence
        # bounding box regression
        self.conv5_2=nn.Linear(128,4)#Four offsets

    def forward(self,x):
        x=self.pre_layer(x)
        x=x.view(x.size(0),-1)#deformation
        x=self.conv4(x)
        x=self.prelu4(x)
        # detection
        label=F.sigmoid(self.conv5_1(x))
        # bounding box regression
        offset=self.conv5_2(x)
        return label,offset

class ONet(nn.Module):
    def __init__(self):
        super(ONet,self).__init__()
        self.pre_layer=nn.Sequential(
            nn.Conv2d(3,32,kernel_size=3,stride=1),#conv1
            nn.PReLU(),#prelu1
            nn.MaxPool2d(kernel_size=3,stride=2),#Pool1
            nn.Conv2d(32,64,kernel_size=3,stride=1),#conv2
            nn.PReLU(),#prelu2
            nn.MaxPool2d(kernel_size=3,stride=2),#Pool2
            nn.Conv2d(64,64,kernel_size=3,stride=1),#conv3
            nn.PReLU(),#prelu3
            nn.MaxPool2d(kernel_size=2,stride=2),#Pool3
            nn.Conv2d(64,128,kernel_size=2,stride=1),#conv4
            nn.PReLU()#prelu4
        )

        self.conv5=nn.Linear(128*2*2,256)# conv5
        self.prelu5=nn.PReLU()# prelu5
        # detection
        self.conv6_1=nn.Linear(256,1)
        # bounding box regression
        self.conv6_2=nn.Linear(256,4)

    def forward(self,x):
        x=self.pre_layer(x)
        x=x.view(x.size(0),-1)
        x=self.conv5(x)
        x=self.prelu5(x)
        # detection
        label=F.sigmoid(self.conv6_1(x))
        # bounding box regression
        offset=self.conv6_2(x)
        return label,offset
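As a quick sanity check on the three architectures above, here is a small sketch (assuming the classes are defined as shown, including the ceil_mode pooling fix in PNet) that feeds dummy tensors of the canonical input sizes through each network and prints the output shapes. Because PNet is fully convolutional, a larger input yields a confidence map instead of a single value.

import torch

pnet, rnet, onet = PNet(), RNet(), ONet()

cls, off = pnet(torch.randn(1, 3, 12, 12))
print(cls.shape, off.shape)    # torch.Size([1, 1, 1, 1]) torch.Size([1, 4, 1, 1])

cls, off = pnet(torch.randn(1, 3, 24, 24))
print(cls.shape)               # torch.Size([1, 1, 7, 7]): one confidence per 12x12 window, overall stride 2

cls, off = rnet(torch.randn(2, 3, 24, 24))
print(cls.shape, off.shape)    # torch.Size([2, 1]) torch.Size([2, 4])

cls, off = onet(torch.randn(2, 3, 48, 48))
print(cls.shape, off.shape)    # torch.Size([2, 1]) torch.Size([2, 4])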

3. Dataset

  • Inherit Dataset;

  • Override three methods: __init__ (read the positive, negative and part sample annotation lines into one list), __len__ (return the length of that list), and __getitem__.

  • __getitem__:

    • Get the picture, confidence and offset from the dataset. Treat the picture as X and the confidence and offset as Y. (One annotation line looks like [path, C, X1, Y1, X2, Y2].)
    • Split the line, build the picture path, and open the picture.
    • Take out the confidence and turn it into a Tensor.
    • Do the same for the offsets.
    • Normalize the picture.
    • Return the picture, confidence and offsets.
  • Change the picture axes (a small sketch follows this list):

    NHWC –> NCHW
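A minimal sketch of that axis change, using a random tensor in place of a real image:

import torch

hwc = torch.rand(12, 12, 3)          # H x W x C, as PIL/numpy delivers it
chw = hwc.permute(2, 0, 1)           # C x H x W, as nn.Conv2d expects
print(hwc.shape, chw.shape)          # torch.Size([12, 12, 3]) torch.Size([3, 12, 12])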

from torch.utils.data import Dataset
import os
import numpy as np
import torch
from PIL import Image

class FaceDataset(Dataset):

    def __init__(self, path):
        self.path = path
        self.dataset = []
        self.dataset.extend(open(os.path.join(path, "positive.txt")).readlines())
        self.dataset.extend(open(os.path.join(path, "negative.txt")).readlines())
        self.dataset.extend(open(os.path.join(path, "part.txt")).readlines())

    def __getitem__(self, index):
        strs = self.dataset[index].strip().split(" ")
        img_path = os.path.join(self.path, strs[0])
        cond = torch.Tensor([int(strs[1])])
        offset = torch.Tensor([float(strs[2]), float(strs[3]), float(strs[4]), float(strs[5])])
        img_data = torch.Tensor(np.array(Image.open(img_path)) / 255. - 0.5)
        img_data = img_data.permute(2, 0, 1)  # Change the axes HWC -> CHW so the image matches the Conv2d input layout

        return img_data, cond, offset

    def __len__(self):
        return len(self.dataset)


if __name__ == '__main__':
    dataset = FaceDataset(r"D:\celeba4\12")
    print(dataset[0])

4. Training network

  • The three networks can be trained at the same time, in the same way:

  • Their outputs have the same form;

  • The training process is the same (load the data, run the network, compute the result);

  • Only the datasets and the networks differ; the results (confidence loss and offset loss) are of the same kind;

  • So write one module (Trainer) that can train any of the three networks. It mainly takes two parameters (the training dataset and the network) and saves the final result (the network's parameters).

  • Detailed analysis of the Trainer:

    • Pass in the network, the save path, the dataset path, and whether to use the GPU.
    • Initialize these four parameters.
    • The confidence loss uses the binary cross-entropy function.
    • The offset loss uses the mean-squared-error function.
    • Use the Adam() optimizer on the network's parameters.
    • If a model has been saved previously, load it and continue training.
    • Load the data.
    • Read the picture, confidence and offset from each batch.
    • Pass the picture into the network and get back the confidence and offset. Reshape the confidence (reason 1: the P network outputs NCHW, essentially N111, while the R and O networks output NV, essentially N1; the shapes must be unified, and we convert to NV rather than NCHW because the labels are already NV. Reason 2: when the P network receives a large picture the output is N1AA, which must be flattened to an NV structure of shape (NxAxA, 1); for example N122 becomes (Nx4, 1)).
    • Reshape the offset in the same way.
    • Compute the confidence loss: exclude the part samples. Build a mask of label confidences less than 2, use it to pick out the label entries with confidence 0 or 1, and use the same mask on the network's confidence output (the batch contains pictures of every sample type). Compute the loss between the selected labels and the selected outputs.
    • Compute the offset loss: exclude the negative samples. Build a mask of label confidences greater than 0, use it to pick out the label offsets of samples with confidence 1 or 2, and use the same mask on the network's offset output. Compute the loss between the two.
    • The total loss is the sum of the confidence loss and the offset loss.
    • Backpropagate.
    • Step the optimizer.
  • Done.

(1) Two methods of saving and loading network

  • Method 1: network parameters

Since version 0.4, loading saved parameters checks that their shapes match the model, so the network definition must be the same as the one used when the parameters were saved.

Save:

   torch.save(model.state_dict(), PATH)

When a model is saved for inference, only the trained model's learned parameters need to be saved. A common PyTorch convention is to save models with a .pt or .pth file extension.

Load:

 model = TheModelClass(*args, **kwargs)
 model.load_state_dict(torch.load(PATH))
 model.eval()

Notes:

a. model.eval() must be called to set the dropout and batch normalization layers to evaluation mode before running the inference. If this is not done, inconsistent inferences will result.

b. The load_state_dict() function accepts a dictionary object, not the path where the object is saved. This means you must first deserialize the saved state_dict before passing it to load_state_dict().

  • Method 2: network model (recommended)

Save:

  torch.save(model, PATH)

Load:

# Model class must be defined somewhere
  model = torch.load(PATH)
  model.eval()

(2) Resume training

  • Using the network parameters
if os.path.exists(self.save_path):
	net.load_state_dict(torch.load(self.save_path))
  • Using the network model
if os.path.exists(self.save_path):
	net = torch.load(self.save_path)

(3) Output result shape transformation

  • Confidence shape transformation:

  • The last layer of the P network outputs confidence in NCHW form, essentially N111.

    Note: N is the batch size; the first 1 is the confidence channel (the final 1x1 convolution maps the 32-channel feature map to a single channel); the second 1 is the feature map height H (a 12x12 input produces a 1x1 output); the third 1 is the feature map width W.

  • The output confidence of the last layer of R network is NV because of the linear layer, which is N1 in essence

  • The output confidence of the last layer of the O-network is in the form of NV, which is N1 in essence

  • The confidence label itself is a number. When the batch picture is input, the label shape changes to NV

    It is mainly the P network's NCHW output that must be transformed into the N1 structure. When the P network's input image is large, the output map can be larger than 1x1, for example 2x2, i.e. N122. This must be reshaped into an (NxAxA, 1) structure, here (Nx4, 1). For example, one picture produces a 2x2 feature map of shape 1x1x2x2, which is reshaped to the NV structure 4x1, i.e. [[1],[1],[1],[1]]. When the input picture is 12x12 the output confidence is [[1]]; when it is larger than 12x12 the output is [[1],[2],[3],...], and the program can judge each confidence in turn. A small reshape sketch follows the code lines below.

Figure 52: shape transformation

output_category = _output_category.view(-1, 1)
  • Offset shape transform:

Ditto.

output_offset = _output_offset.view(-1, 4)
  • Landmark (10 key point values) shape transformation

Ditto.

output_landmark = _output_landmark.view(-1, 10)
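A minimal sketch of these reshapes with dummy tensors (not real network outputs). During training the P network sees 12x12 crops, so view() simply flattens N111 and N411; at detection time a larger input yields a confidence map that flattens to the NV structure described above.

import torch

# Training-time P network outputs for a batch of 12x12 crops
_output_category = torch.rand(8, 1, 1, 1)
_output_offset = torch.rand(8, 4, 1, 1)
print(_output_category.view(-1, 1).shape)   # torch.Size([8, 1]) -- same NV shape as the R/O network output
print(_output_offset.view(-1, 4).shape)     # torch.Size([8, 4])

# Detection-time confidence map from a larger input, e.g. 1x1x2x2
cls_map = torch.ones(1, 1, 2, 2)
print(cls_map.view(-1, 1).shape)            # torch.Size([4, 1]): four rows, one confidence each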

(4) Calculate losses by category

  • Take positive and negative samples from the label

That is, the part samples are excluded. The figure below shows the annotation text, from which the positive and negative samples are taken: keep the rows with confidence 0 or 1 and exclude those with confidence 2.

Figure 53: label format

  • Practice method 1:
import  numpy as np

a= np.array([8,2,7,5,1,4])
print(a<5)#Boolean less than 5
print(a[a<5])#Values less than 5

Print results:

[False  True False False  True  True]
[2 1 4]
  • Practice method 2:
import  numpy as np

a= np.array([8,2,7,5,1,4])
print(np.where(a<5))#Index value less than 5
print(a[np.where(a<5)])#Value less than 5

Print results:

(array([1, 4, 5], dtype=int64),)
[2 1 4]
  • Practice method 3:
import torch

a=torch.Tensor([1,2,3,4,5])

print(a<4)#Output a Boolean mask (printed as 1s and 0s in this PyTorch version)
print(torch.lt(a,4))#lt: less than; gt: greater than; eq: equal to; le: less than equal to; ge: greater than equal to

#The following two methods are equivalent
print(a[a<4])
print(torch.masked_select(a,a<4))

Print results:

tensor([ 1,  1,  1, 0, 0])
tensor([ 1,  1,  1, 0, 0])
tensor([1., 2., 3.])
tensor([1., 2., 3.])

Code: (practice method 3)

category_mask=torch.lt(category_,2)#Exclude the part samples. Build the mask of confidences less than 2
category=torch.masked_select(category_,category_mask)#Extract the label entries with confidence 0 and 1 according to the mask
  • Take positive and negative samples from network results

Figure 54: taking out label and result data

output_category = torch.masked_select(output_category, category_mask)#Extract the entries with confidence 0 and 1 from the network results according to the mask
  • Final code
category_mask=torch.lt(category_,2)#Exclude the part samples. Build the mask of confidences less than 2
category=torch.masked_select(category_,category_mask)#Extract the label entries with confidence 0 and 1 according to the mask
output_category=torch.masked_select(output_category,category_mask)#Extract the corresponding entries from the network output according to the mask
cls_loss=self.cls_loss_fn(output_category,category)

(5) Calculate the loss of offset

  • Practice: selecting rows of a 2D array with a 1D mask
import torch
import numpy as np

a=torch.Tensor([[1,2],[3,4],[5,6],[7,8],[9,10]])
b=torch.Tensor([1,2,3,4,5])

#A 1D Boolean mask selects whole rows of the 2D tensor
print(a[b>3])

Print results:

tensor([[ 7.,  8.],
        [ 9., 10.]])
  • Final code
offset_mask=torch.gt(category_,0)[:, 0]#Negative samples do not take part; the 1D row mask keeps the N,4 offset shape intact
offset=offset_[offset_mask]
output_offset=_output_offset.view(-1,4)[offset_mask]
offset_loss =self.offset_loss_fn(output_offset,offset)
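A small sketch of this masking with dummy tensors (not real labels): the confidence labels have shape (N, 1), so the comparison produces an (N, 1) mask; taking its first column gives a one-dimensional row mask that can index the (N, 4) offsets row by row.

import torch

category_ = torch.Tensor([[0], [1], [2], [1]])   # N,1 confidences: negative, positive, part, positive
offset_ = torch.rand(4, 4)                        # N,4 offset labels

offset_mask = torch.gt(category_, 0)[:, 0]        # 1D mask: tensor([False, True, True, True])
print(offset_[offset_mask].shape)                 # torch.Size([3, 4]) -- only positive and part rows remain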

(6) Print loss

NumPy does not support CUDA, so a GPU tensor cannot be converted to a NumPy array directly. First move it from CUDA to the CPU, take .data (the loss is just a value), and then convert it to NumPy.

print(" loss:", loss.cpu().data.numpy(), " cls_loss:", cls_loss.cpu().data.numpy(), " offset_loss",offset_loss.cpu().data.numpy())

(7) Save model

torch.save(self.net.state_dict(), self.save_path)
print("save success")#Every time you save it, it shows that it is saved successfully

(8) Training network code

import os
from torch.utils.data import DataLoader
import torch
from torch import nn
import torch.optim as optim
from MTCNN.simpling import FaceDataset

class Trainer:
    def __init__(self,net,save_path,dataset_path,isCuda=True):
        self.net=net
        self.save_path=save_path
        self.dataset_path=dataset_path
        self.isCuda=isCuda

        if self.isCuda:
            self.net.cuda()

        self.cls_loss_fn=nn.BCELoss()#the Binary Cross Entropy.  Loss of confidence
        self.offset_loss_fn=nn.MSELoss()#Loss of mean square deviation.

        self.optimizer=optim.Adam(self.net.parameters())#Optimizer

        #Load when there is a network model. Function: then train.
        if os.path.exists(self.save_path):
            net.load_state_dict(torch.load(self.save_path))

    def train(self):
        faceDataset=FaceDataset(self.dataset_path)
        dataloader=DataLoader(faceDataset,batch_size=512,shuffle=True,num_workers=4)#Load the data into memory in batches
        while True:
            for i,(img_data_,category_,offset_) in enumerate(dataloader):#Picture, confidence, offset
                if self.isCuda:
                    img_data_=img_data_.cuda()
                    category_=category_.cuda()
                    offset_=offset_.cuda()

                _output_category, _output_offset =self.net(img_data_)#Enter a picture to return confidence and offset
                output_category =_output_category.view(-1,1)#Confidence shape transformation. P network output shape: NCHW; R and O network output shape: NV
                output_offset = _output_offset.view(-1, 4)#Offset shape transformation
                # output_landmark = _output_landmark.view(-1, 10)#Landmark shape transformation (not considered for now)
                # Calculate the confidence loss by category
                category_mask=torch.lt(category_,2)#Exclude the part samples. Build the mask of confidences less than 2
                category=torch.masked_select(category_,category_mask)#Extract the label entries with confidence 0 and 1 according to the mask
                output_category=torch.masked_select(output_category,category_mask)#Extract the corresponding entries from the network output according to the mask
                cls_loss=self.cls_loss_fn(output_category,category)

                #Calculate the offset loss
                offset_mask=torch.gt(category_,0)[:, 0]#Negative samples do not take part; the 1D row mask keeps the N,4 offset shape intact
                offset=offset_[offset_mask]
                output_offset=output_offset[offset_mask]
                offset_loss =self.offset_loss_fn(output_offset,offset)

                loss=cls_loss+offset_loss

                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

                print(" loss:", loss.cpu().data.numpy(), " cls_loss:", cls_loss.cpu().data.numpy(), " offset_loss",
                      offset_loss.cpu().data.numpy())

            torch.save(self.net.state_dict(), self.save_path)
            print("save success")

(9) Precautions

  • When the results meet the requirements, training can simply be stopped, because the parameters are saved after every pass over the data.

  • Training on a GTX 1050 or 1060 for 48-72 hours gives good results.

  • Training for much more than 72 hours tends to overfit: things that are not faces start to be treated as faces.

  • When the loss drops to about 0.2 it decreases very slowly; do not stop training at that point.

  • The P network loss can drop to about 0.02.

  • The dataset contains many pictures of photos and posters, and the network will also treat these as faces.

5. Training the three networks separately (they can run at the same time)

  • P network
import nets
import train

if __name__ == '__main__':
    net = nets.PNet()

    trainer = train.Trainer(net, './param/pnet.pt', r"C:\celeba4\12")#Pass in the network, the path where its parameters are saved, and the dataset path
    trainer.train()
  • R network
import nets
import train

if __name__ == '__main__':
    net = nets.RNet()

    trainer = train.Trainer(net, './param/rnet.pt', r"C:\celeba4\24")
    trainer.train()

  • O network
import nets
import train
if __name__ == '__main__':
    net = nets.ONet()

    trainer = train.Trainer(net, './param/onet.pt', r"C:\celeba4\48")
    trainer.train()

6. Use of network

(1) Initialization

  • Import three network weights
def __init__(self, pnet_param="./param/pnet.pt", rnet_param="./param/rnet.pt", onet_param="./param/onet.pt",
                 isCuda=True):#Read in three network weights
  • Instantiate three networks
#Instantiate three networks
        self.pnet = nets.PNet()
        self.rnet = nets.RNet()
        self.onet = nets.ONet()
  • Use CUDA or not
self.isCuda = isCuda
if self.isCuda:
	self.pnet.cuda()
	self.rnet.cuda()
	self.onet.cuda()
  • Load parameters to network
self.pnet.load_state_dict(torch.load(pnet_param))
self.rnet.load_state_dict(torch.load(rnet_param))
self.onet.load_state_dict(torch.load(onet_param))
  • batch normalization

    Batch normalization behaves differently in training and in inference: training computes statistics over a batch of pictures, while inference may see a single picture, so the mean and variance differ. Calling eval() makes the network use the statistics accumulated during training instead of recomputing them from the input.

    The following code switches the networks to evaluation mode when using them. (The networks in this example do not actually use batch normalization; you can add it yourself.)

self.pnet.eval()
self.rnet.eval()
self.onet.eval()
  • Picture to Tensor

    ToTensor():

Converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] if the PIL Image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the numpy.ndarray has dtype = np.uint8

self.__image_transform = transforms.Compose([
            transforms.ToTensor()
        ])
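A small sketch of what ToTensor() does, with a random uint8 array standing in for a real picture:

import numpy as np
from PIL import Image
from torchvision import transforms

arr = np.random.randint(0, 256, (48, 48, 3), dtype=np.uint8)    # H x W x C, values 0-255
img = Image.fromarray(arr)

tensor = transforms.ToTensor()(img)
print(tensor.shape, tensor.min().item(), tensor.max().item())   # torch.Size([3, 48, 48]) with values in [0.0, 1.0]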
  • Final code
    def __init__(self, pnet_param="./param/pnet.pt", rnet_param="./param/rnet.pt", onet_param="./param/onet.pt",
                 isCuda=True):#Read in three network weights

        self.isCuda = isCuda
        #Instantiate three networks
        self.pnet = nets.PNet()
        self.rnet = nets.RNet()
        self.onet = nets.ONet()

        if self.isCuda:
            self.pnet.cuda()
            self.rnet.cuda()
            self.onet.cuda()
        #Load parameters to network
        self.pnet.load_state_dict(torch.load(pnet_param))
        self.rnet.load_state_dict(torch.load(rnet_param))
        self.onet.load_state_dict(torch.load(onet_param))


        #
        self.pnet.eval()
        self.rnet.eval()
        self.onet.eval()

        self.__image_transform = transforms.Compose([
            transforms.ToTensor()
        ])

(2) P network

  • Analysis

  • Pass in a picture and get a bunch of boxes back (collected in boxes = []). Each box has the format [x1, y1, x2, y2, c], the same layout used by the IOU/NMS code.

  • Get the picture's width and height and take the minimum side length, which controls the image pyramid (the picture is scaled down until the minimum side length reaches 12).

  • Turn the original image into a Tensor, move it to CUDA, and add a batch dimension. A single picture has no batch axis, so one dimension is added to keep the shapes consistent; the shape becomes 1CHW.

  • Pass the image tensor into the P network to get the confidence and offset, both in NCHW format.

  • Confidence: index away N and C. Format: 1x1x2x2 becomes a 2x2 map.

     _cls[0][0].cpu().data# _cls[0][0]: index away N and C

Figure 55: confidence

  • Take the offset: index away N. Format: 1x4x2x2

_offest[0].cpu().data# _offest[0]: index away N

Figure 56: offset

  • Keep the results with confidence greater than 0.6 by taking out their indices. (A result with confidence greater than 0.6 is treated as a face. The threshold here is deliberately low, so the raw results are noisy; the reasoning is that at this stage it is better to keep a false positive than to miss a face.)

idxs = torch.nonzero(torch.gt(cls, 0.6))
  • Reverse operation of characteristic graph

Find these reserved result areas on the original image. You need to know: index (two values), offset, confidence, scaling.

for idx in idxs:
    boxes.append(self.__box(idx, offest, cls[idx[0], idx[1]], scale))#cls[idx[0], idx[1]]: confidence at that index

Reverse the coordinates of the upper-left and lower-right corners back onto the original image:

Upper-left corner on the original image: (index * stride) / scale

Lower-right corner on the original image: (index * stride + window size) / scale

_x1 = (start_index[1] * stride) / scale
_y1 = (start_index[0] * stride) / scale
_x2 = (start_index[1] * stride + side_len) / scale
_y2 = (start_index[0] * stride + side_len) / scale

Calculate the box's coordinates from the offsets:

Offset formula: offset = (inner x - outer x) / outer box side length, so inner x = outer x + outer side * offset

x1 = _x1 + ow * _offset[0]
y1 = _y1 + oh * _offset[1]
x2 = _x2 + ow * _offset[2]
y2 = _y2 + oh * _offset[3]

Total code:

def __box(self, start_index, offset, cls, scale, stride=2, side_len=12):#Index, offset, confidence, scale, stride (2, the overall stride of the P network), window size (12)

    	#Upper left and lower right corner of the original
        _x1 = (start_index[1] * stride) / scale
        _y1 = (start_index[0] * stride) / scale
        _x2 = (start_index[1] * stride + side_len) / scale
        _y2 = (start_index[0] * stride + side_len) / scale

        ow = _x2 - _x1
        oh = _y2 - _y1

        _offset = offset[:, start_index[0], start_index[1]]
        x1 = _x1 + ow * _offset[0]
        y1 = _y1 + oh * _offset[1]
        x2 = _x2 + ow * _offset[2]
        y2 = _y2 + oh * _offset[3]

        return [x1, y1, x2, y2, cls]#P network final result. The shape is the same as that of IOU.
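As a worked example of this reverse mapping (made-up index, scale and offsets, mirroring the arithmetic in __box rather than calling it): a detection at feature-map index (row 3, column 5) found at pyramid scale 0.7 maps back to the original image as follows.

# Hypothetical detection: feature-map index (row=3, col=5), scale 0.7, offsets [0.1, -0.05, 0.0, 0.2]
stride, side_len, scale = 2, 12, 0.7
row, col = 3, 5
offset = [0.1, -0.05, 0.0, 0.2]

_x1 = (col * stride) / scale                 # 14.29: left edge of the 12x12 window on the original image
_y1 = (row * stride) / scale                 # 8.57
_x2 = (col * stride + side_len) / scale      # 31.43
_y2 = (row * stride + side_len) / scale      # 25.71

ow, oh = _x2 - _x1, _y2 - _y1                # 17.14 x 17.14 window after undoing the pyramid scale
x1 = _x1 + ow * offset[0]                    # 16.00
y1 = _y1 + oh * offset[1]                    # 7.71
x2 = _x2 + ow * offset[2]                    # 31.43
y2 = _y2 + oh * offset[3]                    # 29.14
print([x1, y1, x2, y2])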
  • Network tuning

If the confidence threshold is set too low and the NMS threshold too high, too many boxes survive the P network; a large number of crops are then passed into the R network, the computation grows, and the detector becomes slow.

  • P network confidence
idxs = torch.nonzero(torch.gt(cls, 0.6))
  • P network threshold
return utils.nms(np.array(boxes), 0.5)
  • Final code
    def __pnet_detect(self, image):#Incoming images

        boxes = []#Receive results (a bunch of frames)

        img = image#picture
        w, h = img.size#Get picture width and height
        min_side_len = min(w, h)#Get the minimum side length to make pyramid

        scale = 1#Scale to 1

        while min_side_len > 12:
            img_data = self.__image_transform(img)#
            if self.isCuda:
                img_data = img_data.cuda()
            img_data.unsqueeze_(0)

            _cls, _offest = self.pnet(img_data)

            cls, offest = _cls[0][0].cpu().data, _offest[0].cpu().data
            idxs = torch.nonzero(torch.gt(cls, 0.6))

            for idx in idxs:
                boxes.append(self.__box(idx, offest, cls[idx[0], idx[1]], scale))

            scale *= 0.7
            _w = int(w * scale)
            _h = int(h * scale)

            img = img.resize((_w, _h))
            min_side_len = min(_w, _h)

        return utils.nms(np.array(boxes), 0.5)

    # Restore the regression to the original map
    def __box(self, start_index, offset, cls, scale, stride=2, side_len=12):

        _x1 = (start_index[1] * stride) / scale
        _y1 = (start_index[0] * stride) / scale
        _x2 = (start_index[1] * stride + side_len) / scale
        _y2 = (start_index[0] * stride + side_len) / scale

        ow = _x2 - _x1
        oh = _y2 - _y1

        _offset = offset[:, start_index[0], start_index[1]]
        x1 = _x1 + ow * _offset[0]
        y1 = _y1 + oh * _offset[1]
        x2 = _x2 + ow * _offset[2]
        y2 = _y2 + oh * _offset[3]

        return [x1, y1, x2, y2, cls]
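As a sanity check on the pyramid loop in __pnet_detect, this small sketch (with a hypothetical 1000x800 picture) counts how many scales the 0.7 factor produces before the shortest side is no longer greater than 12:

w, h = 1000, 800            # hypothetical input size
scale = 1.0
min_side_len = min(w, h)
levels = 0
while min_side_len > 12:    # same condition as __pnet_detect
    levels += 1             # the P network runs once at this scale
    scale *= 0.7
    min_side_len = min(int(w * scale), int(h * scale))
print(levels)               # 12: the P network runs on 12 progressively smaller versions of the picture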

(3) R network

  • Analysis

  • Define an empty list to hold the cropped image data.

  • Pass in the boxes from the P network.

  • The boxes output by the P network may be rectangles or squares. First turn each rectangle into a square, then crop that square from the original image so the extra area is filled with real background (padding with white instead would hurt the network's accuracy).

def convert_to_square(bbox):
    square_bbox = bbox.copy()
    if bbox.shape[0] == 0:
        return np.array([])
    h = bbox[:, 3] - bbox[:, 1]
    w = bbox[:, 2] - bbox[:, 0]
    max_side = np.maximum(h, w)
    square_bbox[:, 0] = bbox[:, 0] + w * 0.5 - max_side * 0.5
    square_bbox[:, 1] = bbox[:, 1] + h * 0.5 - max_side * 0.5
    square_bbox[:, 2] = square_bbox[:, 0] + max_side
    square_bbox[:, 3] = square_bbox[:, 1] + max_side


    return square_bbox
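A small usage sketch of convert_to_square (with a made-up box, assuming the function above is in scope): a 20x40 rectangle becomes a 40x40 square around the same centre, and the confidence column is carried along unchanged.

import numpy as np

boxes = np.array([[10., 10., 30., 50., 0.9]])   # x1, y1, x2, y2, confidence: a 20x40 rectangle
print(convert_to_square(boxes))
# [[ 0.  10.  40.  50.   0.9]] -- a 40x40 square centred on the same point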
  • Take out P network frame according to the result of R network confidence
#R network filter confidence greater than 0.6
        idxs, _ = np.where(cls > 0.6)
        for idx in idxs:
            _box = _pnet_boxes[idx]#Fetch box
  • Final code
    def __rnet_detect(self, image, pnet_boxes):

        _img_dataset = []#Store the deducted data
        _pnet_boxes = utils.convert_to_square(pnet_boxes)#Frame of incoming P network
        #Get four coordinate points of the square
        for _box in _pnet_boxes:
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])
			#Cutout
            img = image.crop((_x1, _y1, _x2, _y2))
            img = img.resize((24, 24))
            img_data = self.__image_transform(img)
            _img_dataset.append(img_data)

        img_dataset =torch.stack(_img_dataset)
        if self.isCuda:
            img_dataset = img_dataset.cuda()

        _cls, _offset = self.rnet(img_dataset)

        cls = _cls.cpu().data.numpy()
        offset = _offset.cpu().data.numpy()

        boxes = []
        idxs, _ = np.where(cls > 0.6)
        for idx in idxs:
            _box = _pnet_boxes[idx]
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            ow = _x2 - _x1
            oh = _y2 - _y1

            x1 = _x1 + ow * offset[idx][0]
            y1 = _y1 + oh * offset[idx][1]
            x2 = _x2 + ow * offset[idx][2]
            y2 = _y2 + oh * offset[idx][3]

            boxes.append([x1, y1, x2, y2, cls[idx][0]])

        return utils.nms(np.array(boxes), 0.5)

(4) O network

Same as the R network, except that the crops are resized to 48x48, the confidence threshold is higher (0.97), and the final NMS divides by the minimum area (isMin=True).

(5) Use network code

  • detect
import torch
from PIL import Image
from PIL import ImageDraw
import numpy as np

from MTCNN.tool import utils

from MTCNN import nets

from torchvision import transforms
import time


class Detector:

    def __init__(self, pnet_param="./param/pnet.pt", rnet_param="./param/rnet.pt", onet_param="./param/onet.pt",
                 isCuda=True):#Read in three network weights

        self.isCuda = isCuda
        #Instantiate three networks
        self.pnet = nets.PNet()
        self.rnet = nets.RNet()
        self.onet = nets.ONet()

        if self.isCuda:
            self.pnet.cuda()
            self.rnet.cuda()
            self.onet.cuda()
        #Load parameters to network
        self.pnet.load_state_dict(torch.load(pnet_param))
        self.rnet.load_state_dict(torch.load(rnet_param))
        self.onet.load_state_dict(torch.load(onet_param))


        #
        self.pnet.eval()
        self.rnet.eval()
        self.onet.eval()

        self.__image_transform = transforms.Compose([
            transforms.ToTensor()
        ])

    def detect(self, image):

        start_time = time.time()
        pnet_boxes = self.__pnet_detect(image)
        # When the P network finds no face, return an empty array
        if pnet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_pnet = end_time - start_time
        # return pnet_boxes

        start_time = time.time()
        #
        rnet_boxes = self.__rnet_detect(image, pnet_boxes)
        # print( rnet_boxes)
        if rnet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_rnet = end_time - start_time

        start_time = time.time()
        onet_boxes = self.__onet_detect(image, rnet_boxes)
        if onet_boxes.shape[0] == 0:
            return np.array([])
        end_time = time.time()
        t_onet = end_time - start_time

        t_sum = t_pnet + t_rnet + t_onet

        print("total:{0} pnet:{1} rnet:{2} onet:{3}".format(t_sum, t_pnet, t_rnet, t_onet))

        return onet_boxes

    def __rnet_detect(self, image, pnet_boxes):

        _img_dataset = []#Store the deducted data
        _pnet_boxes = utils.convert_to_square(pnet_boxes)#Frame of incoming P network
        #Get four coordinate points of the square
        for _box in _pnet_boxes:
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])
            #Cutout
            img = image.crop((_x1, _y1, _x2, _y2))
            #Change to 24*24
            img = img.resize((24, 24))
            #deformation
            img_data = self.__image_transform(img)
            #Close to list
            _img_dataset.append(img_data)
        #Assemble into matrix
        img_dataset =torch.stack(_img_dataset)
        if self.isCuda:
            img_dataset = img_dataset.cuda()

        _cls, _offset = self.rnet(img_dataset)

        cls = _cls.cpu().data.numpy()
        offset = _offset.cpu().data.numpy()

        boxes = []
        #R network filter confidence greater than 0.6
        idxs, _ = np.where(cls > 0.6)
        #Get four offsets
        for idx in idxs:
            _box = _pnet_boxes[idx]#
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            ow = _x2 - _x1
            oh = _y2 - _y1

            x1 = _x1 + ow * offset[idx][0]
            y1 = _y1 + oh * offset[idx][1]
            x2 = _x2 + ow * offset[idx][2]
            y2 = _y2 + oh * offset[idx][3]

            boxes.append([x1, y1, x2, y2, cls[idx][0]])

        return utils.nms(np.array(boxes), 0.5)

    def __onet_detect(self, image, rnet_boxes):

        _img_dataset = []
        _rnet_boxes = utils.convert_to_square(rnet_boxes)
        for _box in _rnet_boxes:
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            img = image.crop((_x1, _y1, _x2, _y2))
            img = img.resize((48, 48))
            img_data = self.__image_transform(img)
            _img_dataset.append(img_data)

        img_dataset = torch.stack(_img_dataset)
        if self.isCuda:
            img_dataset = img_dataset.cuda()

        _cls, _offset = self.onet(img_dataset)

        cls = _cls.cpu().data.numpy()
        offset = _offset.cpu().data.numpy()

        boxes = []
        idxs, _ = np.where(cls > 0.97)
        for idx in idxs:
            _box = _rnet_boxes[idx]
            _x1 = int(_box[0])
            _y1 = int(_box[1])
            _x2 = int(_box[2])
            _y2 = int(_box[3])

            ow = _x2 - _x1
            oh = _y2 - _y1

            x1 = _x1 + ow * offset[idx][0]
            y1 = _y1 + oh * offset[idx][1]
            x2 = _x2 + ow * offset[idx][2]
            y2 = _y2 + oh * offset[idx][3]


            boxes.append([x1, y1, x2, y2, cls[idx][0]])

        return utils.nms(np.array(boxes), 0.7, isMin=True)#isMin=True: IoU uses the smaller box's area as the denominator

    def __pnet_detect(self, image):

        boxes = []#

        img = image
        w, h = img.size
        min_side_len = min(w, h)

        scale = 1

        while min_side_len > 12:
            img_data = self.__image_transform(img)
            if self.isCuda:
                img_data = img_data.cuda()
            img_data.unsqueeze_(0)

            _cls, _offest = self.pnet(img_data)

            cls, offest = _cls[0][0].cpu().data, _offest[0].cpu().data
            idxs = torch.nonzero(torch.gt(cls, 0.6))

            for idx in idxs:
                boxes.append(self.__box(idx, offest, cls[idx[0], idx[1]], scale))
            #Start scaling
            scale *= 0.7
            _w = int(w * scale)
            _h = int(h * scale)

            img = img.resize((_w, _h))#zoom
            min_side_len = min(_w, _h)#Minimum side length

        return utils.nms(np.array(boxes), 0.5)#Threshold 0.5. Keep boxes with IOU less than 0.5

    # Restore the regression to the original map
    def __box(self, start_index, offset, cls, scale, stride=2, side_len=12):

        _x1 = (start_index[1] * stride) / scale
        _y1 = (start_index[0] * stride) / scale
        _x2 = (start_index[1] * stride + side_len) / scale
        _y2 = (start_index[0] * stride + side_len) / scale

        ow = _x2 - _x1
        oh = _y2 - _y1

        _offset = offset[:, start_index[0], start_index[1]]
        x1 = _x1 + ow * _offset[0]
        y1 = _y1 + oh * _offset[1]
        x2 = _x2 + ow * _offset[2]
        y2 = _y2 + oh * _offset[3]

        return [x1, y1, x2, y2, cls]


if __name__ == '__main__':

    image_file = r"D:\\20180222172119.jpg"
    detector = Detector()

    with Image.open(image_file) as im:
        # boxes = detector.detect(im)
        # print("----------------------------")
        boxes = detector.detect(im)
        print(im.size)
        imDraw = ImageDraw.Draw(im)
        for box in boxes:
            x1 = int(box[0])
            y1 = int(box[1])
            x2 = int(box[2])
            y2 = int(box[3])

            print(box[4])
            imDraw.rectangle((x1, y1, x2, y2), outline='red')

        im.show()

7. NMS & IOU

import numpy as np


def iou(box, boxes, isMin = False):
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2])
    yy2 = np.minimum(box[3], boxes[:, 3])

    w = np.maximum(0, xx2 - xx1)
    h = np.maximum(0, yy2 - yy1)

    inter = w * h
    if isMin:
        ovr = np.true_divide(inter, np.minimum(box_area, area))
    else:
        ovr = np.true_divide(inter, (box_area + area - inter))

    return ovr


def nms(boxes, thresh=0.3, isMin = False):

    if boxes.shape[0] == 0:
        return np.array([])

    _boxes = boxes[(-boxes[:, 4]).argsort()]
    r_boxes = []

    while _boxes.shape[0] > 1:
        a_box = _boxes[0]
        b_boxes = _boxes[1:]

        r_boxes.append(a_box)

        # print(iou(a_box, b_boxes))

        index = np.where(iou(a_box, b_boxes,isMin) < thresh)
        _boxes = b_boxes[index]

    if _boxes.shape[0] > 0:
        r_boxes.append(_boxes[0])

    return np.stack(r_boxes)


def convert_to_square(bbox):
    square_bbox = bbox.copy()
    if bbox.shape[0] == 0:
        return np.array([])
    h = bbox[:, 3] - bbox[:, 1]
    w = bbox[:, 2] - bbox[:, 0]
    max_side = np.maximum(h, w)
    square_bbox[:, 0] = bbox[:, 0] + w * 0.5 - max_side * 0.5
    square_bbox[:, 1] = bbox[:, 1] + h * 0.5 - max_side * 0.5
    square_bbox[:, 2] = square_bbox[:, 0] + max_side
    square_bbox[:, 3] = square_bbox[:, 1] + max_side


    return square_bbox

def prewhiten(x):
    mean = np.mean(x)
    std = np.std(x)
    std_adj = np.maximum(std, 1.0/np.sqrt(x.size))
    y = np.multiply(np.subtract(x, mean), 1/std_adj)
    return y


if __name__ == '__main__':
    # a = np.array([1,1,11,11])
    # bs = np.array([[1,1,10,10],[11,11,20,20]])
    # print(iou(a,bs))

    bs = np.array([[1, 1, 10, 10, 40], [1, 1, 9, 9, 10], [9, 8, 13, 20, 15], [6, 11, 18, 17, 13]])
    # print(bs[:,3].argsort())
    print(nms(bs))
