These are the learning notes for the following course.

Address: http://www.auto-mooc.com/mooc/detail?mooc_id=F51511B0209FB73D81EAC260B63B2A21

Courseware data storage address: to be updated


# 7.0 neurons and perceptrons

- perceptron
- Shallow neural network
- BP back propagation

SVM and logistic regression can solve many nonlinear problems.

In 2010, Fei-Fei Li (later a chief scientist at Google) launched the ImageNet image classification competition. In 2012, Geoffrey Hinton, a professor at the University of Toronto, and his team entered that year's ImageNet ILSVRC challenge with AlexNet, a deep convolutional neural network, and won by an astonishing margin (an error rate roughly 10% lower than the runner-up). The resulting paper, published at NIPS 2012, is considered the start of the deep learning boom: the spring of neural networks.

On October 26, 2017, Hinton published his capsule networks paper, which caused a great stir in the AI community. Hinton declared that "the era of convolutional neural networks (CNNs) is over!", overturning decades of his own research.

## 7.1 perceptron

The perceptron is very similar to logistic regression; logistic regression is more like an improvement of the perceptron. The concept appeared in the 1950s, with applications in the 1960s.

In 1943, a model of the neuron was proposed based on how nerve cells in the human brain work (the McCulloch-Pitts neuron). Its problem was that its behavior was fixed: the processing done by the cell body could not be adjusted.

The concept of weights was introduced in 1956.

### Bias and variance (New)

Reference: https://zhuanlan.zhihu.com/p/38471518

Bias represents the difference between the model's expected prediction and the true value.

Variance represents the variability, or stability, of the model's predictions at a given observation point.

Here is a marksman analogy. The bias is the offset between the marksman's aim and the center of the target, while the variance is the marksman's consistency. There are also uncontrollable factors in actual shooting, such as the influence of wind, which we call noise.

Low Bias together with low Variance gives the ideal result, but the two cannot both be minimized at once. **If you reduce the model's Bias, you will to some extent increase its Variance, and vice versa.** Take k-NN as an example of the relationship between Bias, Variance, and the parameter k. In k-NN, the expected error at a point x has the form

Err(x) = (f(x) − (1/k) Σᵢ f(xᵢ))² + σ²/k + σ²

where x₁, …, x_k are the k nearest neighbors of x in the training set and σ² is the noise. When k is very small the Bias is very low but the Variance is very large; as k increases the Bias grows noticeably while the Variance shrinks. This is the Bias-Variance trade-off.

To reduce the model's error rate, the model must fit the training set more "accurately". This usually increases model complexity but reduces the model's ability to generalize to the whole data set: on new data the model becomes very unstable, producing a high Variance, which is overfitting. To avoid overfitting, we cannot rely entirely on the limited training data; we need to add constraints (such as regularization) to improve the model's stability and reduce the Variance. As a result, however, the model's Bias increases and its fitting capacity may become insufficient, causing underfitting. We therefore need to find a trade-off between Bias and Variance.
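The trade-off can be seen in a small simulation. The sketch below (pure NumPy; the target function `f`, the point `x0`, and the trial counts are illustrative choices, not from the course) estimates the squared bias and the variance of a k-NN regressor at one point for k = 1 and k = 25.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)    # toy target function (illustrative)
x0, sigma, n_trials = 0.25, 0.3, 2000  # evaluate at a peak of f

def knn_predict(x_train, y_train, x, k):
    # average the targets of the k nearest neighbors of x
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

results = {}
for k in (1, 25):
    preds = []
    for _ in range(n_trials):
        x_train = rng.uniform(0, 1, 50)
        y_train = f(x_train) + rng.normal(0, sigma, 50)
        preds.append(knn_predict(x_train, y_train, x0, k))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2  # squared bias at x0
    var = preds.var()                    # variance at x0
    results[k] = (bias2, var)
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={var:.4f}")
```

With k = 1 the bias is tiny but the variance is close to the noise level; with k = 25 the averaging smooths away the peak of f (raising the bias) while shrinking the variance.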

Perceptron experiment: computing the weights.

In each row of the dataset, the first two values are the inputs and the third is the output. At a glance, the three candidate datasets encode the relations logical AND, logical OR, and logical XOR:

```python
# dataset = [[0,0,0],[0,1,0],[1,0,0],[1,1,1]]  # logical AND
dataset = [[0,0,0],[0,1,1],[1,0,1],[1,1,1]]    # logical OR
# dataset = [[0,0,0],[0,1,1],[1,0,1],[1,1,0]]  # logical XOR
```

```python
import matplotlib.pyplot as plt
import numpy as np

def predict(row, weights):
    activation = weights[0]  # weights[0] is the bias
    for i in range(len(row) - 1):  # accumulate the weighted inputs
        activation += weights[i + 1] * row[i]
    # return 1.0 if activation >= 0.0 else 0.0
    return 1.0 if activation > 0 else 0.0

# Estimate the perceptron parameters using stochastic gradient descent
def train_weights(train, l_rate, n_epoch):
    weights = [0.0 for i in range(len(train[0]))]  # initial values are 0
    for epoch in range(n_epoch):  # each epoch updates the parameters with every training example
        sum_error = 0.0
        print("Current parameter values:")
        print(weights)
        for row in train:
            print("Training example:")
            print(row)
            prediction = predict(row, weights)  # predict each example with the current weights
            print("Expected=%d, Predicted=%d" % (row[-1], prediction))
            error = row[-1] - prediction  # target value minus prediction
            sum_error += error ** 2  # square the error so positive and negative errors do not cancel
            weights[0] = weights[0] + l_rate * error  # update the bias directly from the error
            for i in range(len(row) - 1):  # update every weight for each error
                weights[i + 1] = weights[i + 1] + l_rate * error * row[i]  # gradient descent step
        print("Parameters after this epoch:")
        print(weights)
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    print(weights)
    return weights

# Calculate weights
# dataset = [[0,0,0],[0,1,0],[1,0,0],[1,1,1]]  # logical AND
dataset = [[0,0,0],[0,1,1],[1,0,1],[1,1,1]]    # logical OR
# dataset = [[0,0,0],[0,1,1],[1,0,1],[1,1,0]]  # logical XOR
l_rate = 0.1
n_epoch = 5
weights = train_weights(dataset, l_rate, n_epoch)
print(weights)
for row in dataset:
    prediction = predict(row, weights)
    print("Expected=%d, Predicted=%d" % (row[-1], prediction))
for data in dataset:
    print(data)
    if data[-1] == 1:
        plt.scatter(data[0], data[1], 25, 'r')
    else:
        plt.scatter(data[0], data[1], 25, 'g')
X = np.arange(-2, 2, 0.1)
Y = [-(weights[0] + weights[1] * x) / weights[2] for x in X]
plt.plot(X, Y, 'y*')
plt.show()
```

A simple perceptron cannot solve problems that are not linearly separable, such as the XOR problem; it works well on linearly separable problems such as logical AND and logical OR. This is also why the perceptron was ignored for the first few decades.

# 8.1 shallow neural network

- Hidden layer
- Activation function
- Feedforward neural network

A single perceptron cannot solve the XOR problem, so we extend the single-layer perceptron!

In a multi-layer neural network, the basic operator is no longer called a perceptron but a "neuron".

### Dimensions of the weights:

input layer and hidden layer | hidden layer and output layer |
---|---|
In the example, 3 × 2; general formula: hidden dimension × input dimension | In the example, 1 × 3; general formula: output dimension × hidden dimension |
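As a quick sanity check, the weight shapes for a hypothetical [2, 3, 1] network (sizes chosen only for illustration) can be generated the same way the MNIST code later in these notes does:

```python
import numpy as np

sizes = [2, 3, 1]  # hypothetical layer sizes: 2 inputs, 3 hidden nodes, 1 output
# one weight matrix per adjacent layer pair, shaped (next layer, previous layer)
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
print([w.shape for w in weights])  # → [(3, 2), (1, 3)]
```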

Feedforward neural network: the outputs do not feed back into the inputs.

Multilayer neural networks can solve nonlinear problems.
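As an illustration (the weights below are hand-picked for this sketch, not learned), one hidden layer with step activations is already enough to compute XOR:

```python
import numpy as np

# step activation: fires when the weighted sum is positive
step = lambda z: (z > 0).astype(float)

W1 = np.array([[1.0, 1.0],   # hidden unit 1: fires for OR(x1, x2)
               [1.0, 1.0]])  # hidden unit 2: fires for AND(x1, x2)
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])   # output: OR and not AND, i.e. XOR
b2 = -0.5

def forward(x):
    h = step(W1 @ x + b1)     # hidden layer
    return step(W2 @ h + b2)  # output layer

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, int(forward(np.array(x, dtype=float))))
```

The hidden layer remaps the four points so that the output neuron only needs a linear boundary, which is exactly what a single perceptron lacked.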

### BP back propagation: updating weights

Solving the XOR problem with a two-layer neural network:

```python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):  # activation function; used in the hidden and output layers
    return 1 / (1 + np.exp(-x))

def s_prime(z):
    # derivative of the sigmoid, expressed in terms of the sigmoid output z
    return np.multiply(z, 1.0 - z)
    # return np.multiply(sigmoid(z), (1 - sigmoid(z)))

def init_weights(layers, epsilon):
    weights = []
    for i in range(len(layers) - 1):  # generate parameters for each layer; every input connects to all nodes of the next layer
        w = np.random.rand(layers[i+1], layers[i] + 1)  # layers[i+1] nodes in the next layer; layers[i]+1 includes the bias
        w = w * 2 * epsilon - epsilon  # rand gives values in 0..1; with epsilon=1 this rescales w to -1..1
        weights.append(np.mat(w))
    print(weights)
    return weights

# a_s stores the input value and the computed sigmoid value of each layer
def back(X, Y, w):
    # each parameter starts with a gradient of 0
    w_grad = ([np.mat(np.zeros(np.shape(w[i]))) for i in range(len(w))])  # len(w) equals the number of weight layers
    m, n = X.shape  # training-set size, 4 x 2
    output = np.zeros((m, 1))  # predicted values of all 4 samples
    for i in range(m):  # for each training sample; X, Y are matrices
        x = X[i]
        y = Y[0, i]
        # forward propagation
        a = x
        a_s = []
        for j in range(len(w)):  # len(w) is the number of weight layers (3 - 1 = 2)
            a = np.mat(np.append(1, a)).T  # prepend 1 so the first term of the weight vector acts as the bias
            a_s.append(a)  # the a values of the first L-1 layers are saved here: input layer; hidden layer
            net = w[j] * a
            a = sigmoid(net)  # the output of one layer is the input of the next
        output[i, 0] = a
        # back propagation
        delta = a - y.T  # difference between predicted and true value; also the product of cross-entropy and the sigmoid derivative
        w_grad[-1] += delta * a_s[-1].T  # gradient of layer L-1
        # walk back from the second-to-last layer
        for j in reversed(range(1, len(w))):  # len(w) = 2, so j is only 1
            delta = np.multiply(w[j].T * delta, s_prime(a_s[j]))  # sigmoid is used in every intermediate layer
            w_grad[j-1] += (delta[1:] * a_s[j-1].T)  # delta[0] is dropped: the constant term has no effect on the weights
    w_grad = [w_grad[i] / m for i in range(len(w))]
    cost = (1.0 / m) * np.sum(-Y * np.log(output) - (np.array([[1]]) - Y) * np.log(1 - output))  # cross-entropy cost function
    return {'w_grad': w_grad, 'cost': cost, 'output': output}

X = np.mat([[0,0], [0,1], [1,0], [1,1]])
print(X)
Y = np.mat([0,1,1,0])
print(Y)
layers = [2,2,1]  # neural network structure
epochs = 2000     # number of iterations
alpha = 0.5       # parameter update step
w = init_weights(layers, 1)
result = {'cost': [], 'output': []}
w_s = {}
for i in range(epochs):
    back_result = back(X, Y, w)
    w_grad = back_result.get('w_grad')
    cost = back_result.get('cost')
    output_current = back_result.get('output')
    result['cost'].append(cost)
    result['output'].append(output_current)
    for j in range(len(w)):
        w[j] -= alpha * w_grad[j]
    if i == 0 or i == (epochs - 1):
        # print('w_grad', w_grad)
        w_s['w_' + str(i)] = w_grad[:]
plt.plot(result.get('cost'))
plt.show()
print(w_s)
print('output:')
print(result.get('output')[0], '\n', result.get('output')[-1])

# Plot the XOR points
plt.figure()
X = np.asarray(X)
Y = np.asarray(Y)
for i in range(len(Y[0])):
    if Y[0][i] == 1:
        plt.scatter(X[i][0], X[i][1], 25, 'r')
    else:
        plt.scatter(X[i][0], X[i][1], 25, 'g')
# Draw based on the parameters computed by back propagation
plt.show()
```

Reference: https://zhuanlan.zhihu.com/p/41785031

```python
layers = [2,2,1]  # neural network structure
epochs = 2000     # number of iterations
alpha = 0.5       # parameter update step
```

Hyperparameters, also called framework parameters, are the knobs we turn to control the model's structure, capacity, and efficiency. A hyperparameter is a parameter that influences the final values of the learned parameters; it is a framework parameter of the learning model. Hyperparameters are specified manually and tuned continuously, and the network's performance depends on their values.

- learning rate
- epochs (number of iterations, also called num of iterations)
- number of hidden layers
- number of hidden-layer units
- activation function
- batch size (the size of each batch when using mini-batch SGD)
- optimizer (which optimizer to choose, e.g. SGD, RMSProp, Adam)
- optimizer-specific settings such as β1 and β2 when using RMSProp or Adam
- ……

There are many more; the above are some of the most common hyperparameters. Tuning a deep learning model largely means adjusting these framework parameters.

In addition, more layers generally give higher accuracy, but training takes longer and the amount of computation grows; fully connected layers should be used sparingly because their computational cost is large.

### Definition of parameters:

Parameters are the end product of neural network training; the most basic are the network's weights W and biases b. The purpose of training is to find a set of model parameters that predicts unknown results well. We do not need to tune these parameters by hand: they are updated and produced automatically during model training.

## Two layer neural network for handwritten digit recognition

```python
# Create the network object: three layers, with (784, 30, 10) nodes per layer
net = Network([784, 30, 10])
# Train the network (weights and biases) with (mini-batch) gradient descent
# and generate test results: 30 epochs, mini-batch size 10, learning rate 3.0
net.back(training_data, 30, 10, 3.0, test_data=test_data)
```

784: because the input image is 28 pixels x 28 pixels

10: Because the output is 0-9, ten numbers

30: the number of hidden nodes, which comes from experience. Common heuristics include the square root of the product of the input and output sizes, or a logarithmic rule; it is empirical.

A network with two weight layers and 30 hidden nodes can already achieve good results.
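For the (784, 30, 10) network, the number of trainable parameters can be counted directly (a small sketch; the counting rule is one weight per connection plus one bias per non-input node):

```python
sizes = [784, 30, 10]
n_weights = sum(x * y for x, y in zip(sizes[:-1], sizes[1:]))  # 784*30 + 30*10
n_biases = sum(sizes[1:])                                      # one bias per non-input node
print(n_weights + n_biases)  # → 23860
```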

Code:

```python
# %load network.py
import random
import numpy as np
import mnist_loader

class Network(object):

    def __init__(self, sizes):
        """sizes is the shape of the network; (784, 30, 10) means
        784 input nodes, 30 hidden nodes, and 10 output nodes."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        # Random initial parameters for the fully connected network.
        # Biases are generated for every layer except the input layer.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # Weights between adjacent layers; every input has a weight to each node of the next layer.
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Mainly used on the test set to check the results: multiply weights
        and values, add the bias, then apply the sigmoid."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a

    def back(self, training_data, epochs, mini_batch_size, eta,  # stochastic gradient descent
             test_data=None):
        """Update the gradients mini-batch by mini-batch."""
        training_data = list(training_data)
        n = len(training_data)
        if test_data:
            test_data = list(test_data)
            n_test = len(test_data)
        for j in range(epochs):
            random.shuffle(training_data)
            # Select subsets of the data for the parameter updates
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            # Print the number of correct classifications
            if test_data:
                print("Epoch {} : {} / {}".format(j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        """Update the parameters using backward propagation."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        # Gradient accumulators matching the parameter shapes of each layer
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x]  # the activation of each layer (after sigmoid); used in the parameter update
        zs = []  # the value of each layer after the linear computation
        # forward pass through each layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation) + b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # Backward pass: the next three lines compute the last layer's parameters
        delta = self.cost_derivative(activations[-1], y) * sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Starting from the second-to-last layer the computation repeats in a loop.
        # Bias and weight gradients are computed separately;
        # the weight gradient depends on the output of the previous layer.
        for layer in range(2, self.num_layers):
            z = zs[-layer]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-layer+1].transpose(), delta) * sp
            nabla_b[-layer] = delta
            nabla_w[-layer] = np.dot(delta, activations[-layer-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Count the number of correct classifications; argmax gives the position
        of the largest value in the output vector, which is the predicted class,
        since the target output is a 0/1 vector with exactly one 1."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        print(test_results)
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Derivative of the mean-squared-error cost: the difference between
        the computed output and the true value."""
        return (output_activations - y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
# Create the network object: three layers, with (784, 30, 10) nodes per layer
net = Network([784, 30, 10])
# Train the network (weights and biases) with (mini-batch) gradient descent
# and generate test results: 30 epochs, mini-batch size 10, learning rate 3.0
net.back(training_data, 30, 10, 3.0, test_data=test_data)
```

## Activation function

The activation function has a great impact on network performance.

function | advantages | disadvantages |
---|---|---|
step | Models a neuron well | 1. Not continuously differentiable, and the weight update in BP requires the derivative of the activation function; 2. Small input perturbations can have a huge impact on the output |
sigmoid | Widely used in early networks | Its output is always greater than zero, which slows model convergence and can make updates zigzag; it is also expensive to compute, and it suffers from vanishing gradients!!!!! |
tanh | Improved sigmoid; range is (-1, +1); derivative in (0, 1) | Expensive to compute; in deep enough networks the vanishing-gradient problem remains |
ReLU | Avoids the vanishing-gradient problem | A large gradient on some samples can push neurons into a "dead" state |
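The vanishing-gradient rows of the table can be checked numerically. The sketch below compares the derivatives of sigmoid, tanh, and ReLU at a few points (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z): return sigmoid(z) * (1.0 - sigmoid(z))  # peaks at 0.25
def d_tanh(z): return 1.0 - np.tanh(z) ** 2               # peaks at 1.0
def d_relu(z): return 1.0 if z > 0 else 0.0               # constant 1 for z > 0

for z in (0.5, 2.0, 5.0):
    print(f"z={z}: sigmoid'={d_sigmoid(z):.4f}  tanh'={d_tanh(z):.4f}  relu'={d_relu(z):.1f}")
```

At large |z| the sigmoid and tanh derivatives collapse toward zero, and multiplying such factors layer by layer during BP makes the gradient vanish; the ReLU derivative stays at 1 for positive inputs.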

# 9.1 tensorflow application

## data stream

Define the data: create Tensors and add Ops.

Perform the computation: execute the graph in a session. A session occupies system resources, which must be released after use:

```python
with tf.Session() as s:
    print(s.run(x))
    # print(z.eval())
```

For an interactive environment, use:

- sess = tf.InteractiveSession()

Code example:

```python
import tensorflow as tf

a = tf.constant(1)
b = tf.constant(2)
c = tf.constant(3)
x = tf.add(a, b)
y = tf.add(x, c)
z = tf.add(x, y)
# print(y.graph)
# print(x.graph)
with tf.Session() as s:
    print(s.run(x))
    # print(z.eval())
writer = tf.summary.FileWriter('./graph/', tf.get_default_graph())
writer.add_graph(s.graph)
writer.close()
```

Variables

- Note: before using a variable, it must be initialized: w.initializer, then sess.run(op).

You can also use tf.global_variables_initializer() to initialize all variables at once, as in the code below.

Variable initialization is essential, and the initialization operation must be run inside a session.

```python
import tensorflow as tf

w = tf.Variable([[1,2,4],[3,4,6]])
v = tf.Variable([[3,4],[1,2],[5,6]])
y = tf.Variable(5.0)
op = w.assign(w * 2)
u = tf.matmul(w, v)
t = tf.sigmoid(y)
x = tf.global_variables_initializer()  # initialize all variables at once
with tf.Session() as s:
    # s.run(w.initializer)
    # s.run(v.initializer)
    # s.run(y.initializer)
    s.run(x)
    s.run(op)
    print(w.value())
    print(w.eval())
    print(u.eval())
    print(s.run(t))
```

```python
import tensorflow as tf

x = tf.placeholder(tf.string)
u = tf.placeholder(tf.string)
y = tf.placeholder(tf.int32)
z = tf.placeholder(tf.float32)
t = tf.placeholder(tf.int32)
w = tf.Variable(1)
op = w.assign(y + t)
with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Hello World'})
    output1, o2, o3 = sess.run([u, y, z], feed_dict={u: 'Test String', y: '123', z: 45})
    # o3 = sess.run(u, feed_dict={u: 'Test String', y: 123.45, z: 45})
    print(output1, o2, output, o3)
    # print(o2 + o3)
    print(type(o3))
    print(tf.string_join([output, output]).eval())
    print(sess.run(op, {y: 1, t: 2}))
    # print(sess.run(op, {t: 2}))
    # result = sess.run(op)
    print(x.eval())  # error: a placeholder has no value unless it is fed
```

Weights, biases, derivatives and similar data are generally represented by Variables, because they need to be modified and persisted; training data, on the other hand, is fed through placeholders, which are not saved and act only as a transfer mechanism.

```python
import tensorflow as tf

c = tf.constant(value=1)
print(c.graph)
print(tf.get_default_graph())

# Create a new graph
g1 = tf.Graph()
print("g1:", g1)
with g1.as_default():
    d = tf.constant(value=2)
    print(d.graph)

# Create a new graph
g2 = tf.Graph()
print("g2:", g2)
g2.as_default()
e = tf.constant(value=15)
print(e.graph)

with g1.as_default():
    c1 = tf.constant([1.0])
with tf.Graph().as_default() as g2:
    c2 = tf.constant([2.0])

with tf.Session(graph=g1) as sess1:
    print(sess1.run(c1))
    print(sess1.run(c2))  # error: c2 belongs to g2, not g1
    # print(sess1.run(c))
with tf.Session(graph=g2) as sess2:
    print(sess2.run(c2))
writer = tf.summary.FileWriter('./graph/', tf.get_default_graph())
writer.add_graph(sess1.graph)
writer.close()
```

## TensorBoard

That's all for now! To be continued tomorrow.