## 1. Basic knowledge

### 1. The meaning of regression

Regression means fitting: the process of fitting a set of data points with, say, a straight line. Logistic regression is the process of finding the best-fit parameters of such a model, which is done with tools from optimization theory.

### 2. Common optimization algorithms in optimization theory

- Gradient descent and gradient ascent
- Stochastic gradient descent
- Batch gradient descent
- Mini-batch stochastic gradient descent
- Newton's method and quasi-Newton methods
- Conjugate gradient method
- Lagrange multiplier method
- Heuristic (intelligent) optimization algorithms

The heuristic family includes: artificial neural networks, simulated annealing, tabu search, particle swarm optimization, ant colony optimization, fish swarm algorithms, cuckoo search, genetic algorithms, immune algorithms, and evolutionary multi-objective algorithms.

### 3. The idea of logistic regression and its advantages and disadvantages

Logistic regression is mainly used for binary classification. It exploits the fact that the Sigmoid function maps any input into the interval [0, 1], so its output can be read as a probability and thresholded. In essence, fitting a logistic regression is maximum likelihood parameter estimation. Its advantages are low computational cost and ease of implementation and interpretation; its disadvantages are that it is prone to underfitting, so classification accuracy may be low.

### 4. The Sigmoid function

$$h_\theta(x)=g(\theta^Tx)$$

$$z=[\theta_0\ \theta_1\ \theta_2\ \dots\ \theta_n][x_0\ x_1\ x_2\ \dots\ x_n]^T=\theta^Tx$$

$$g(z)=\frac{1}{1+e^{-z}}$$

It follows that:

$$h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$$

(Figure: plot of the Sigmoid function, an S-shaped curve rising from 0 to 1.)

Here $\theta$ is the parameter column vector (the unknown quantity to be solved), and $x$ is the sample column vector (the given data). The function $g(z)$ maps any real number into the interval [0, 1], so every sample $x_0, x_1, x_2, \dots, x_n$, whether its raw value is greater than 1 or less than 0, can be mapped into [0, 1] for classification. $h_\theta(x)$ can then be interpreted as the probability that the input belongs to the positive class.

If we have an appropriate parameter column vector $\theta$ and a sample column vector $x$, the formula above yields a probability for classifying the sample $x$: if the probability is greater than 0.5, we call the sample positive; otherwise we call it negative.
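As a concrete illustration of this rule, here is a minimal sketch; the parameter vector `theta` and sample `x` below are made-up values purely for the example:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1)
    return 1.0 / (1 + np.exp(-z))

# Hypothetical parameter column vector and sample (illustrative values only)
theta = np.array([1.0, -0.5, 2.0])
x = np.array([1.0, 2.0, 0.8])    # x[0] = 1 is the intercept term

prob = sigmoid(theta @ x)        # h_theta(x) = g(theta^T x)
label = 1 if prob > 0.5 else 0   # apply the 0.5 threshold
print(prob, label)               # prob is about 0.83, so label is 1
```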

According to the characteristics of Sigmoid function, the following assumptions are made:

$$P(y=1\mid x;\theta)=h_\theta(x)$$

$$P(y=0\mid x;\theta)=1-h_\theta(x)$$

These are the conditional probabilities that a sample $x$ belongs to the positive class ($y=1$) or the negative class ($y=0$), given the sample $x$ and the parameters $\theta$. Ideally, each sample would be assigned probability 1 for its true class, meaning every classification is completely correct. In practice, the closer a sample's probability is to 1, the better the classification. For example, if the probability that a sample is positive is 0.99, the sample is almost certainly positive. The two probability formulas can be combined into one:

$$Cost(h_\theta(x),y)=h_\theta(x)^y\,(1-h_\theta(x))^{1-y}$$

Taking the logarithm of the whole expression:

$$Cost(h_\theta(x),y)=y\log h_\theta(x)+(1-y)\log(1-h_\theta(x))$$

Given a sample, this function measures the (log) probability of the class the sample belongs to, and the larger it is, the better, so we want its maximum. Now that we have a probability, maximum likelihood estimation (MLE) applies. Assuming the samples are independent of one another, the probability of generating the whole sample set is the product of the probabilities of generating each sample, so the log-likelihood is the sum:

$$J(\theta)=\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$$

where $m$ is the total number of samples, $y^{(i)}$ is the class of the $i$-th sample, and $x^{(i)}$ is the $i$-th sample.

To sum up, the $\theta$ that maximizes $J(\theta)$ is the model we need to solve for.
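To make the objective concrete, here is a small sketch (the toy dataset and both parameter vectors are made-up values for illustration) that evaluates the log-likelihood $J(\theta)$ as the sum of the per-sample costs above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def log_likelihood(theta, X, y):
    # J(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy dataset (illustrative values); the first column of ones is the intercept term
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.3]])
y = np.array([1, 0, 1])

# A theta that explains the labels better yields a larger (less negative) J(theta)
ll_zero = log_likelihood(np.array([0.0, 0.0]), X, y)  # coin-flip baseline
ll_fit = log_likelihood(np.array([0.0, 1.0]), X, y)
print(ll_zero, ll_fit)
```

Maximizing $J(\theta)$ means searching for the $\theta$ with the largest such value, which is exactly what gradient ascent does in the next section.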

### 5. The gradient ascent method

First, consider an example of finding a maximum:

$$f(x)=-x^2+4x$$

To find the extremum, first take the derivative of the function:

$$f'(x)=-2x+4$$

Setting the derivative to 0 gives $x=2$, where $f(x)$ attains its maximum value $f(2)=4$.

But real-world functions are rarely this simple; even when the derivative can be written down, solving for the extremum exactly may be difficult. In that case we can use an iterative method, like climbing a hill, approaching the extremum step by step. This style of searching for best-fit parameters is an optimization algorithm. The climbing step, expressed as a formula, is:

$$x_{i+1}=x_i+\alpha\frac{\partial f(x_i)}{\partial x_i}$$

where $\alpha$ is the step size, i.e. the learning rate, which controls the update magnitude. The effect is as follows: starting from (0, 0), the iteration moves through points $1 \to 2 \to 3 \to 4 \to \dots \to n$, stopping once $x$ approximates the maximizer of the function.

```python
def Gradient_Ascent_test():
    # Derivative of f(x) = -x^2 + 4x
    def f_prime(x_old):
        return -2 * x_old + 4
    # Initial value, chosen smaller than x_new
    x_old = -1
    # Starting point of the gradient ascent, i.e. start from (0, 0)
    x_new = 0
    # Step size, i.e. the learning rate, controls the update magnitude
    alpha = 0.01
    # Precision, i.e. the stopping threshold for updates
    precision = 0.00000001
    while abs(x_new - x_old) > precision:
        x_old = x_new
        # The update formula given above
        x_new = x_old + alpha * f_prime(x_old)
    # Print the approximation of the extremum point
    print(x_new)

if __name__ == '__main__':
    Gradient_Ascent_test()
```

This procedure is the gradient ascent algorithm. The extremum of $J(\theta)$ can be found the same way, with the update formula:

$$\theta_j:=\theta_j+\alpha\frac{\partial J(\theta)}{\partial\theta_j}$$

The Sigmoid function is $h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$. Now only the gradient of $J(\theta)$ is needed, and then the gradient ascent algorithm can be used to find the maximum. By the chain rule:

$$\frac{\partial J(\theta)}{\partial\theta_j}=\frac{\partial J(\theta)}{\partial g(\theta^Tx)}\cdot\frac{\partial g(\theta^Tx)}{\partial\theta^Tx}\cdot\frac{\partial\theta^Tx}{\partial\theta_j}$$

For the first factor:

$$\frac{\partial J(\theta)}{\partial g(\theta^Tx)}=y\cdot\frac{1}{g(\theta^Tx)}+(y-1)\cdot\frac{1}{1-g(\theta^Tx)}$$

For the second factor, note that:

$$g'(z)=\frac{d}{dz}\frac{1}{1+e^{-z}}=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)=g(z)(1-g(z))$$
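The identity $g'(z)=g(z)(1-g(z))$ is easy to sanity-check numerically. The sketch below (the test point `z = 0.7` is arbitrary) compares the closed form against a central finite-difference estimate of the derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

z = 0.7      # arbitrary test point
eps = 1e-6   # finite-difference step

# Closed form: g(z) * (1 - g(z))
analytic = sigmoid(z) * (1 - sigmoid(z))
# Central difference approximation of g'(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(analytic, numeric)  # the two values agree very closely
```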

It follows that:

$$\frac{\partial g(\theta^Tx)}{\partial\theta^Tx}=g(\theta^Tx)\left(1-g(\theta^Tx)\right)$$

And for the third factor:

$$\frac{\partial\theta^Tx}{\partial\theta_j}=\frac{\partial(\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n)}{\partial\theta_j}=x_j$$

Putting the three factors together:

$$\frac{\partial J(\theta)}{\partial\theta_j}=(y-h_\theta(x))x_j$$

Summing over all $m$ samples, the gradient ascent update for each parameter becomes:

$$\theta_j:=\theta_j+\alpha\sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$$
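As a sanity check on the derivation, the sketch below (a made-up toy dataset, purely for illustration) compares the analytic gradient $\sum_i (y^{(i)}-h_\theta(x^{(i)}))\,x_j^{(i)}$, i.e. $X^T(y-h)$ in matrix form, against a finite-difference gradient of the log-likelihood:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def J(theta, X, y):
    # Log-likelihood of the whole sample set
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data (illustrative values); the first column of ones is the intercept
X = np.array([[1.0, 0.4], [1.0, -0.9], [1.0, 1.7]])
y = np.array([1, 0, 1])
theta = np.array([0.1, 0.3])

# Analytic gradient: X^T (y - h)
analytic = X.T @ (y - sigmoid(X @ theta))

# Finite-difference gradient, one component at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    d = np.zeros_like(theta)
    d[j] = eps
    numeric[j] = (J(theta + d, X, y) - J(theta - d, X, y)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)  # should be very small
```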

## 2. Implementation

### 1. Displaying the data

The first column of the data (X1) is used as the x-axis value and the second column (X2) as the y-axis value; the last column is the classification label. The points are then plotted according to their labels.

```python
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet():
    # Create data list
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Read line by line
    for line in fr.readlines():
        # Strip the newline and split into a list
        lineArr = line.strip().split()
        # Add the data (with a leading 1.0 for the intercept term)
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        # Add the label
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    return dataMat, labelMat

def plotDataSet():
    # Load the dataset
    dataMat, labelMat = loadDataSet()
    # Convert to a numpy array
    dataArr = np.array(dataMat)
    # Number of samples
    n = np.shape(dataMat)[0]
    # Positive samples
    xcord1 = []; ycord1 = []
    # Negative samples
    xcord2 = []; ycord2 = []
    # Split by dataset label
    for i in range(n):
        if int(labelMat[i]) == 1:
            # 1 is a positive sample
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            # 0 is a negative sample
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    # Add subplot
    ax = fig.add_subplot(111)
    # Draw positive samples
    ax.scatter(xcord1, ycord1, s=20, c='red', marker='s', alpha=.5)
    # Draw negative samples
    ax.scatter(xcord2, ycord2, s=20, c='green', alpha=.5)
    # Title and axis labels
    plt.title('DataSet')
    plt.xlabel('x')
    plt.ylabel('y')
    # Display
    plt.show()

if __name__ == '__main__':
    plotDataSet()
    dataMat, labelMat = loadDataSet()
    print(dataMat, labelMat)
```

The data distribution can be seen in the figure above. If the input to the Sigmoid function is denoted $z$, then $z=w_0x_0+w_1x_1+w_2x_2$ can separate the data. Here $x_0$ is a vector of ones, $x_1$ is the first column of the dataset, and $x_2$ is the second column. Setting $z=0$ gives the boundary $0=w_0+w_1x_1+w_2x_2$, with $x_1$ on the horizontal axis and $x_2$ on the vertical axis. The unknown parameters of this equation, $w_0$, $w_1$, $w_2$, are the regression coefficients (optimal parameters) we need to find.

### 2. Gradient ascent training algorithm

According to the iterative formula of gradient ascent, written in vectorized form:

$$\theta:=\theta+\alpha X^T\left(Y-g(X\theta)\right)$$

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Time  : 2020/3/16 9:51
# @Author: fangyuan
# @File  : logistic regression gradient ascent.py

import numpy as np

def loadDataSet():
    # Create data list
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Read line by line
    for line in fr.readlines():
        # Strip the newline and split into a list
        lineArr = line.strip().split()
        # Add the data (with a leading 1.0 for the intercept term)
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        # Add the label
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    # Convert to a numpy matrix
    dataMatrix = np.mat(dataMatIn)
    # Convert to a numpy matrix and transpose
    labelMat = np.mat(classLabels).transpose()
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m, n = np.shape(dataMatrix)
    # Step size, i.e. the learning rate, controls the update magnitude
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n, 1))
    # Vectorized gradient ascent formula
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error
    # Convert the matrix back to an array and return the weights (optimal parameters)
    return weights.getA()

if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    print(gradAscent(dataMat, labelMat))
```

The regression coefficients $[w_0, w_1, w_2]$ have now been solved for. With these parameters we can determine the separating line between the two classes and draw the decision boundary.

### 3. Drawing the decision boundary

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Time  : 2020/3/16 10:06
# @Author: fangyuan
# @File  : logistic regression drawing decision boundary.py

import numpy as np
import matplotlib.pyplot as plt

def loadDataSet():
    # Create data list
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Read line by line
    for line in fr.readlines():
        # Strip the newline and split into a list
        lineArr = line.strip().split()
        # Add the data (with a leading 1.0 for the intercept term)
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        # Add the label
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    # Convert to a numpy matrix
    dataMatrix = np.mat(dataMatIn)
    # Convert to a numpy matrix and transpose
    labelMat = np.mat(classLabels).transpose()
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m, n = np.shape(dataMatrix)
    # Step size, i.e. the learning rate, controls the update magnitude
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n, 1))
    # Vectorized gradient ascent formula
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error
    # Convert the matrix back to an array and return the weights (optimal parameters)
    return weights.getA()

def plotBestFit(weights):
    # Load the dataset
    dataMat, labelMat = loadDataSet()
    # Convert to a numpy array
    dataArr = np.array(dataMat)
    # Number of samples
    n = np.shape(dataMat)[0]
    # Positive samples
    xcord1 = []; ycord1 = []
    # Negative samples
    xcord2 = []; ycord2 = []
    # Split by dataset label
    for i in range(n):
        # 1 is a positive sample
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        # 0 is a negative sample
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    # Add subplot
    ax = fig.add_subplot(111)
    # Draw positive samples
    ax.scatter(xcord1, ycord1, s=20, c='red', marker='s', alpha=.5)
    # Draw negative samples
    ax.scatter(xcord2, ycord2, s=20, c='green', alpha=.5)
    # Decision boundary: w0 + w1*x + w2*y = 0, so y = (-w0 - w1*x) / w2
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    # Title and axis labels
    plt.title('BestFit')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    weights = gradAscent(dataMat, labelMat)
    print(weights)
    plotBestFit(weights)
```

The purpose of logistic regression is to find the best-fit parameters of a nonlinear function, the Sigmoid, and the solving process is carried out by an optimization algorithm.