# Machine Learning Notes 8: Logistic Regression Basics

## 1. Basic Knowledge

### 1. The Meaning of Regression

Regression means finding the best fit: the process of fitting a set of points with a straight line. Logistic regression is the process of finding the best-fitting parameters, using optimization theory.

### 2. Common Optimization Algorithms

• Mini-batch stochastic gradient descent
• Newton's method and quasi-Newton methods
• Lagrange multiplier method
• Heuristic (intelligent) optimization algorithms: artificial neural networks, simulated annealing, tabu search, particle swarm optimization, ant colony optimization, artificial fish swarm, cuckoo search, genetic algorithms, immune algorithms, evolutionary multi-objective algorithms

### 3. Logistic Regression

Logistic regression is mainly used for binary classification. It exploits the fact that the Sigmoid function maps any real number into the interval (0, 1). The essence of logistic regression is maximum likelihood parameter estimation.

### 4. The Sigmoid Function

$$h_\theta(x)=g(\theta^Tx)$$
$$z=[\theta_0\ \theta_1\ \theta_2\ ...\ \theta_n][x_0\ x_1\ x_2\ ...\ x_n]^T=\theta^Tx$$
$$g(z)={1\over{1+e^{-z}}}$$
It follows that:
$$h_\theta(x)=g(\theta^Tx)={1\over{1+e^{-\theta^Tx}}}$$
In the Sigmoid function, $\theta$ is the parameter column vector (the unknown quantity to be solved) and $x$ is the sample column vector (from the given data set). $g(z)$ maps any real number into (0, 1), so every sample, no matter how large or small its feature values, is mapped into (0, 1) and can be interpreted as a probability. $h_\theta(x)$ then gives the probability that sample $x$ belongs to the positive class.
If we have an appropriate parameter column vector $\theta$ and a sample column vector $x$, we can compute a classification probability for $x$ with the formula above. If that probability is greater than 0.5, we say the sample is positive; otherwise it is negative.
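As a quick sketch of this rule (the parameter vector and sample below are made-up numbers, not from any real data set):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameter column vector theta; x[0] = 1 absorbs the intercept
theta = np.array([1.0, -2.0, 0.5])
x = np.array([1.0, 0.3, 4.0])

# Probability that sample x is positive: h_theta(x) = g(theta^T x)
p = sigmoid(theta @ x)
label = 1 if p > 0.5 else 0
print(p, label)  # p is about 0.92, so the sample is classified as positive
```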
According to the characteristics of Sigmoid function, the following assumptions are made:
$$P(y=1|x;\theta)=h_\theta(x)$$
$$P(y=0|x;\theta)=1-h_\theta(x)$$
These are the conditional probabilities that sample $x$ belongs to the positive class ($y=1$) or the negative class ($y=0$), given $x$ and the parameters $\theta$. In an ideal state, each sample's probability would be 1, meaning every classification is completely correct. In practice, the closer a sample's probability is to 1, the better the classification. For example, if the probability that a sample is positive is 0.99, we are quite confident it is a positive sample. The two probability formulas can be combined into one:
$$Cost(h_\theta(x),y)=h_\theta(x)^y(1-h_\theta(x))^{1-y}$$
Taking the logarithm of the whole expression:
$$Cost(h_\theta(x),y)=y\log{h_\theta(x)}+(1-y)\log{(1-h_\theta(x))}$$
Given a sample, this function measures the probability of the class the sample belongs to, and the larger that probability the better, so we want the maximum of this function. Now that the probability is written down, it's time for maximum likelihood estimation (MLE). Assuming the samples are independent of one another, the probability of generating the whole sample set is the product of the probabilities of generating each sample:
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log{h_\theta(x^{(i)})}+(1-y^{(i)})\log{(1-h_\theta(x^{(i)}))}\right]$$
where $m$ is the total number of samples, $y^{(i)}$ is the class of the $i$-th sample, and $x^{(i)}$ is the $i$-th sample.
To sum up, the $\theta$ that maximizes $J(\theta)$ is the model we need to solve for.
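A minimal sketch of evaluating $J(\theta)$ on toy data (the matrix X, labels y, and both θ vectors below are invented for illustration); a θ that classifies the samples better yields a larger log-likelihood:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # J(theta) = (1/m) * sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data set: the first column of 1s absorbs the intercept
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])

weak = np.array([0.1, 0.8])    # a rough guess
strong = np.array([0.0, 5.0])  # separates these samples almost perfectly

print(log_likelihood(weak, X, y), log_likelihood(strong, X, y))
```

Both values are negative (logs of probabilities below 1), and the better-separating parameters give a value closer to zero.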

First, an example of finding a maximum: $f(x)=-x^2+4x$
To find the extremum, first take the derivative of the function: $f'(x)=-2x+4$. Setting the derivative to 0 gives $x=2$, where $f(x)$ attains its maximum value $f(2)=4$.
But real-world functions are rarely this simple; even when the derivative can be computed, it is often hard to solve for the extremum exactly. In that case we can use an iterative method, like climbing a hill, approaching the extremum point by point. This way of finding the best-fitting parameters is an optimization algorithm. The climbing step is written mathematically as:
$$x_{i+1}=x_i+\alpha\frac{\partial f(x_i)}{\partial x_i}$$
where $\alpha$ is the step size, i.e. the learning rate, which controls the update magnitude. For example, starting from (0, 0), the iteration proceeds $x_1 \to x_2 \to x_3 \to ... \to x_n$, stopping once $x$ approximates the maximizer of the function.

```python
def Gradient_Ascent_test():
    # Derivative of f(x) = -x^2 + 4x
    def f_prime(x_old):
        return -2 * x_old + 4
    # Initial value; it just needs to differ from x_new so the loop starts
    x_old = -1
    # Starting point of the gradient ascent, i.e. start from x = 0
    x_new = 0
    # Step size, i.e. learning rate; controls the magnitude of updates
    alpha = 0.01
    # Precision, i.e. the update threshold for stopping
    precision = 0.00000001
    while abs(x_new - x_old) > precision:
        x_old = x_new
        # The update formula given above
        x_new = x_old + alpha * f_prime(x_old)
    # Print the approximation of the extremum finally obtained
    print(x_new)

if __name__ == '__main__':
    Gradient_Ascent_test()
```

This procedure is the gradient ascent algorithm. The extremum of $J(\theta)$ can be found in the same way, with the update formula:
$$\theta_j:=\theta_j+\alpha\frac{\partial J(\theta)}{\partial\theta_j}$$
The Sigmoid function is $h_\theta(x)=g(\theta^Tx)={1\over{1+e^{-\theta^Tx}}}$. All that remains is to compute $\frac{\partial J(\theta)}{\partial\theta_j}$, and then gradient ascent can be used to find the maximum. By the chain rule:
$$\frac{\partial J(\theta)}{\partial\theta_j}=\frac{\partial J(\theta)}{\partial g(\theta^Tx)}*\frac{\partial g(\theta^Tx)}{\partial \theta^Tx}*\frac{\partial \theta^Tx}{\partial \theta_j}$$
For the first factor:
$$\frac{\partial J(\theta)}{\partial g(\theta^Tx)}=y*\frac{1}{g(\theta^Tx)}+(y-1)*\frac{1}{1-g(\theta^Tx)}$$
Then:
$$g'(z)=\frac{d}{dz}\frac{1}{1+e^{-z}}=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)=g(z)(1-g(z))$$
we obtain:
$$\frac{\partial g(\theta^Tx)}{\partial \theta^Tx}=g(\theta^Tx)(1-g(\theta^Tx))$$
And for the third factor:
$$\frac{\partial \theta^Tx}{\partial \theta_j}=\frac{\partial (\theta_1x_1+\theta_2x_2+...+\theta_nx_n)}{\partial \theta_j}=x_j$$
In summary:
$$\frac{\partial J(\theta)}{\partial\theta_j}=(y-h_\theta(x))x_j$$
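The result $\frac{\partial J(\theta)}{\partial\theta_j}=(y-h_\theta(x))x_j$ can be sanity-checked against a numerical finite-difference gradient of the per-sample log-likelihood; a small sketch with made-up numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample with a made-up label and parameter vector
x = np.array([1.0, 2.0, -0.5])
y = 1.0
theta = np.array([0.3, -0.2, 0.7])

def cost(theta):
    # Per-sample log-likelihood term: y*log(h) + (1-y)*log(1-h)
    h = sigmoid(theta @ x)
    return y * np.log(h) + (1 - y) * np.log(1 - h)

# Analytic gradient from the derivation above
analytic = (y - sigmoid(theta @ x)) * x

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[j] += eps
    t_minus[j] -= eps
    numeric[j] = (cost(t_plus) - cost(t_minus)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very close to zero
```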

## 2. Function Implementation

### 1. Displaying the Data

The first column of the data (X1) is treated as the value on the x-axis and the second column (X2) as the value on the y-axis; the last column is the classification label. The points are plotted in different colors according to their labels.

```python
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet():
    # Create dataset
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Strip newlines and split each line into a list
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    # Return
    return dataMat, labelMat

def plotDataSet():
    dataMat, labelMat = loadDataSet()
    # Convert to a numpy array
    dataArr = np.array(dataMat)
    # Number of data points
    n = np.shape(dataMat)[0]
    # Positive samples
    xcord1 = []
    ycord1 = []
    # Negative samples
    xcord2 = []
    ycord2 = []
    # Split the data set by label
    for i in range(n):
        if int(labelMat[i]) == 1:
            # 1 is a positive sample
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            # 0 is a negative sample
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Draw positive samples
    ax.scatter(xcord1, ycord1, s=20, c='red', marker='s', alpha=.5)
    # Draw negative samples
    ax.scatter(xcord2, ycord2, s=20, c='green', alpha=.5)
    # Title
    plt.title('DataSet')
    # Axis labels
    plt.xlabel('x')
    plt.ylabel('y')
    # Display
    plt.show()

if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    print(dataMat, labelMat)
    plotDataSet()
```


The data distribution can be seen from the figure above. If the input of the Sigmoid function is denoted z, then z = w0x0 + w1x1 + w2x2 can divide the data, where x0 is a vector of 1s, x1 is the first column of the data set, and x2 is the second column. Setting z = 0 gives 0 = w0 + w1x1 + w2x2, a line with abscissa x1 and ordinate x2. The unknown parameters of this equation, w0, w1, w2, are exactly the regression coefficients (optimal parameters) we need to find.
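Setting z = 0 and solving for x2 gives the boundary line x2 = (−w0 − w1·x1)/w2; a quick check with hypothetical weights (the values below are made up):

```python
import numpy as np

# Hypothetical regression coefficients [w0, w1, w2]
weights = np.array([4.0, 0.5, -0.6])

x1 = np.arange(-3.0, 3.0, 0.1)
# On the boundary: 0 = w0 + w1*x1 + w2*x2  =>  x2 = (-w0 - w1*x1) / w2
x2 = (-weights[0] - weights[1] * x1) / weights[2]

# Every boundary point should satisfy z = 0 (i.e. h = 0.5)
z = weights[0] + weights[1] * x1 + weights[2] * x2
print(np.max(np.abs(z)))  # essentially zero up to floating-point error
```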

### 2. Gradient Ascent Training Algorithm

According to the gradient ascent iteration formula, vectorized over the whole data set:
$$\theta:=\theta+\alpha X^T(Y-g(X\theta))$$
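The vectorized formula is just the per-component gradient $\sum_i (y_i-h_i)x_{ij}$ computed for all j at once; a small sketch verifying the two agree (X, Y, and θ below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up design matrix X (m samples, n features), labels Y, parameters theta
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 3.0]])
Y = np.array([1.0, 0.0, 1.0, 1.0])
theta = np.array([0.2, -0.4])

# Vectorized gradient: X^T (Y - g(X theta))
vectorized = X.T @ (Y - sigmoid(X @ theta))

# Component-by-component sum over the samples
h = sigmoid(X @ theta)
loop = np.array([sum((Y[i] - h[i]) * X[i, j] for i in range(len(Y)))
                 for j in range(X.shape[1])])

print(vectorized, loop)  # the two gradients coincide
```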

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
#@Time  : 2020/3/16 9:51
#@Author: fangyuan

import numpy as np

def loadDataSet():
    # Create dataset
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Strip newlines and split each line into a list
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    # Return
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    # Convert to numpy's mat
    dataMatrix = np.mat(dataMatIn)
    # Convert to numpy's mat and transpose
    labelMat = np.mat(classLabels).transpose()
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m, n = np.shape(dataMatrix)
    # Step size, i.e. the learning rate; controls the magnitude of updates
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n, 1))
    # Vectorized gradient ascent
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error
    # Convert matrix to array, return the weight array (optimal parameters)
    return weights.getA()

if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    print(gradAscent(dataMat, labelMat))
```

The regression coefficients [w0, w1, w2] are now solved. With these parameters we can determine the separating line between the two classes and draw the decision boundary.

### 3. Drawing the Decision Boundary

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
#@Time  : 2020/3/16 10:06
#@Author: fangyuan
#@File: logistic regression drawing decision boundary.py

import numpy as np
import matplotlib.pyplot as plt

def loadDataSet():
    # Create dataset
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Strip newlines and split each line into a list
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    # Return
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    # Convert to numpy's mat
    dataMatrix = np.mat(dataMatIn)
    # Convert to numpy's mat and transpose
    labelMat = np.mat(classLabels).transpose()
    # Size of dataMatrix: m is the number of rows, n the number of columns
    m, n = np.shape(dataMatrix)
    # Step size, i.e. the learning rate; controls the magnitude of updates
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n, 1))
    # Vectorized gradient ascent
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error
    # Convert matrix to array, return the weight array (optimal parameters)
    return weights.getA()

def plotBestFit(weights):
    dataMat, labelMat = loadDataSet()
    # Convert to a numpy array
    dataArr = np.array(dataMat)
    # Number of data points
    n = np.shape(dataMat)[0]
    # Positive samples
    xcord1 = []
    ycord1 = []
    # Negative samples
    xcord2 = []
    ycord2 = []
    # Split the data set by label
    for i in range(n):
        # 1 is a positive sample
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        # 0 is a negative sample
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Draw positive samples
    ax.scatter(xcord1, ycord1, s=20, c='red', marker='s', alpha=.5)
    # Draw negative samples
    ax.scatter(xcord2, ycord2, s=20, c='green', alpha=.5)
    x = np.arange(-3.0, 3.0, 0.1)
    # Decision boundary: 0 = w0 + w1*x1 + w2*x2, solved for x2
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    # Title
    plt.title('BestFit')
    # Axis labels
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    weights = gradAscent(dataMat, labelMat)
    plotBestFit(weights)
```

The purpose of logistic regression is to find the best-fitting parameters of the nonlinear Sigmoid function, and the solution process can be completed by an optimization algorithm.
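Since the 'Logistic' data file used above isn't reproduced here, the training loop can be sanity-checked end to end on synthetic, linearly separable data (the data generation below is invented for illustration; sigmoid and gradAscent follow the section's code):

```python
import numpy as np

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.mat(dataMatIn)
    labelMat = np.mat(classLabels).transpose()
    m, n = np.shape(dataMatrix)
    alpha = 0.001
    maxCycles = 500
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights.getA()

# Synthetic data: class 1 when x1 + x2 > 0, class 0 otherwise
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2))
labels = (pts[:, 0] + pts[:, 1] > 0).astype(int).tolist()
data = [[1.0, p[0], p[1]] for p in pts]

w = gradAscent(data, labels)
preds = (sigmoid(np.array(data) @ w).ravel() > 0.5).astype(int)
accuracy = np.mean(preds == np.array(labels))
print(accuracy)  # high accuracy on separable data
```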