Machine learning notes 8: Logistic regression basics

1, Basic knowledge

1 The meaning of regression

Regression means finding the best fit: the process of fitting a set of points with a straight line. Logistic regression, in turn, is the process of finding the best-fit parameters of a classifier, and this is done with optimization theory.

2 Common optimization algorithms in optimization theory

  • Gradient descent and gradient ascent
  • Stochastic gradient descent
  • Batch gradient descent
  • Mini-batch stochastic gradient descent
  • Newton's method and quasi-Newton methods
  • Conjugate gradient method
  • Lagrange multiplier method
  • Heuristic (intelligent) optimization algorithms
    Artificial neural networks, simulated annealing, tabu search, particle swarm optimization, ant colony optimization, artificial fish swarm, cuckoo search, genetic algorithms, immune algorithms, evolutionary multi-objective algorithms

3 The idea of logistic regression and its advantages and disadvantages

Logistic regression is mainly used for binary classification. It relies on the fact that the Sigmoid function maps any real input into the interval [0, 1], so its output can be read as a probability and thresholded. In essence, logistic regression is maximum likelihood parameter estimation.

4 Sigmoid function

$h_\theta(x)=g(\theta^Tx)$
$z=[\theta_0\ \theta_1\ \theta_2\ ...\ \theta_n][x_0\ x_1\ x_2\ ...\ x_n]^T=\theta^Tx$
$g(z)=\dfrac{1}{1+e^{-z}}$
It follows that: $h_\theta(x)=g(\theta^Tx)=\dfrac{1}{1+e^{-\theta^Tx}}$
Here $\theta$ is the parameter column vector (the unknown quantity to be solved) and $x$ is the sample column vector (the given data). The function $g(z)$ maps any real number into the interval $[0,1]$, so whatever value $\theta^Tx$ takes, whether large or small, it can be mapped into $[0,1]$ and interpreted as a probability for classification; $h_\theta(x)$ is the probability that sample $x$ belongs to the positive class.
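To see this mapping, here is a minimal sketch (not part of the article's original code) that plots g(z) over a range of inputs; the dashed line marks the 0.5 threshold used below for classification.

import matplotlib.pyplot as plt
import numpy as np

# Plot g(z) = 1 / (1 + exp(-z)) to show that every real z is mapped into (0, 1)
z = np.arange(-10.0, 10.0, 0.1)
g = 1.0 / (1.0 + np.exp(-z))

plt.plot(z, g)
plt.axhline(0.5, linestyle='--')   # the 0.5 decision threshold
plt.title('Sigmoid function')
plt.xlabel('z')
plt.ylabel('g(z)')
plt.show()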
If we have an appropriate parameter column vector $\theta$ and a sample column vector $x$, the formula above gives a probability that can be used to classify $x$: if the probability is greater than 0.5, we call the sample positive; otherwise we call it negative.
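As a minimal sketch of this decision rule, with a made-up parameter vector and sample (purely illustrative values, not learned from any data):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameter column vector theta and sample x (x_0 = 1 is the intercept term)
theta = np.array([-1.0, 0.8, 0.4])
x = np.array([1.0, 2.0, 1.5])

prob = sigmoid(theta @ x)          # h_theta(x) = g(theta^T x), here about 0.77
label = 1 if prob > 0.5 else 0     # greater than 0.5 -> positive sample
print(prob, label)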
Based on the properties of the Sigmoid function, we make the following assumptions:
$P(y=1|x;\theta)=h_\theta(x)$
$P(y=0|x;\theta)=1-h_\theta(x)$
These are the conditional probabilities that sample $x$ is a positive sample ($y=1$) or a negative sample ($y=0$), given the sample $x$ and the parameters $\theta$. In the ideal case every point would be assigned probability 1 for its true class, i.e. the classification would be completely correct. In practice, the closer a sample's probability for its true class is to 1, the better the classification; for example, if the probability that a sample is positive is 0.99, we can be quite confident it is positive. The two probability formulas above can be combined into one:
$Cost(h_\theta(x),y)=h_\theta(x)^y\,(1-h_\theta(x))^{1-y}$
(when $y=1$ this reduces to $h_\theta(x)$, and when $y=0$ to $1-h_\theta(x)$). Taking the logarithm of the whole expression:
$Cost(h_\theta(x),y)=y\log h_\theta(x)+(1-y)\log(1-h_\theta(x))$
Given a sample, this function measures the probability of the class the sample belongs to, and the larger it is the better, so we want its maximum. With the probabilities written down, maximum likelihood estimation follows. Assuming the samples are independent of each other, the probability of generating the whole sample set is the product of the probabilities of generating each sample, and taking the logarithm gives:
$J(\theta)=\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$
where $m$ is the total number of samples, $y^{(i)}$ is the class of the $i$-th sample, and $x^{(i)}$ is the $i$-th sample.
In summary, the $\theta$ that maximizes $J(\theta)$ is the model we need to solve for.
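To make the objective concrete, the following sketch evaluates J(θ) on a small made-up data set (illustrative values, not the data set used later) for two candidate parameter vectors; the θ that explains the labels better gets the larger log-likelihood:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(theta, X, y):
    # Log-likelihood of the whole sample set: sum over samples of y*log(h) + (1-y)*log(1-h)
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Made-up data: three samples, first column is the constant feature x_0 = 1
X = np.array([[1.0, 0.5],
              [1.0, -1.5],
              [1.0, 2.0]])
y = np.array([1, 0, 1])

print(J(np.array([0.0, 1.0]), X, y))    # roughly -0.8
print(J(np.array([0.0, -1.0]), X, y))   # roughly -4.8, a much worse fit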

5 Gradient ascent method

First, consider a simple maximization example: $f(x)=-x^2+4x$
To find the extremum, take the derivative of the function: $f'(x)=-2x+4$. Setting the derivative to 0 gives $x=2$, where $f(x)$ reaches its maximum.
But in a real setting the function is rarely this simple; even when the derivative can be written down, it may be hard to solve for the extremum exactly. In that case we can use an iterative method, like climbing a hill and approaching the peak step by step. This way of searching for the best-fit parameters is an optimization algorithm. The climbing step is written mathematically as $x_{i+1}=x_i+\alpha\frac{\partial f(x_i)}{\partial x_i}$, where $\alpha$ is the step size, i.e. the learning rate, which controls the magnitude of each update. For example, starting from $(0,0)$, the iterations move point by point ($1\rightarrow2\rightarrow3\rightarrow...\rightarrow n$) toward the peak and stop once $x$ is a good approximation of the maximizer.

def Gradient_Ascent_test():
    # Derivative of f(x)
    def f_prime(x_old):
        return -2 * x_old +4
    # Initial value; any value different from x_new, so the loop starts
    x_old = -1
    # Initial value of gradient rise algorithm, i.e. starting from (0,0)
    x_new = 0
    # Step size, i.e. the learning rate, controls the magnitude of each update
    alpha = 0.01
    # Precision, i.e. the stopping threshold
    precision = 0.00000001
    while abs(x_new - x_old) > precision:
        x_old = x_new
        # The formula mentioned above
        x_new = x_old + alpha * f_prime(x_old)
    # Print the approximate location of the maximum
    print(x_new)

if __name__ == '__main__':
    Gradient_Ascent_test()

Running this prints a value close to 2, which matches the analytic solution obtained by setting the derivative to zero. This process is the gradient ascent algorithm. In the same way, the maximum of $J(\theta)$ can be found, with the update rule written as: $\theta_j:=\theta_j+\alpha\frac{\partial J(\theta)}{\partial\theta_j}$
Recall the Sigmoid form $h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}$. All that remains is the partial derivative of $J(\theta)$; then gradient ascent can be applied to find the maximum. By the chain rule:
$\dfrac{\partial J(\theta)}{\partial\theta_j}=\dfrac{\partial J(\theta)}{\partial g(\theta^Tx)}\cdot\dfrac{\partial g(\theta^Tx)}{\partial\theta^Tx}\cdot\dfrac{\partial\theta^Tx}{\partial\theta_j}$
The first factor:
$\dfrac{\partial J(\theta)}{\partial g(\theta^Tx)}=y\cdot\dfrac{1}{g(\theta^Tx)}+(y-1)\cdot\dfrac{1}{1-g(\theta^Tx)}$
For the second factor, the derivative of the Sigmoid function is:
$g'(z)=\dfrac{d}{dz}\dfrac{1}{1+e^{-z}}=\dfrac{e^{-z}}{(1+e^{-z})^2}=\dfrac{1}{1+e^{-z}}\left(1-\dfrac{1}{1+e^{-z}}\right)=g(z)(1-g(z))$
so:
$\dfrac{\partial g(\theta^Tx)}{\partial\theta^Tx}=g(\theta^Tx)\left(1-g(\theta^Tx)\right)$
The third factor:
$\dfrac{\partial\theta^Tx}{\partial\theta_j}=\dfrac{\partial(\theta_1x_1+\theta_2x_2+...+\theta_nx_n)}{\partial\theta_j}=x_j$
Putting the three factors together:
$\dfrac{\partial J(\theta)}{\partial\theta_j}=(y-h_\theta(x))x_j$
Summing over all $m$ samples, the gradient ascent update for each parameter is:
$\theta_j:=\theta_j+\alpha\sum_{i=1}^{m}\left(y^{(i)}-h_\theta(x^{(i)})\right)x_j^{(i)}$
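This result can be sanity-checked numerically. The sketch below (again with made-up data, purely for illustration) compares the analytic gradient against a central finite difference of the log-likelihood; the two printed values should agree to several decimal places:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    # X^T (y - h), i.e. the per-parameter sums derived above
    return X.T @ (y - sigmoid(X @ theta))

# Made-up data: four samples, three features (first column is the constant 1)
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -1.0, 0.3],
              [1.0, 2.0, -0.7],
              [1.0, 0.1, 0.9]])
y = np.array([1, 0, 1, 0])
theta = np.array([0.1, -0.2, 0.3])

# Central finite difference for one component theta_j
j, eps = 1, 1e-6
t_plus, t_minus = theta.copy(), theta.copy()
t_plus[j] += eps
t_minus[j] -= eps
numeric = (log_likelihood(t_plus, X, y) - log_likelihood(t_minus, X, y)) / (2 * eps)
print(numeric, analytic_gradient(theta, X, y)[j])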

2, Function implementation

1 Display the data

The first column of the data (X1) is taken as the value on the x-axis, the second column (X2) as the value on the y-axis, and the last column is the class label. The points are plotted with different markers according to their labels.

import matplotlib.pyplot as plt
import numpy as np

def loadDataSet():
    # Create dataset
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Read by line
    for line in fr.readlines():
        # Strip the newline and split the line into fields
        lineArr = line.strip().split()
        # Add data
        dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
        # Add tags
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    # Return
    return dataMat,labelMat

def plotDataSet():
    # Load dataset
    dataMat,labelMat = loadDataSet()
    # Convert to a NumPy array
    dataArr = np.array(dataMat)
    # Number of data
    n = np.shape(dataMat)[0]
    # Positive samples
    xcord1 = []
    ycord1 = []
    # Negative sample
    xcord2 = []
    ycord2 = []
    # Classify by dataset label
    for i in range(n):
        if int(labelMat[i]) == 1:
            # 1 is a positive sample
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        else:
            # 0 is a negative sample
            xcord2.append(dataArr[i,1])
            ycord2.append(dataArr[i,2])
    fig = plt.figure()
    # Add subplot
    ax = fig.add_subplot(111)
    # Draw positive sample
    ax.scatter(xcord1,ycord1,s = 20,c = 'red',marker = 's',alpha=.5)
    # Draw negative sample
    ax.scatter(xcord2,ycord2,s = 20,c = 'green',alpha=.5)
    # Drawing title
    plt.title('DataSet')
    # Drawing label
    plt.xlabel('x')
    plt.ylabel('y')
    # display
    plt.show()

if __name__ == '__main__':
    plotDataSet()
    dataMat,labelMat = loadDataSet()
    print(dataMat,labelMat)

The figure above shows how the data are distributed. If the input to the Sigmoid function is denoted $z$, then $z=w_0x_0+w_1x_1+w_2x_2$ can separate the data, where $x_0$ is a vector of 1s, $x_1$ is the first column of the data set, and $x_2$ is the second column. Setting $z=0$ gives the boundary $0=w_0+w_1x_1+w_2x_2$, with $x_1$ on the horizontal axis and $x_2$ on the vertical axis. The unknowns of this equation, $w_0$, $w_1$, $w_2$, are the regression coefficients (the optimal parameters) we need to find.
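Rearranging the boundary equation for $x_2$ (assuming $w_2\neq0$) gives the line that will be drawn later:
$0=w_0+w_1x_1+w_2x_2\;\Rightarrow\;x_2=\dfrac{-w_0-w_1x_1}{w_2}$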

2 Gradient ascent training algorithm

According to the gradient ascent iteration formula, written in vectorized form over the whole data set:
$\theta:=\theta+\alpha X^T\left(Y-g(X\theta)\right)$
Here $X$ is the $m\times n$ data matrix (each row is a sample with the constant 1 prepended), $Y$ is the $m\times1$ column vector of labels, and $g$ is applied element-wise; the code below implements this formula directly.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
#@Time  : 2020/3/16 9:51
#@Author: fangyuan
#@File: logistic regression gradient ascent.py


import numpy as np

def loadDataSet():
    # Create dataset
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Read by line
    for line in fr.readlines():
        # Strip the newline and split the line into fields
        lineArr = line.strip().split()
        # Add data
        dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
        # Add tags
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    # Return
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

def gradAscent(dataMatIn,classLabels):
    # Convert to a NumPy matrix
    dataMatrix = np.mat(dataMatIn)
    # Convert to numpy's mat and transpose
    labelMat = np.mat(classLabels).transpose()
    # Returns the size of dataMatrix. m is the number of rows and n is the number of columns
    m,n = np.shape(dataMatrix)
    # Step size (learning rate), controls the magnitude of each update
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n,1))
    # Vectorization formula of gradient rise
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = (labelMat - h)
        weights = weights + alpha * dataMatrix.transpose() * error
    # Convert matrix to array, return weight array (optimal parameter)
    return weights.getA()

if __name__ == '__main__':
    dataMat,labelMat = loadDataSet()
    print(gradAscent(dataMat,labelMat))

The regression coefficients [w0, w1, w2] have now been solved for. With these parameters we can determine the separating line between the two classes and draw the decision boundary.
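As a quick sketch of how the solved coefficients could be used to classify a new point (the weights and sample below are placeholder values for illustration, not the ones returned by the training run):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(point, weights):
    # point is [x1, x2]; weights is [w0, w1, w2]; x0 is the constant 1
    z = weights[0] + weights[1] * point[0] + weights[2] * point[1]
    return 1 if sigmoid(z) > 0.5 else 0

w = [4.0, 0.5, -0.6]                 # placeholder regression coefficients
print(classify([1.0, -2.0], w))      # z = 4 + 0.5 + 1.2 = 5.7 -> class 1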

3 Draw the decision boundary

#!/usr/bin/env python
# -*- coding:utf-8 -*-
#@Time  : 2020/3/16 10:06
#@Author: fangyuan
#@File: logistic regression drawing decision boundary.py

import numpy as np
import matplotlib.pyplot as plt

def loadDataSet():
    # Create dataset
    dataMat = []
    # Create label list
    labelMat = []
    # Open file
    fr = open('Logistic')
    # Read by line
    for line in fr.readlines():
        # Strip the newline and split the line into fields
        lineArr = line.strip().split()
        # Add data
        dataMat.append([1.0,float(lineArr[0]),float(lineArr[1])])
        # Add tags
        labelMat.append(int(lineArr[2]))
    # Close file
    fr.close()
    # Return
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

def gradAscent(dataMatIn,classLabels):
    # Convert to a NumPy matrix
    dataMatrix = np.mat(dataMatIn)
    # Convert to numpy's mat and transpose
    labelMat = np.mat(classLabels).transpose()
    # Returns the size of dataMatrix. m is the number of rows and n is the number of columns
    m,n = np.shape(dataMatrix)
    # Step size (learning rate), controls the magnitude of each update
    alpha = 0.001
    # Maximum number of iterations
    maxCycles = 500
    weights = np.ones((n,1))
    # Vectorization formula of gradient rise
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = (labelMat - h)
        weights = weights + alpha * dataMatrix.transpose() * error
    # Convert matrix to array, return weight array (optimal parameter)
    return weights.getA()

def plotBestFit(weights):
    # Load dataset
    dataMat,labelMat = loadDataSet()
    # Convert to a NumPy array
    dataArr = np.array(dataMat)
    # Number of data
    n = np.shape(dataMat)[0]
    # Positive samples
    xcord1 = []
    ycord1 = []
    # Negative sample
    xcord2 = []
    ycord2 = []
    # Classify by dataset label
    for i in range(n):
        # 1 is a positive sample
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1])
            ycord1.append(dataArr[i,2])
        # 0 is a negative sample
        else:
            xcord2.append(dataArr[i,1])
            ycord2.append(dataArr[i,2])
    fig = plt.figure()
    # Add subplot
    ax = fig.add_subplot(111)
    # Draw positive sample
    ax.scatter(xcord1,ycord1,s = 20,c = 'red',marker = 's',alpha =.5)
    # Draw negative sample
    ax.scatter(xcord2,ycord2,s = 20,c = 'green',alpha=.5)
    x = np.arange(-3.0,3.0,0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x,y)
    # Drawing title
    plt.title('BestFit')
    # Drawing label
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

if __name__ == '__main__':
    dataMat,labelMat = loadDataSet()
    weights = gradAscent(dataMat,labelMat)
    print(weights)
    plotBestFit(weights)

The goal of logistic regression is to find the best-fit parameters of the nonlinear Sigmoid function, and the solution process is carried out by an optimization algorithm.

