Implementing the logistic regression algorithm for binary classification

This article is based on Tang Yudi's machine learning course; the code below reimplements the course content according to my own understanding of the material. The task: given a data set with four-dimensional features and two classes, use machine learning to solve for the model parameters iteratively.

Mathematical model

Logistic regression is an extension of linear regression: an intermediate function is applied to the output of the linear model to convert the fitted value into a probability, giving a probability for each class.
Regression model:

h_θ(x) = θᵀx = θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + θ₄x₄

The offset term can be treated as the parameter of an extra feature x₀ whose value defaults to 1, so when preparing the data set a column of all-ones features must be added. In ordinary linear regression, the optimal solution minimizes the error between the true values and the predicted values, which yields the optimal parameter θ; this is the principle of least squares. Logistic regression introduces the sigmoid function, whose independent variable is any real number and whose value lies in the range [0, 1]:

g(z) = 1 / (1 + e^(−z))

The linear model produces a raw prediction value, which is then passed through the sigmoid function; this completes the conversion from value to probability, turning regression into a classification task.

At this point, the logistic regression prediction model becomes:

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

Treating the prediction function as a probability, the integrated prediction model gives the probability of each sample's classification under the current θ. The optimal θ is therefore the one that makes the joint probability of all samples largest: multiplying the per-sample probabilities together yields the likelihood function, and the θ that maximizes the likelihood is the optimal parameter.
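
Written out (the standard derivation, consistent with the cost function implemented below), the per-sample probability, the likelihood, and the average negative log-likelihood that the code minimizes are:

P(y | x; θ) = h_θ(x)^y · (1 − h_θ(x))^(1−y)

L(θ) = ∏ᵢ P(yᵢ | xᵢ; θ)

J(θ) = −(1/m)·log L(θ) = −(1/m)·Σᵢ [ yᵢ·log h_θ(xᵢ) + (1 − yᵢ)·log(1 − h_θ(xᵢ)) ]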

The solution is found iteratively. After choosing an initial θ, the partial derivative with respect to each parameter is computed; taken together, the partial derivatives give the descent direction at the current parameters. The parameters are then updated along that direction by a certain step size, and the procedure repeats, approaching the optimal solution with every iteration.
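
Concretely, the partial derivative of the loss with respect to each parameter, and the update rule with step size α, are (these correspond to the gradient and descent functions below):

∂J(θ)/∂θⱼ = (1/m)·Σᵢ (h_θ(xᵢ) − yᵢ)·xᵢⱼ

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ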

Code implementation

The project implementation process can be summarized as follows:
1. sigmoid: the function that maps a value to a probability
2. model: returns the predicted values
3. cost: computes the loss under the given parameters
4. gradient: computes the gradient direction for each parameter
5. descent: updates the parameters
6. accuracy: computes the classification accuracy (a sketch is given after the descent function below)
Import the required modules

import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt

Sigmoid function

# The intermediate function that maps a value to a probability
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
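
As a quick sanity check (illustrative values only), the function behaves as expected at a few points:

print(sigmoid(0))                        # 0.5: a raw score of 0 maps to probability 0.5
print(sigmoid(np.array([-10, 0, 10])))   # approximately [0, 0.5, 1]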

Prediction function

# The prediction function. X is the full-sample feature array, Xita is the array of model parameters
def model(X, Xita):
    return sigmoid(np.dot(X, Xita.T))
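
Note that the code treats Xita as a one-dimensional array, so Xita.T is a no-op and model returns one probability per sample. A toy example (hypothetical values; three samples, four features plus the x₀ = 1 column):

X_demo = np.array([[1.0,  0.5,  1.2, -0.3,  2.0],
                   [1.0, -1.0,  0.8,  0.1,  0.5],
                   [1.0,  2.2, -0.7,  1.5, -1.1]])
Xita_demo = np.zeros(5)          # initial parameters, one per feature column
print(model(X_demo, Xita_demo))  # [0.5 0.5 0.5] before any training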

Loss function

# The loss function: the negative logarithm of the likelihood, averaged over the samples;
# it gradually decreases during the iterative process.
# X is the feature array, y is the label array, Xita is the parameter array
def cost(X, y, Xita):
    left = np.multiply(-y, np.log(model(X, Xita)))
    right = np.multiply(1 - y, np.log(1 - model(X, Xita)))
    return np.sum(left - right) / len(X)
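
One practical caveat: when the sigmoid saturates, model can return values numerically equal to 0 or 1, and np.log then produces infinities. A common safeguard (my own addition, not part of the course code) is to clip the predictions:

# Variant of cost that clips predictions away from 0 and 1 so the logarithms stay finite
def cost_stable(X, y, Xita, eps=1e-12):
    p = np.clip(model(X, Xita), eps, 1 - eps)
    left = np.multiply(-y, np.log(p))
    right = np.multiply(1 - y, np.log(1 - p))
    return np.sum(left - right) / len(X)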

Gradient direction of the parameters

# Compute the gradient of the loss with respect to each parameter
def gradient(X, y, Xita):
    # Initialize the gradient vector
    grad = np.zeros(Xita.shape)
    error = (model(X, Xita) - y).ravel()
    for j in range(len(Xita.ravel())):
        term = np.multiply(error, X[:, j])
        grad[j] = np.sum(term) / len(X)

    return grad
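
The loop form mirrors the formula term by term; an equivalent vectorized version (same result, computed in a single matrix product) would be:

# Vectorized equivalent: X.T dotted with the error yields all partial derivatives at once
def gradient_vectorized(X, y, Xita):
    error = model(X, Xita) - y
    return np.dot(X.T, error) / len(X)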

Other functions:

# Strategy 0: stop once the number of iterations reaches the threshold
stop_iter = 0
# Strategy 1: stop once the change in the loss value falls below the threshold
stop_cost = 1
# Strategy 2: stop once the gradient norm falls below the threshold
stop_grad = 2

# Decide whether to stop iterating; value is produced by the training loop, threshold is the stopping boundary
def stopjudgement(stoptype, value, threshold):
    if stoptype == stop_iter:
        return value > threshold
    elif stoptype == stop_cost:
        return abs(value[-1] - value[-2]) < threshold
    elif stoptype == stop_grad:
        # The norm of the gradient: the square root of the sum of squares
        return np.linalg.norm(value) < threshold
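
For example, the three strategies would be invoked as follows (illustrative values only):

stopjudgement(stop_iter, 5001, 5000)                     # True: the iteration count exceeds the threshold
stopjudgement(stop_cost, [0.6300, 0.6299], 1e-3)         # True: the last two losses differ by less than 1e-3
stopjudgement(stop_grad, np.array([1e-3, 2e-3]), 0.05)   # True: the gradient norm is below 0.05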

# Shuffle the data and split it into features and labels (the label is assumed to be the last column)
def datashuffle(data):
    np.random.shuffle(data)  # shuffles the rows in place
    col = data.shape[1]
    AX = data[:, 0:col-1]
    AY = data[:, col-1]
    return AX, AY

Parameter updating: the iterative solution process

# batchsize is the number of samples used to compute the gradient direction,
# stoptype selects the stopping strategy, threshold is the boundary that ends
# the loop, and alpha is the step size
def descent(data, xita, batchsize, stoptype, threshold, alpha):
    start = time.time()
    X, y = datashuffle(data)
    i = 0  # iteration count
    k = 0  # position of the current batch
    mycost = [cost(X, y, xita)]
    while True:
        # The batch size trades accuracy against speed: using the whole sample
        # gives an accurate direction but is slow, using a single sample is fast
        # but does not always move toward convergence, and a mini-batch is the
        # practical compromise
        grad = gradient(X[k:k+batchsize], y[k:k+batchsize], xita)
        k += batchsize
        if k >= len(data):
            k = 0
            X, y = datashuffle(data)
        # The update must subtract: step against the gradient
        xita = xita - alpha * grad
        mycost.append(cost(X, y, xita))
        i += 1
        if stoptype == stop_iter:
            value = i
        elif stoptype == stop_cost:
            value = mycost
        elif stoptype == stop_grad:
            value = grad
        if stopjudgement(stoptype, value, threshold):
            break
    end = time.time()
    spend = end - start

    return xita, i-1, mycost, grad, spend
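
Step 6 from the outline, the accuracy function, can be sketched as follows (a minimal version of my own, consistent with the functions above): classify with a 0.5 probability cutoff and count the matches against the labels.

# Fraction of samples whose thresholded prediction matches the label
def accuracy(X, y, Xita):
    predictions = (model(X, Xita) >= 0.5).astype(int)
    return np.mean(predictions == y)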

You can customize the step size and the batch size and choose the iteration-stopping strategy to obtain the final result. Different parameter choices affect the course of learning, and model performance will differ accordingly. When in doubt, try several combinations, express the results graphically, and select the best parameter combination. A run might be wired up as shown below.
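
As an illustration, here is one possible driver; the file name, column layout, and hyperparameter values are assumptions, since the original data set is not shown:

# Hypothetical driver: four feature columns followed by a 0/1 label column
data = pd.read_csv('data.csv', header=None).values
data = np.insert(data, 0, 1, axis=1)        # prepend the all-ones x0 column
xita = np.zeros(data.shape[1] - 1)          # one parameter per feature column

xita, iters, mycost, grad, spend = descent(data, xita, batchsize=16,
                                           stoptype=stop_cost,
                                           threshold=1e-6, alpha=0.001)

plt.plot(np.arange(len(mycost)), mycost)    # loss value against iteration count
plt.xlabel('iteration')
plt.ylabel('cost')
plt.show()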

Here is one of the results. The plot shows how the loss function changes with the number of iterations; the model is optimal when the loss reaches its minimum. Practical experience shows that more iterations yield a more capable model. When it is unclear how many iterations are enough, blindly fixing an iteration count as the stopping threshold is not advisable; instead, the change in the loss or in the gradient can serve as the criterion, stopping once it falls below a threshold. When batchsize is set to 1, the loss function is very unstable and does not always move toward convergence, so it is better to choose a small mini-batch that balances speed and accuracy. The learning step size also has an effect: as the saying goes, you cannot get fat on one mouthful, so a smaller learning rate is naturally the safer choice.

Afterword

This is only a simple implementation of logistic regression, meant to show the low-level mechanics the machine carries out during learning. A complete application of logistic regression also needs data preprocessing and cross-validation, and the model here is evaluated by accuracy alone, while other evaluation measures such as generalization ability and recall matter too. sklearn, a common third-party machine learning library, has many modules and methods that can be called directly. The purpose of this article is only to help understand the complete workflow of logistic regression and the principles behind the algorithm's implementation.
