This article follows Tang Yudi's machine learning course: it records my understanding of the lectures and reimplements the code from the course material. The task is a dataset with four-dimensional features and two classes, and the goal is to learn the model parameters through iteration.

## mathematical model

Logistic regression extends linear regression: an intermediate function is applied to the output of the linear model, converting the fitted value into a probability and yielding a probability for each class.

Regression model:
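The original formula image appears to be missing here; a reconstruction of the standard linear regression hypothesis the surrounding text describes:

```latex
% Linear regression hypothesis: theta is the parameter vector, x the feature
% vector, with x_0 = 1 absorbing the bias term described below.
h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^{T} x
```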

The bias term can be folded in as the parameter of x0, whose value defaults to 1; when preparing the dataset, a column of features with all values equal to 1 must therefore be added. In ordinary linear regression the optimal solution minimizes the error between the true values and the predicted values, which yields the optimal parameters θ; this is the principle of least squares. Logistic regression introduces the sigmoid function, whose input can be any real number and whose output lies in the open interval (0, 1).
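The sigmoid function the paragraph refers to (reconstructed, as the original formula image appears to be missing):

```latex
% Sigmoid: maps any real z into the open interval (0, 1)
g(z) = \frac{1}{1 + e^{-z}}, \qquad z \in \mathbb{R}, \quad g(z) \in (0, 1)
```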

Linear regression gives us a predicted value; mapping that value through the sigmoid function converts it into a probability, which turns the regression into a classification task.

At this point, the logistic regression prediction model is transformed into:
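The formula image is missing here as well; the prediction model the text describes is the linear model passed through the sigmoid:

```latex
% Logistic regression hypothesis
h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}
```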

Since the prediction function represents a probability, the combined prediction model gives the probability of each sample's class under the current θ. The optimal θ is therefore the one that maximizes the joint probability of all samples: multiplying the per-sample probabilities gives the likelihood function, and the θ that maximizes the likelihood is the optimal parameter.
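Written out (a reconstruction consistent with the cost function implemented below), the per-sample probability, the likelihood, and the averaged negative log-likelihood are:

```latex
% Per-sample probability, combining both classes into one expression
P(y \mid x; \theta) = h_\theta(x)^{y} \, \bigl(1 - h_\theta(x)\bigr)^{1-y}

% Likelihood over m samples
L(\theta) = \prod_{i=1}^{m} h_\theta(x_i)^{y_i} \bigl(1 - h_\theta(x_i)\bigr)^{1-y_i}

% Taking the logarithm and the minus sign, averaged over the samples
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \Bigl[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \Bigr]
```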

The solution is found iteratively. After initializing θ, the partial derivative with respect to each parameter is computed; taken together, the partial derivatives give the descent direction under the current parameters. The parameters are then updated along that direction by a certain step size, and the procedure repeats, approaching the optimal solution with each iteration.
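The partial derivative and update rule described above, written out (a reconstruction consistent with the `gradient` and `descent` code below):

```latex
% Partial derivative of the averaged negative log-likelihood
\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \bigl( h_\theta(x_i) - y_i \bigr) x_{ij}

% Update rule with step size (learning rate) alpha
\theta_j := \theta_j - \alpha \, \frac{\partial J}{\partial \theta_j}
```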

## code implementation

The project implementation process can be summarized as follows:

1. `sigmoid`: maps a value to a probability

2. `model`: returns the predicted value

3. `cost`: computes the loss under the current parameters

4. `gradient`: computes the gradient direction for each parameter

5. `descent`: updates the parameters

6. `accuracy`: computes the classification accuracy

Import the required modules

```python
import time

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
```

sigmoid function

```python
# Intermediate function mapping a value to a probability
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
```
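As a quick sanity check (a standalone sketch assuming only NumPy), sigmoid(0) should be exactly 0.5, and large inputs should saturate toward 0 or 1:

```python
import numpy as np

def sigmoid(z):
    # Map any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))           # 0.5
print(sigmoid(10) > 0.999)  # True: large positive inputs saturate near 1
print(sigmoid(-10) < 0.001) # True: large negative inputs saturate near 0
```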

Predicted result value

```python
# Prediction: X is the full sample feature matrix, Xita is the parameter array
def model(X, Xita):
    return sigmoid(np.dot(X, Xita.T))
```

Loss function

```python
# Loss function: the likelihood taken through a logarithm and a minus sign,
# averaged over the samples; it decreases as the iteration proceeds.
# X is the feature data, y the label data, Xita the parameter array.
def cost(X, y, Xita):
    left = np.multiply(-y, np.log(model(X, Xita)))
    right = np.multiply(1 - y, np.log(1 - model(X, Xita)))
    return np.sum(left - right) / len(X)
```

Define parameter gradient descent direction

```python
# Gradient descent direction for each parameter
def gradient(X, y, Xita):
    # Initialize the gradient vector
    grad = np.zeros(Xita.shape)
    error = (model(X, Xita) - y).ravel()
    for j in range(len(Xita.ravel())):
        term = np.multiply(error, X[:, j])
        grad[j] = np.sum(term) / len(X)
    return grad
```
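The per-parameter loop can also be written as a single matrix product, grad = Xᵀ(h − y)/m. A standalone sketch (the data and names here are illustrative, not the article's dataset) confirming the two forms agree:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def model(X, Xita):
    return sigmoid(np.dot(X, Xita.T))

def gradient_loop(X, y, Xita):
    # Loop form, one parameter at a time (as in the article)
    grad = np.zeros(Xita.shape)
    error = (model(X, Xita) - y).ravel()
    for j in range(len(Xita.ravel())):
        grad[j] = np.sum(np.multiply(error, X[:, j])) / len(X)
    return grad

def gradient_vec(X, y, Xita):
    # Vectorized form: grad = X^T (h - y) / m
    error = model(X, Xita) - y
    return X.T @ error / len(X)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 4))])  # bias column + 4 features
y = rng.integers(0, 2, size=8).astype(float)
Xita = rng.normal(size=5)

print(np.allclose(gradient_loop(X, y, Xita), gradient_vec(X, y, Xita)))  # True
```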

Other functions:

```python
# Type 0: stop when the iteration count reaches the limit
stop_iter = 0
# Type 1: stop when the change in the loss falls below a threshold
stop_cost = 1
# Type 2: stop when the gradient norm falls below a threshold
stop_grad = 2

# Stopping rule: value is produced during training, threshold is the boundary
def stopjudgement(stoptype, value, threshold):
    if stoptype == stop_iter:
        return value > threshold
    elif stoptype == stop_cost:
        return abs(value[-1] - value[-2]) < threshold
    elif stoptype == stop_grad:
        # The L2 norm of the gradient: square, sum, then take the square root
        return np.linalg.norm(value) < threshold

# Shuffle the data and split it into features and labels
def datashuffle(data):
    np.random.shuffle(data)
    col = data.shape[1]
    AX = data[:, 0:col - 1]
    AY = data[:, col - 1]
    return AX, AY
```
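A quick illustration of the three stopping rules (a standalone sketch; the numbers are made up):

```python
import numpy as np

stop_iter, stop_cost, stop_grad = 0, 1, 2

def stopjudgement(stoptype, value, threshold):
    if stoptype == stop_iter:
        return value > threshold                        # iteration budget exceeded
    elif stoptype == stop_cost:
        return abs(value[-1] - value[-2]) < threshold   # loss has flattened out
    elif stoptype == stop_grad:
        return np.linalg.norm(value) < threshold        # gradient nearly zero

print(stopjudgement(stop_iter, 5001, 5000))                     # True: over budget
print(stopjudgement(stop_cost, [0.6931, 0.6931 - 1e-8], 1e-6))  # True: loss is flat
print(stopjudgement(stop_grad, np.array([0.3, 0.4]), 0.05))     # False: norm is 0.5
```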

Parameter updating, iterative solution process

```python
# batchsize: number of samples used per gradient step;
# threshold: boundary for the stopping rule; alpha: the step size
def descent(data, xita, batchsize, stoptype, threshold, alpha):
    start = time.time()
    X, y = datashuffle(data)
    i = 0  # Iteration count
    k = 0  # Batch cursor
    mycost = [cost(X, y, xita)]
    while True:
        # The batch size trades speed for accuracy: using the full sample
        # gives an accurate direction but takes longer; using a single
        # sample is fast but does not always move toward convergence.
        # Mini-batch gradient descent, using a subset, is the practical choice.
        grad = gradient(X[k:k + batchsize], y[k:k + batchsize], xita)
        k += batchsize
        if k >= len(data):
            k = 0
            X, y = datashuffle(data)
        # The update must use a minus sign (descent, not ascent)
        xita = xita - alpha * grad
        mycost.append(cost(X, y, xita))
        i += 1
        if stoptype == stop_iter:
            value = i
        elif stoptype == stop_cost:
            value = mycost
        elif stoptype == stop_grad:
            value = grad
        if stopjudgement(stoptype, value, threshold):
            break
    end = time.time()
    spend = end - start
    return xita, i - 1, mycost, grad, spend
```

The step size, batch size, and stopping strategy can all be customized to obtain the final result. Different choices affect the learning process and yield models of different quality; a practical approach is to try several combinations, visualize the results, and pick the best one. Below is one of the results.
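To make the pipeline concrete, here is a condensed, self-contained sketch of the whole process, using synthetic data and full-batch vectorized updates; all names and numbers are illustrative, not the article's actual dataset:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def model(X, Xita):
    return sigmoid(X @ Xita)

def cost(X, y, Xita):
    h = model(X, Xita)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / len(X)

def gradient(X, y, Xita):
    # Vectorized gradient: X^T (h - y) / m
    return X.T @ (model(X, Xita) - y) / len(X)

# Synthetic two-class data: four features plus a leading bias column of ones
rng = np.random.default_rng(42)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 4))])
true_theta = np.array([0.5, 2.0, -2.0, 1.0, -1.0])  # made-up "true" separator
y = (X @ true_theta > 0).astype(float)

# Full-batch gradient descent with a fixed iteration budget as the stop rule
Xita = np.zeros(5)
alpha = 0.1
for _ in range(500):
    Xita = Xita - alpha * gradient(X, y, Xita)

pred = (model(X, Xita) >= 0.5).astype(float)
accuracy = (pred == y).mean()
print(f"final cost {cost(X, y, Xita):.4f}, training accuracy {accuracy:.2f}")
```

A smaller alpha or a larger iteration budget trades time for a lower final loss, which is exactly the trade-off discussed below.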

The graph shows how the loss changes with the number of iterations; the model is best when the loss reaches its minimum. In general, more iterations drive the loss lower, but when it is unclear how many iterations are enough, blindly fixing an iteration count as the stopping rule is unwise. Instead, the change in the loss or the gradient norm can serve as the criterion, stopping once it falls below a threshold. With a batchsize of 1 the loss curve is very unstable and does not always move toward convergence, so a small mini-batch, which balances speed and accuracy, is the better choice. The learning rate also matters: as the saying goes, you cannot get fat on a single mouthful, so a smaller learning rate is usually the safer choice.

## Afterword

This is only a simple implementation of logistic regression, meant to expose the underlying computation behind a machine learning workflow. A complete application of logistic regression also needs data preprocessing and cross-validation, and the model here is evaluated only by accuracy, while other metrics such as recall and generalization ability also matter. sklearn, a common third-party machine learning library, provides many modules and methods that can be called directly. The purpose of this article is only to help understand the complete workflow of logistic regression and the principle behind the algorithm.