# Two solutions to overfitting

This post mainly studies the book *Dive into Deep Learning* (PyTorch edition), so most of its content comes from that book. The framework is PyTorch and the development tool is PyCharm.
Reference: the hands-on deep learning notebooks at
https://github.com/zergtant/pytorch-handbook

## Method one: weight decay (L2 norm regularization)

L2 norm regularization adds an L2 norm penalty term to the model's original loss function. The penalty term is the sum of squares of the elements of the weight parameter, multiplied by a positive constant.

Take the loss function of linear regression as an example. If the original loss is ℓ(w, b), the regularized loss is

ℓ(w, b) + (λ/2)·∥w∥²

with hyperparameter λ > 0; for a two-dimensional weight w = (w1, w2), the penalty uses ∥w∥² = w1² + w2². The penalty term is smallest when all weights are 0, and a larger λ gives the penalty more influence on the loss, which generally pushes the elements of the learned weight closer to 0. In mini-batch stochastic gradient descent with learning rate η and mini-batch B, the original weight update

w ← w − (η/|B|)·Σ_{i∈B} ∇_w ℓ⁽ⁱ⁾(w, b)

becomes

w ← (1 − ηλ)·w − (η/|B|)·Σ_{i∈B} ∇_w ℓ⁽ⁱ⁾(w, b).

L2 norm regularization thus first shrinks the weight by a factor smaller than 1 and then subtracts the gradient of the unpenalized loss, which is why it is also called weight decay. By penalizing model parameters with large absolute values, weight decay constrains the model to be learned. (The sum of squares of the bias elements can also be added to the penalty term.)
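The equivalence between adding the L2 penalty and multiplicatively shrinking the weights can be checked numerically. The sketch below (a toy loss with illustrative values, not the book's example) compares one step of `torch.optim.SGD` with `weight_decay` against the manual update w ← (1 − ηλ)·w − η·∇ℓ:

```python
import torch

torch.manual_seed(0)
eta, lam = 0.1, 0.5                      # learning rate, weight-decay strength

w = torch.randn(3, requires_grad=True)
w0 = w.detach().clone()                  # snapshot of w before the update
x = torch.randn(3)

loss = (w * x).sum()                     # toy loss; its gradient w.r.t. w is x
loss.backward()

# PyTorch's weight_decay adds lam * w to the gradient before the step
torch.optim.SGD([w], lr=eta, weight_decay=lam).step()

# Manual update: shrink the old weight, then subtract the unpenalized gradient
w_manual = (1 - eta * lam) * w0 - eta * x

print(torch.allclose(w.detach(), w_manual))  # True
```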
High-dimensional linear regression example:
Let the feature dimension of a data sample be p. For any sample in the training or test set, the label is generated by y = 0.05 + Σᵢ₌₁ᵖ 0.01·xᵢ + ε, where the noise ε follows a normal distribution with mean 0 and standard deviation 0.01.

```
'''
Citation:
@book{zhang2019dive,
title={Dive into Deep Learning},
author={Aston Zhang and Zachary C. Lipton and Mu Li and Alexander J. Smola},
note={\url{http://www.d2l.ai}},
year={2020}
}
'''
import torch
import torch.nn as nn
import numpy as np
import sys
import matplotlib.pyplot as plt
sys.path.append("..")
from IPython import display
#Generate data
n_train, n_test, num_inputs = 20, 100, 200
true_w, true_b = torch.ones(num_inputs, 1) * 0.01, 0.05
features = torch.randn((n_train + n_test, num_inputs))
labels = torch.matmul(features, true_w) + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]
#Initialize model parameters
def init_params():
    w = torch.randn((num_inputs, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    return [w, b]
#Define L2 norm penalty
def l2_penalty(w):
    return (w**2).sum() / 2
# Model
def linreg(X, w, b):
    return torch.mm(X, w) + b

# Loss function
def squared_loss(y_hat, y):
    # Note that a vector is returned here. Also, MSELoss in PyTorch does not divide by 2
    return ((y_hat - y.view(y_hat.size())) ** 2) / 2

# Define training and testing
batch_size, num_epochs, lr = 1, 100, 0.003
net, loss = linreg, squared_loss
dataset = torch.utils.data.TensorDataset(train_features, train_labels)
train_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)
# Plotting helpers
def set_figsize(figsize=(3.5, 2.5)):
    display.set_matplotlib_formats('svg')
    # Set the size of the figure
    plt.rcParams['figure.figsize'] = figsize

def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
             legend=None, figsize=(3.5, 2.5)):
    set_figsize(figsize)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.semilogy(x_vals, y_vals)
    if x2_vals and y2_vals:
        plt.semilogy(x2_vals, y2_vals, linestyle=':')
        plt.legend(legend)
    plt.show()
def fit_and_plot_pytorch(wd):
    # When constructing the optimizer instance, the weight-decay hyperparameter is specified
    # via the weight_decay argument. By default PyTorch decays both weights and biases, so we
    # build separate optimizer instances for the weights and the biases; only the weights decay.
    net = nn.Linear(num_inputs, 1)
    nn.init.normal_(net.weight, mean=0, std=1)
    nn.init.normal_(net.bias, mean=0, std=1)
    optimizer_w = torch.optim.SGD(params=[net.weight], lr=lr, weight_decay=wd)  # decay on the weights
    optimizer_b = torch.optim.SGD(params=[net.bias], lr=lr)  # no decay on the bias
    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:
            l = loss(net(X), y).mean()
            optimizer_w.zero_grad()
            optimizer_b.zero_grad()
            l.backward()
            # Call step on both optimizer instances to update the weights and the bias respectively
            optimizer_w.step()
            optimizer_b.step()
        train_ls.append(loss(net(train_features), train_labels).mean().item())
        test_ls.append(loss(net(test_features), test_labels).mean().item())
    semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
             range(1, num_epochs + 1), test_ls, ['train', 'test'])
    print('L2 norm of w:', net.weight.data.norm().item())

fit_and_plot_pytorch(0)
fit_and_plot_pytorch(3)

```
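The `l2_penalty` helper defined above is what a from-scratch version of weight decay would add to the loss before calling `backward()`. A minimal, self-contained sketch of that step (the data shapes and values here are illustrative, not the book's experiment):

```python
import torch

def l2_penalty(w):
    return (w ** 2).sum() / 2

# Toy data and parameters (illustrative shapes)
X = torch.randn(4, 3)
y = torch.randn(4, 1)
w = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lambd = 3

# Penalized squared loss: the L2 term is simply added to the data loss
y_hat = torch.mm(X, w) + b
l = (((y_hat - y) ** 2) / 2).mean() + lambd * l2_penalty(w)
l.backward()  # w.grad now includes the lambd * w contribution from the penalty
```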

The two runs give two results. The first does not use weight decay and shows overfitting. The second uses weight decay: although the training error increases, the error on the test set decreases.

```
L2 norm of w: 13.700491905212402
L2 norm of w: 0.0327015146613121

Process finished with exit code 0
```

Result chart comparison.

## Method two: dropout

When dropout is applied to a hidden layer, each of its hidden units (the hidden layer in the book's figure has 5 units) is dropped independently. If the drop probability is p, then with probability p a hidden unit hᵢ is cleared to zero, and with probability 1 − p it is divided by 1 − p (stretched), so that the expected value of hᵢ is unchanged. The drop probability is a hyperparameter of the dropout method. During training, randomly dropping hidden units acts as a regularizer against overfitting. When evaluating the model, however, dropout is generally not applied, so that the results are deterministic. The example below uses the Fashion-MNIST data set and defines two hidden layers with 256 outputs each.
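The claim that dividing the survivors by 1 − p preserves the expected value of each unit can be checked numerically. A minimal standalone sketch of this inverted-dropout rule (the all-ones input is illustrative):

```python
import torch

def dropout(X, drop_prob):
    # Inverted dropout: zero with probability p, scale survivors by 1/(1-p)
    keep_prob = 1 - drop_prob
    if keep_prob == 0:
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) < keep_prob).float()
    return mask * X / keep_prob

X = torch.ones(1000000)
out = dropout(X, 0.5)
print(out.mean().item())                  # close to 1.0: the expectation is preserved
print((out == 0).float().mean().item())   # roughly half of the units are zeroed
```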

```
'''
Citation:
@book{zhang2019dive,
title={Dive into Deep Learning},
author={Aston Zhang and Zachary C. Lipton and Mu Li and Alexander J. Smola},
note={\url{http://www.d2l.ai}},
year={2020}
}
'''
import torch
import torchvision
import torch.nn as nn
import numpy as np
import sys

def dropout(X, drop_prob):
    X = X.float()
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # In this case all elements are dropped
    if keep_prob == 0:
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) < keep_prob).float()
    return mask * X / keep_prob

# Load the Fashion-MNIST data set (helper adapted from the book's d2lzh_pytorch utilities)
def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())

    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    if sys.platform.startswith('win'):
        num_workers = 0  # 0 means no extra processes are used to speed up data reading
    else:
        num_workers = 4
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)

    return train_iter, test_iter
# Define model parameters
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
W1 = torch.tensor(np.random.normal(0, 0.01, size=(num_inputs, num_hiddens1)), dtype=torch.float, requires_grad=True)
b1 = torch.zeros(num_hiddens1, requires_grad=True)
W2 = torch.tensor(np.random.normal(0, 0.01, size=(num_hiddens1, num_hiddens2)), dtype=torch.float, requires_grad=True)
b2 = torch.zeros(num_hiddens2, requires_grad=True)
W3 = torch.tensor(np.random.normal(0, 0.01, size=(num_hiddens2, num_outputs)), dtype=torch.float, requires_grad=True)
b3 = torch.zeros(num_outputs, requires_grad=True)
params = [W1, b1, W2, b2, W3, b3]

# Define the model
drop_prob1, drop_prob2 = 0.2, 0.5
def net(X, is_training=True):
    X = X.view(-1, num_inputs)
    H1 = (torch.matmul(X, W1) + b1).relu()
    if is_training:  # Use dropout only when training the model
        H1 = dropout(H1, drop_prob1)  # Add a dropout layer after the first fully connected layer
    H2 = (torch.matmul(H1, W2) + b2).relu()
    if is_training:
        H2 = dropout(H2, drop_prob2)  # Add a dropout layer after the second fully connected layer
    return torch.matmul(H2, W3) + b3

# Model evaluation
def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        if isinstance(net, torch.nn.Module):
            net.eval()  # Evaluation mode, which turns dropout off
            acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            net.train()  # Change back to training mode
        else:  # Custom model
            if 'is_training' in net.__code__.co_varnames:  # If net has the parameter is_training
                # Set is_training to False
                acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item()
            else:
                acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n

# Train and test the model
num_epochs, lr, batch_size = 5, 100.0, 256
loss = torch.nn.CrossEntropyLoss()
# Define the training update (lr looks large because sgd divides the gradient by batch_size)
def sgd(params, lr, batch_size):
    for param in params:
        param.data -= lr * param.grad / batch_size  # Note that param.data is used when updating param

def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()  # Sum of the loss over the mini-batch

            # Zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()

            l.backward()
            if optimizer is None:
                sgd(params, lr, batch_size)
            else:
                optimizer.step()  # Used in the section "Concise implementation of softmax regression"

            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))  # Average training loss/accuracy and test accuracy

train_iter, test_iter = load_data_fashion_mnist(batch_size)
train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

```

Result output

```
epoch 1, loss 0.0042, train acc 0.583, test acc 0.721
epoch 2, loss 0.0022, train acc 0.791, test acc 0.776
epoch 3, loss 0.0019, train acc 0.826, test acc 0.811
epoch 4, loss 0.0017, train acc 0.842, test acc 0.832
epoch 5, loss 0.0016, train acc 0.852, test acc 0.822

Process finished with exit code 0
```
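When dropout is used through PyTorch's built-in `nn.Dropout` layer instead of a hand-written function, the train/eval distinction handled above via `is_training` is automatic: in `eval()` mode the layer becomes the identity. A minimal sketch:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: randomly zeros entries and scales survivors by 1/(1-p)
print(drop(x))  # each entry is 0.0 or 2.0

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # identical to x
```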

Posted on Wed, 12 Feb 2020 04:41:51 -0800 by baw