Two Solutions to Overfitting

This post works through the book Dive into Deep Learning (PyTorch edition), so most of the content comes from that book. The framework is PyTorch and the development tool is PyCharm.
Reference: Dive into Deep Learning

Method 1: Weight Decay (L2 Norm Regularization)

L2 norm regularization adds an L2 norm penalty term to the model's original loss function. The L2 norm penalty term is the sum of squares of each element of the model's weight parameters, multiplied by a positive constant.
Take the loss function of linear regression as an example. The original loss function is

\ell(w_1, w_2, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)} \right)^2

After adding the L2 norm penalty term, the loss function becomes

\ell(w_1, w_2, b) + \frac{\lambda}{2n} \|\mathbf{w}\|^2, \quad \text{where } \|\mathbf{w}\|^2 = w_1^2 + w_2^2

The hyperparameter \lambda > 0. When the weight parameters are all 0, the penalty term is smallest; when \lambda is large, the penalty term has a greater influence on the loss function, which generally drives the elements of the learned weight parameters closer to 0. In mini-batch stochastic gradient descent, the original weight update for linear regression is

w_1 \leftarrow w_1 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} x_1^{(i)} \left( x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)} \right)

(and likewise for w_2). With the penalty term, the update becomes

w_1 \leftarrow \left( 1 - \frac{\eta \lambda}{|\mathcal{B}|} \right) w_1 - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} x_1^{(i)} \left( x_1^{(i)} w_1 + x_2^{(i)} w_2 + b - y^{(i)} \right)

L2 norm regularization first multiplies the weight by a factor smaller than 1 and then subtracts the penalty-free gradient, so it is also called weight decay. By penalizing model parameters with large absolute values, weight decay constrains the model to be learned. (The sum of squares of the bias elements can also be added to the penalty term.)
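As a quick sanity check (not from the book), one optimizer step with PyTorch's `weight_decay` argument should match the shrink-then-subtract update described above; note that in PyTorch's convention the decay factor per step is `1 - lr * weight_decay`:

```python
import torch

# Sketch: verify that one SGD step with weight_decay=wd matches the manual
# update w <- (1 - lr*wd) * w - lr * grad, for a loss whose gradient is simple.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
lr, wd = 0.1, 0.5
x = torch.ones(3)  # fixed input, so d(loss)/dw = x

optimizer = torch.optim.SGD([w], lr=lr, weight_decay=wd)
w_before = w.detach().clone()

loss = (w * x).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# weight_decay adds wd * w to the gradient before the update
manual = (1 - lr * wd) * w_before - lr * x
assert torch.allclose(w.detach(), manual)
```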
High-dimensional linear regression example:
Let the feature dimension of a data sample be p. For any sample in the training and test sets, the label is generated by

y = 0.05 + \sum_{i=1}^{p} 0.01 \, x_i + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 0.01^2)

# Reference: Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola, "Dive into Deep Learning"
import torch
import torch.utils.data
import torch.nn as nn
import numpy as np
import sys
import matplotlib.pyplot as plt
from IPython import display
#Generate data
n_train, n_test, num_inputs = 20, 100, 200
true_w, true_b = torch.ones(num_inputs, 1) * 0.01, 0.05
features = torch.randn((n_train + n_test, num_inputs))
labels = torch.matmul(features, true_w) + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)
train_features, test_features = features[:n_train, :], features[n_train:, :]
train_labels, test_labels = labels[:n_train], labels[n_train:]
#Initialize model parameters
def init_params():
    w = torch.randn((num_inputs, 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    return [w, b]
#Define L2 norm penalty
def l2_penalty(w):
    return (w**2).sum() / 2
def linreg(X, w, b):
    return, w) + b

#loss function
def squared_loss(y_hat, y):
    # Note that a vector is returned here. Also, MSELoss in PyTorch is not divided by 2
    return ((y_hat - y.view(y_hat.size())) ** 2) / 2

#Define training and testing
batch_size, num_epochs, lr = 1, 100, 0.003
net, loss = linreg, squared_loss
dataset =, train_labels)
train_iter =, batch_size, shuffle=True)
#Drawing graphics
def set_figsize(figsize=(3.5, 2.5)):
    # Set the size of the drawing
    plt.rcParams['figure.figsize'] = figsize
def semilogy(x_vals, y_vals, x_label, y_label, x2_vals=None, y2_vals=None,
             legend=None, figsize=(3.5, 2.5)):
    set_figsize(figsize)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.semilogy(x_vals, y_vals)
    if x2_vals and y2_vals:
        plt.semilogy(x2_vals, y2_vals, linestyle=':')
        plt.legend(legend)
def fit_and_plot_pytorch(wd):
    # When constructing the optimizer instance, the weight decay hyperparameter
    # is specified via the weight_decay argument. By default PyTorch would decay
    # both weights and biases; constructing separate optimizer instances for the
    # weight and the bias lets us decay only the weight.
    net = nn.Linear(num_inputs, 1)
    nn.init.normal_(net.weight, mean=0, std=1)
    nn.init.normal_(net.bias, mean=0, std=1)
    optimizer_w = torch.optim.SGD(params=[net.weight], lr=lr, weight_decay=wd)  # decays the weight parameters
    optimizer_b = torch.optim.SGD(params=[net.bias], lr=lr)  # does not decay the bias parameter
    train_ls, test_ls = [], []
    for _ in range(num_epochs):
        for X, y in train_iter:
            l = loss(net(X), y).mean()
            optimizer_w.zero_grad()
            optimizer_b.zero_grad()
            l.backward()
            # Call step on both optimizer instances to update the weight and the bias
            optimizer_w.step()
            optimizer_b.step()
        train_ls.append(loss(net(train_features), train_labels).mean().item())
        test_ls.append(loss(net(test_features), test_labels).mean().item())
    semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'loss',
             range(1, num_epochs + 1), test_ls, ['train', 'test'])
    print('L2 norm of w:',

fit_and_plot_pytorch(0)  # no weight decay
fit_and_plot_pytorch(3)  # weight decay (the book uses wd = 3 here)

Two results: without weight decay, overfitting appears; with weight decay, the training error increases, but the error on the test set decreases.

L2 norm of w: 13.700491905212402
L2 norm of w: 0.0327015146613121

Process finished with exit code 0

Comparison of the resulting plots

Method 2: Dropout

When dropout is applied to a hidden layer (the hidden layer shown in the figure has 5 hidden units), each hidden unit h_i of that layer is cleared to zero with probability p and, with probability 1 - p, divided by 1 - p (stretched). Formally, let \xi_i be 0 with probability p and 1 with probability 1 - p; then

h_i' = \frac{\xi_i}{1 - p} h_i, \quad \text{so} \quad E[h_i'] = \frac{E[\xi_i]}{1 - p} h_i = h_i,

i.e. dropout does not change the expected value of its input. The dropout probability p is a hyperparameter of the method. Randomly dropping hidden units during training acts as regularization against overfitting. When evaluating the model, however, dropout is generally not applied, so that the results are deterministic.
The example uses the Fashion-MNIST dataset. The two hidden layers are defined with 256 outputs each.

import torch
import torch.utils.data
import torchvision
import torch.nn as nn
import numpy as np
import sys
def dropout(X, drop_prob):
    X = X.float()
    assert 0 <= drop_prob <= 1
    keep_prob = 1 - drop_prob
    # In this case, all elements are dropped
    if keep_prob == 0:
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) < keep_prob).float()  # uniform noise (torch.rand, not torch.randn)
    return mask * X / keep_prob
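The dropout helper can be sanity-checked on a small tensor: with p = 0 the input is unchanged, with p = 1 everything is zeroed, and with p = 0.5 each element is either zeroed or doubled (the function is repeated here so the snippet runs on its own):

```python
import torch

# Self-contained check of the dropout helper defined above.
def dropout(X, drop_prob):
    X = X.float()
    keep_prob = 1 - drop_prob
    if keep_prob == 0:
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) < keep_prob).float()
    return mask * X / keep_prob

X = torch.arange(16, dtype=torch.float).view(2, 8)
assert torch.equal(dropout(X, 0), X)                     # p=0: unchanged
assert torch.equal(dropout(X, 1), torch.zeros_like(X))   # p=1: all zeroed
out = dropout(X, 0.5)                                    # p=0.5: 0 or stretched by 1/(1-p)=2
assert ((out == 0) | (out == 2 * X)).all()
```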

#Data loading
def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    if sys.platform.startswith('win'):
        num_workers = 0  # 0 means no extra processes are used to speed up data reading
    else:
        num_workers = 4
    train_iter =, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    test_iter =, batch_size=batch_size, shuffle=False, num_workers=num_workers)

    return train_iter, test_iter
#Define model parameters
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
W1 = torch.tensor(np.random.normal(0, 0.01, size=(num_inputs, num_hiddens1)), dtype=torch.float, requires_grad=True)
b1 = torch.zeros(num_hiddens1, requires_grad=True)
W2 = torch.tensor(np.random.normal(0, 0.01, size=(num_hiddens1, num_hiddens2)), dtype=torch.float, requires_grad=True)
b2 = torch.zeros(num_hiddens2, requires_grad=True)
W3 = torch.tensor(np.random.normal(0, 0.01, size=(num_hiddens2, num_outputs)), dtype=torch.float, requires_grad=True)
b3 = torch.zeros(num_outputs, requires_grad=True)
params = [W1, b1, W2, b2, W3, b3]

# Define the model
drop_prob1, drop_prob2 = 0.2, 0.5
def net(X, is_training=True):
    X = X.view(-1, num_inputs)
    H1 = (torch.matmul(X, W1) + b1).relu()
    if is_training:  # only apply dropout when training the model
        H1 = dropout(H1, drop_prob1)  # add a dropout layer after the first fully connected layer
    H2 = (torch.matmul(H1, W2) + b2).relu()
    if is_training:
        H2 = dropout(H2, drop_prob2)  # add a dropout layer after the second fully connected layer
    return torch.matmul(H2, W3) + b3
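For comparison (a sketch, not code from this post), the same network can be written with PyTorch's built-in nn.Dropout layers; calling net.eval() then disables dropout automatically, and net.train() re-enables it:

```python
import torch.nn as nn

# Concise equivalent of the model above using nn.Dropout; the dropout layers
# are active in training mode and switched off in evaluation mode.
net_concise = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 10),
)
```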

#Model assessment
def evaluate_accuracy(data_iter, net):
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        if isinstance(net, torch.nn.Module):
            net.eval() #Evaluation mode, which turns dropout off
            acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            net.train() # Change back to training mode
        else:  # custom model
            if 'is_training' in net.__code__.co_varnames:  # the model has an is_training parameter
                # set is_training to False
                acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item()
            else:
                acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n

#Training and testing models
num_epochs, lr, batch_size = 5, 100.0, 256
loss = torch.nn.CrossEntropyLoss()
# Define training
# Mini-batch stochastic gradient descent
def sgd(params, lr, batch_size):
    for param in params: -= lr * param.grad / batch_size  # note that is used to update param in place

def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()  # sum of the loss over the mini-batch

            # zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
          # zero the gradient of each parameter

            l.backward()
            if optimizer is None:
                sgd(params, lr, batch_size)
            else:
                optimizer.step()  # used in the "concise implementation of softmax regression" section

            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))  # average training loss, training accuracy, test accuracy
train_iter, test_iter = load_data_fashion_mnist(batch_size)
train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params, lr)

Result output

epoch 1, loss 0.0042, train acc 0.583, test acc 0.721
epoch 2, loss 0.0022, train acc 0.791, test acc 0.776
epoch 3, loss 0.0019, train acc 0.826, test acc 0.811
epoch 4, loss 0.0017, train acc 0.842, test acc 0.832
epoch 5, loss 0.0016, train acc 0.852, test acc 0.822

Process finished with exit code 0

Posted on Wed, 12 Feb 2020 04:41:51 -0800 by baw