# Summary

During the epidemic, I came across the free public course from Boyu education by chance. The course covers deep learning very systematically, but what attracted me most was its teaching and use of PyTorch. My PyTorch skills were weak before, so reading the open-source code from NLP competitions was always a bit of a struggle. I started a small project of writing up notes on the 14-day course to consolidate what I learned. The reference book for the whole course is Dive into Deep Learning.
This first note covers linear regression, softmax, and multilayer perceptrons, and presents the principles and engineering practice of each algorithm from two perspectives: a from-scratch version and a library version.

# 1. Linear Regression

As a simple example, assume that the price of a house depends on only two factors: its area (square meters) and its age (years). We want to explore the specific relationship between the price and these two factors. Linear regression assumes that the output is a linear function of each input:
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
Next come the loss function and the optimization method. The sections below cover data generation, building the network, the loss function, and an SGD implementation that does not rely on library functions:

• Random Data Generation
This part shows how to generate random data from known parameters. Such data makes it easy to check whether a model works, because the trained weights should come out close to the true ones.
import random
import numpy as np
import torch
from matplotlib import pyplot as plt

# set the number of input features
num_inputs = 2
# set the number of examples
num_examples = 1000

# set the true weight and bias used to generate the corresponding labels
true_w = [2, -3.4]
true_b = 4.2

features = torch.randn(num_examples, num_inputs,
                       dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
# add Gaussian noise with standard deviation 0.01 to the labels
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)


Visualize randomly generated data

plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);


# read the dataset in shuffled mini-batches
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle so that samples are read in random order
    for i in range(0, num_examples, batch_size):
        # the last batch may be smaller than batch_size
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)
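To see how the iterator behaves, it can be run on a small toy dataset. The sketch below repeats the same `data_iter` so that it runs on its own; the 25-example tensors are made up purely for illustration.

```python
import random
import torch

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        yield features.index_select(0, j), labels.index_select(0, j)

# toy data (assumption): 25 examples with 2 features each
features = torch.arange(50, dtype=torch.float32).view(25, 2)
labels = torch.arange(25, dtype=torch.float32)

batch_sizes = [len(y) for X, y in data_iter(10, features, labels)]
print(batch_sizes)  # [10, 10, 5]

# every example appears exactly once per pass, only the order changes
seen = torch.cat([y for _, y in data_iter(10, features, labels)])
print(sorted(seen.tolist()) == labels.tolist())  # True
```

The last batch holds only 5 examples, which is why the slice is capped with `min(i + batch_size, num_examples)`.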

• From-scratch version (defining the model by hand)
1 Loss function
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
2 Optimization function (mini-batch stochastic gradient descent)
$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$
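A single update step can be checked by hand. The following is a minimal sketch, assuming a made-up mini-batch of 4 examples and a learning rate of 0.1. It compares the gradient that autograd computes for the summed squared loss with the closed form $\mathbf{X}^\top(\hat{\mathbf{y}} - \mathbf{y})$, and then applies the update rule once.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 2)                       # mini-batch: 4 examples, 2 features
y = torch.randn(4)                          # labels
w = torch.zeros(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1                                    # eta in the formula (illustrative value)

# summed per-example loss: sum_i 0.5 * (y_hat_i - y_i)^2
y_hat = torch.mm(X, w) + b                  # shape (4, 1)
l = ((y_hat - y.view(-1, 1)) ** 2 / 2).sum()
l.backward()

# closed-form gradients of the summed loss
residual = (y_hat - y.view(-1, 1)).detach()
grad_w_manual = torch.mm(X.t(), residual)   # shape (2, 1)
grad_b_manual = residual.sum().view(1)
print(torch.allclose(w.grad, grad_w_manual))  # True
print(torch.allclose(b.grad, grad_b_manual))  # True

# one update step: divide the summed gradient by the batch size |B|
with torch.no_grad():
    w_new = w - lr * w.grad / len(X)
    b_new = b - lr * b.grad / len(X)
```

Because the loss is summed over the batch, the division by the batch size in the update is what turns it into an average gradient.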
# Parameter initialization
w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)
b = torch.zeros(1, dtype=torch.float32)
# track gradients for both parameters so that backward() can fill in .grad
w.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)

# Define the network structure
def linreg(X, w, b):
    return torch.mm(X, w) + b

# Define the loss function
def squared_loss(y_hat, y):
    return (y_hat - y.view(y_hat.size())) ** 2 / 2

# Define the optimization function
def sgd(params, lr, batch_size):
    for param in params:
        # use .data to update the parameter without gradient tracking
        param.data -= lr * param.grad / batch_size

# model training
# hyperparameter initialization
lr = 0.03
num_epochs = 5
batch_size = 10

net = linreg
loss = squared_loss

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, every sample in the dataset is used once

    # X is the features and y the labels of one mini-batch
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # calculate the gradient of the mini-batch loss
        l.backward()
        # update the parameters with mini-batch stochastic gradient descent
        sgd([w, b], lr, batch_size)
        # reset the gradients, which would otherwise accumulate across batches
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))
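A useful sanity check after training is to compare the learned parameters with the true ones used to generate the data. The sketch below is a compact, self-contained variant of the loop above and not the course's exact code: it shuffles with `torch.randperm` and updates the parameters under `torch.no_grad()` instead of going through `.data`, and the seed is arbitrary.

```python
import torch

torch.manual_seed(0)
num_inputs, num_examples, batch_size = 2, 1000, 10
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2

features = torch.randn(num_examples, num_inputs)
labels = features @ true_w + true_b + 0.01 * torch.randn(num_examples)

w = (0.01 * torch.randn(num_inputs, 1)).requires_grad_()
b = torch.zeros(1, requires_grad=True)
lr, num_epochs = 0.03, 3

for epoch in range(num_epochs):
    perm = torch.randperm(num_examples)          # new random order each epoch
    for i in range(0, num_examples, batch_size):
        idx = perm[i:i + batch_size]
        X, y = features[idx], labels[idx]
        l = ((X @ w + b - y.view(-1, 1)) ** 2 / 2).sum()
        l.backward()
        with torch.no_grad():
            for param in (w, b):
                param -= lr * param.grad / batch_size
                param.grad.zero_()               # clear before the next batch

# the learned values should be close to true_w = [2, -3.4] and true_b = 4.2
print(w.detach().view(-1), b.item())
```

If the printed values are far from the true parameters, the usual suspects are a missing gradient reset or a learning rate that is too large.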

• Concise version
import torch
from torch import nn
import numpy as np
torch.manual_seed(1)

print(torch.__version__)
torch.set_default_tensor_type('torch.FloatTensor')

num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]
true_b = 4.2

features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

import torch.utils.data as Data

batch_size = 10

# combine the features and labels of the dataset
dataset = Data.TensorDataset(features, labels)

# put the dataset into a DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,            # torch TensorDataset format
    batch_size=batch_size,      # mini-batch size
    shuffle=True,               # whether to shuffle the data
)


Here are three ways to build the model:

# ways to initialize a multilayer network
# method one
net = nn.Sequential(
    nn.Linear(num_inputs, 1)
    # other layers can be added here
)

# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# further layers can be added with net.add_module(...)

# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
    ('linear', nn.Linear(num_inputs, 1))
    # ......
]))


# Initialize the model parameters
from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)
init.constant_(net[0].bias, val=0.0)  # or use net[0].bias.data.fill_(0) to modify it directly

# Define the loss function
loss = nn.MSELoss()    # built-in mean squared error loss
# function prototype: torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')

# Define the optimization function
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.03)   # built-in stochastic gradient descent
# function prototype: torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
# Train the network
num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad()  # reset the gradients, equivalent to net.zero_grad()
        l.backward()
        optimizer.step()
    print('epoch %d, loss: %f' % (epoch, l.item()))
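Two details of the concise version are easy to miss: how `nn.MSELoss` relates to the hand-written `squared_loss`, and why the labels are reshaped with `y.view(-1, 1)`. The sketch below uses random tensors as stand-ins for a model output and a batch of labels.

```python
import torch
from torch import nn

torch.manual_seed(0)
output = torch.randn(10, 1)   # stand-in for net(X), shape (batch, 1)
y = torch.randn(10)           # stand-in for the labels, shape (batch,)

# nn.MSELoss with reduction='mean' averages (y_hat - y)^2 with no factor 1/2,
# so it is twice the batch mean of the from-scratch squared loss
mse = nn.MSELoss()(output, y.view(-1, 1))
squared = ((output - y.view(-1, 1)) ** 2 / 2).mean()
print(torch.allclose(mse, 2 * squared))  # True

# without the reshape, broadcasting pairs every output with every label
print((output - y).shape)  # torch.Size([10, 10])
```

Since the built-in loss already averages over the batch, the optimizer step does not need the extra division by `batch_size` that the hand-written `sgd` performs.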


# 2. Softmax

This lesson mainly explains how to do classification with a linear model. This post mainly records some of the utilities in torch and torchvision.
Here we use the torchvision package, which serves the PyTorch deep learning framework and is mainly used to build computer vision models. torchvision consists mainly of the following components:

1. torchvision.datasets: functions for loading data and interfaces to common datasets;
2. torchvision.models: commonly used model architectures (including pretrained models), such as AlexNet, VGG, and ResNet;
3. torchvision.transforms: common image transformations, such as cropping and rotation;
4. torchvision.utils: other useful utilities.

The softmax formula:
$$\hat{y}_j = \frac{\exp(o_j)}{\sum_{i=1}^3 \exp(o_i)}$$
$$\begin{aligned} \boldsymbol{o}^{(i)} &= \boldsymbol{x}^{(i)} \boldsymbol{W} + \boldsymbol{b},\\ \boldsymbol{\hat{y}}^{(i)} &= \text{softmax}(\boldsymbol{o}^{(i)}). \end{aligned}$$

def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    # print("X size is ", X_exp.size())
    # print("partition size is ", partition, partition.size())
    return X_exp / partition  # broadcasting is applied here

def net(X):
    return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)

# loss function
def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))

# accuracy
def accuracy(y_hat, y):
    return (y_hat.argmax(dim=1) == y).float().mean().item()
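The functions above can be checked against PyTorch's built-ins on a few random logits. The sketch below also adds `stable_softmax`, a hypothetical variant that is not part of the course code: it subtracts the row maximum before exponentiating, a common way to avoid overflow for large logits.

```python
import torch
import torch.nn.functional as F

def softmax(X):
    X_exp = X.exp()
    return X_exp / X_exp.sum(dim=1, keepdim=True)

def stable_softmax(X):
    # subtracting the row maximum leaves the result unchanged but avoids overflow
    shifted = X - X.max(dim=1, keepdim=True).values
    X_exp = shifted.exp()
    return X_exp / X_exp.sum(dim=1, keepdim=True)

def cross_entropy(y_hat, y):
    return -torch.log(y_hat.gather(1, y.view(-1, 1)))

torch.manual_seed(0)
logits = torch.randn(5, 3)            # 5 examples, 3 classes
y = torch.tensor([0, 2, 1, 1, 0])     # made-up class labels

probs = softmax(logits)
print(probs.sum(dim=1))               # every row sums to 1

# softmax followed by the gather-based loss matches F.cross_entropy on raw logits
loss = cross_entropy(probs, y).mean()
print(torch.allclose(loss, F.cross_entropy(logits, y)))  # True

big = torch.tensor([[1000.0, 0.0, -1000.0]])
print(softmax(big))                   # contains nan because exp(1000) overflows
print(stable_softmax(big))            # tensor([[1., 0., 0.]])
```

In practice this is one reason library losses such as `F.cross_entropy` take raw logits instead of probabilities.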


A few small points worth knowing:

• torch.mm is matrix multiplication and torch.mul is element-wise multiplication. x.view reshapes a tensor much like reshape, but it never copies data and therefore requires a compatible memory layout.

• The gather API:
torch.gather(input, dim, index, out=None) → Tensor
It picks values from input along the axis dim at the positions given by index. The output has the same shape as index.

• x.item() returns the value of a single-element tensor as a Python number.
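The points above are easiest to see on tiny tensors. The values in this sketch are made up for illustration.

```python
import torch

A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
print(torch.mm(A, B))    # matrix product: [[19., 22.], [43., 50.]]
print(torch.mul(A, B))   # element-wise product: [[5., 12.], [21., 32.]]

# gather along dim=1: for each row, pick the probability of the true class
y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = torch.tensor([0, 2])
picked = y_hat.gather(1, y.view(-1, 1))
print(picked)            # tensor([[0.1000], [0.5000]])
print(picked.shape)      # same shape as the index: torch.Size([2, 1])

# item() turns a single-element tensor into a plain Python float
value = picked[0, 0].item()
print(type(value))       # <class 'float'>
```

This is exactly the pattern `cross_entropy` relies on: `y.view(-1, 1)` gives the index one column per row, so `gather` returns one predicted probability per example.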

# 3. Multilayer Perceptron

3.1 Activation Function
If the output of the hidden layer is used directly as the input of the output layer, we obtain
$$\boldsymbol{O} = (\boldsymbol{X} \boldsymbol{W}_h + \boldsymbol{b}_h)\boldsymbol{W}_o + \boldsymbol{b}_o = \boldsymbol{X} \boldsymbol{W}_h\boldsymbol{W}_o + \boldsymbol{b}_h \boldsymbol{W}_o + \boldsymbol{b}_o.$$
The formula shows that although the network introduces a hidden layer, it is still equivalent to a single-layer neural network whose output-layer weight is $\boldsymbol{W}_h\boldsymbol{W}_o$ and whose bias is $\boldsymbol{b}_h \boldsymbol{W}_o + \boldsymbol{b}_o$. Adding more hidden layers does not help: such a design is always equivalent to a single-layer network that has only an output layer.

The root of the problem is that a fully connected layer only applies an affine transformation to the data, and a composition of affine transformations is still an affine transformation. One way to solve this is to introduce a nonlinear transformation: apply an element-wise nonlinear function to the hidden variables and then feed the result into the next fully connected layer. This nonlinear function is called an activation function.
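The collapse of stacked affine layers can be verified numerically. In this sketch the layer sizes (4 inputs, 3 hidden units, 2 outputs) and the random weights are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 4)
W_h, b_h = torch.randn(4, 3), torch.randn(3)
W_o, b_o = torch.randn(3, 2), torch.randn(2)

# two affine layers with no activation in between
two_layers = (X @ W_h + b_h) @ W_o + b_o
# a single layer with weight W_h W_o and bias b_h W_o + b_o
one_layer = X @ (W_h @ W_o) + (b_h @ W_o + b_o)
same_without_activation = torch.allclose(two_layers, one_layer, atol=1e-5)
print(same_without_activation)

# with ReLU on the hidden layer, the network no longer reduces to that single layer
with_relu = torch.relu(X @ W_h + b_h) @ W_o + b_o
same_with_activation = torch.allclose(with_relu, one_layer, atol=1e-5)
print(same_with_activation)
```

The first comparison holds up to floating-point error, while the second fails as soon as any hidden unit is negative and gets clipped by ReLU.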

Here we focus on the derivative and the choice of activation functions. Why the code calls y.sum().backward() is still a little unclear to me, and I hope to understand it better in the future.

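• Derivative
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = x.relu()
xyplot(x, y, 'relu')  # xyplot is a plotting helper that is not defined in this note
y.sum().backward()    # fills x.grad with the derivative of relu at every point of x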
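As far as I can tell, the reason for `y.sum().backward()` is that `backward()` without arguments only works on a scalar, while `y` is a vector. Each `y[i]` depends only on `x[i]`, so the gradient of the sum with respect to `x[i]` is just the derivative of relu at `x[i]`. The sketch below checks this on four hand-picked points and shows the equivalent call with an explicit vector of ones.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)

# summing turns the vector y into a scalar that backward() accepts
y = x.relu()
y.sum().backward()
grad_from_sum = x.grad.clone()
print(grad_from_sum)  # tensor([0., 0., 1., 1.])

# equivalent: pass a vector of ones as the upstream gradient
x.grad.zero_()
y2 = x.relu()
y2.backward(torch.ones_like(y2))
print(torch.equal(grad_from_sum, x.grad))  # True
```

The result is 0 where the input is negative and 1 where it is positive, which is the derivative of relu.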

• Choice
ReLU is a general-purpose activation function and is currently the default in most cases. However, ReLU can only be used in hidden layers. For classifiers, the sigmoid function and its combinations usually work better. Because of the vanishing gradient problem, sigmoid and tanh are sometimes avoided. When the network has many layers, ReLU is preferable: it is simpler and cheaper to compute, whereas sigmoid and tanh are considerably more expensive. When choosing an activation function, try ReLU first, and try the others only if the results are poor.
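The vanishing gradient issue can be made concrete by evaluating the derivative of each activation at a small and a large input. `grad_at` is a hypothetical helper written for this sketch, and the inputs 0.5 and 10.0 are arbitrary.

```python
import torch

def grad_at(fn, value):
    # derivative of the activation fn at a single point, computed by autograd
    x = torch.tensor(value, requires_grad=True)
    fn(x).backward()
    return x.grad.item()

for name, fn in [('sigmoid', torch.sigmoid), ('tanh', torch.tanh), ('relu', torch.relu)]:
    print(name, grad_at(fn, 0.5), grad_at(fn, 10.0))
```

The sigmoid derivative never exceeds 0.25 (its value at 0), and both sigmoid and tanh have almost no gradient left at an input of 10, while relu keeps a gradient of 1 for any positive input. Multiplying many such small factors across layers is what makes the gradient vanish in deep networks.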

3.2 Build from scratch

num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_hiddens)), dtype=torch.float)
b1 = torch.zeros(num_hiddens, dtype=torch.float)
W2 = torch.tensor(np.random.normal(0, 0.01, (num_hiddens, num_outputs)), dtype=torch.float)
b2 = torch.zeros(num_outputs, dtype=torch.float)
params = [W1, b1, W2, b2]

for param in params:
    param.requires_grad_(requires_grad=True)

def relu(X):
    return torch.max(X, torch.zeros_like(X))
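With the parameters and the activation in place, the forward pass of the perceptron is a single hidden layer followed by an output layer. The sketch below is my reconstruction under stated assumptions, not the course's exact code: `net` is a name I chose, the input is assumed to be a batch of 28×28 images (784 = 28 × 28), and relu is written with `torch.max` against a zero tensor.

```python
import numpy as np
import torch

num_inputs, num_outputs, num_hiddens = 784, 10, 256

W1 = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_hiddens)), dtype=torch.float)
b1 = torch.zeros(num_hiddens, dtype=torch.float)
W2 = torch.tensor(np.random.normal(0, 0.01, (num_hiddens, num_outputs)), dtype=torch.float)
b2 = torch.zeros(num_outputs, dtype=torch.float)
params = [W1, b1, W2, b2]
for param in params:
    param.requires_grad_(requires_grad=True)

def relu(X):
    # element-wise max(x, 0)
    return torch.max(X, torch.zeros_like(X))

def net(X):
    X = X.view((-1, num_inputs))             # flatten each image into a vector
    H = relu(torch.matmul(X, W1) + b1)       # hidden layer with ReLU
    return torch.matmul(H, W2) + b2          # output layer (raw scores per class)

X = torch.rand(4, 1, 28, 28)                 # assumed input: 4 fake grayscale images
out = net(X)
print(out.shape)                             # torch.Size([4, 10])

out.sum().backward()                         # gradients reach every parameter
print(W1.grad.shape)                         # torch.Size([784, 256])
```

The output holds raw scores, so it would be paired with a softmax-based loss like the one from the previous section.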