Linear Regression & Softmax Classification Model & Multi-Layer Perceptron

Linear regression: after-class questions

  1. Fully connected layer and I/O shape

If you are implementing a fully connected layer whose input shape is 7 × 8 and whose output shape is 7 × 1, where 7 is the batch size, what are the shapes of the weight parameter W and the bias parameter b, respectively? ____

So the weight parameter $\boldsymbol{W} \in \mathbb{R}^{8 \times 1}$ and the bias parameter $b \in \mathbb{R}^{1 \times 1}$ (the shapes of the parameters have nothing to do with the batch size).
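A quick shape check of this answer, written as a minimal sketch with plain tensor algebra (the random values and variable names are only illustrative; the 7 × 8 input comes from the question):

import torch

X = torch.randn(7, 8)      # batch of 7 samples, 8 features each
W = torch.randn(8, 1)      # weight: (num_inputs, num_outputs)
b = torch.randn(1, 1)      # bias: its shape does not depend on the batch size

Y = torch.mm(X, W) + b     # b is broadcast over the batch dimension
print(Y.shape)             # torch.Size([7, 1])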

  2. Broadcasting semantics
# The return values of .shape and .size() are the same
>>> x = torch.randn(4)
>>> x
tensor([-0.0233, -0.4144, -0.5163, -0.8312])
>>> x.shape
torch.Size([4])
>>> x.size()
torch.Size([4])

# y is a 1-D tensor of length n
>>> y = torch.randn(4)
>>> y
tensor([-1.4736, -0.7209,  0.6472,  1.0759])

# y_hat is n x 1
>>> y_hat = torch.randn(4, 1)
>>> y_hat
tensor([[1.2142],
        [1.7063],
        [0.4449],
        [0.9576]])

>>> y_hat.view(-1).shape        # option A is correct: y_hat flattened to a length-4 vector
torch.Size([4])
>>> y.view(-1).shape            # option B is wrong (it claims 4x1 for the former and 1x4 for the latter)
torch.Size([4])
>>> y.view(y_hat.shape).shape   # option C is correct
torch.Size([4, 1])
>>> y.view(-1, 1).shape         # option D is correct
torch.Size([4, 1])
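The shape of y matters because of broadcasting: if y_hat is n × 1 and y is a length-n vector, y_hat − y is broadcast to an n × n matrix instead of the intended elementwise differences. A minimal illustration, continuing the session above:

>>> (y_hat - y).shape              # broadcasting expands both operands to 4 x 4
torch.Size([4, 4])
>>> (y_hat - y.view(-1, 1)).shape  # matching shapes give the intended elementwise result
torch.Size([4, 1])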
  3. In the linear regression model, for a batch of size 3, the predicted and real values of the label are shown in the following table:
$\hat{y}$    $y$
2.33         3.14
1.07         0.98
1.23         1.32

The average value of the squared loss over this batch is approximately 0.112.

# Use the functions in the tutorial to calculate
>>> import torch
>>> y=torch.tensor([3.14, 0.98, 1.32])
>>> y_hat=torch.tensor([2.33, 1.07, 1.23])
>>> def squared_loss(y_hat, y):
...     return (y_hat - y.view(y_hat.size())) ** 2 / 2
... 
>>> loss=squared_loss(y_hat, y)
>>> loss
tensor([0.3281, 0.0041, 0.0041])
>>> loss.mean()
tensor(0.1121)

The basic concept of Softmax regression


Softmax regression is a single-layer neural network, and its output layer is a fully connected layer.
The calculation of each output $o_1, o_2, o_3$ depends on all inputs $x_1, x_2, x_3, x_4$.

Transform the output values into a probability distribution whose entries are positive and sum to 1:
$$\text{softmax}(\boldsymbol{o}) = \frac{\exp(\boldsymbol{o})}{\sum_i \exp(o_i)}$$
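A quick numeric check with illustrative values, using PyTorch's built-in softmax:

>>> import torch
>>> torch.softmax(torch.tensor([1.0, 2.0, 3.0]), dim=0)
tensor([0.0900, 0.2447, 0.6652])

The entries are positive and sum to 1, as required.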

Softmax matrix operations for a single sample and a mini-batch

  • Single sample $i$: $\boldsymbol{o}^{(i)} = \boldsymbol{x}^{(i)} \boldsymbol{W} + \boldsymbol{b}, \quad \boldsymbol{\hat{y}}^{(i)} = \text{softmax}(\boldsymbol{o}^{(i)})$
  • Weight and bias parameters:
    $\boldsymbol{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix},\quad \boldsymbol{b} = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix}$
  • Features, outputs, and probability distribution of sample $i$:
    $\boldsymbol{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix},$
    $\boldsymbol{o}^{(i)} = \begin{bmatrix} o_1^{(i)} & o_2^{(i)} & o_3^{(i)} \end{bmatrix},$
    $\boldsymbol{\hat{y}}^{(i)} = \begin{bmatrix} \hat{y}_1^{(i)} & \hat{y}_2^{(i)} & \hat{y}_3^{(i)} \end{bmatrix}.$
  • Mini-batch of $n$ samples: $\boldsymbol{O} = \boldsymbol{X}\boldsymbol{W} + \boldsymbol{b}, \quad \boldsymbol{\hat{Y}} = \text{softmax}(\boldsymbol{O})$,
    where row $i$ of $\boldsymbol{O}, \boldsymbol{\hat{Y}} \in \mathbb{R}^{n \times 3}$ is the output $\boldsymbol{o}^{(i)}$ and the probability distribution $\boldsymbol{\hat{y}}^{(i)}$ of sample $i$ (a shape check in code follows this list).
  • Batch features: $\boldsymbol{X} \in \mathbb{R}^{n \times 4}$
  • Weight parameter: $\boldsymbol{W} \in \mathbb{R}^{4 \times 3}$
  • Bias parameter: $\boldsymbol{b} \in \mathbb{R}^{1 \times 3}$
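A minimal shape check of the mini-batch computation above (the random values are illustrative, and the batch size n = 5 is an arbitrary choice):

import torch

n = 5                               # arbitrary batch size
X = torch.randn(n, 4)               # batch features, one row per sample
W = torch.randn(4, 3)               # weight parameter
b = torch.zeros(1, 3)               # bias parameter, broadcast over the batch

O = torch.mm(X, W) + b              # outputs, shape (n, 3)
Y_hat = torch.softmax(O, dim=1)     # row-wise softmax: each row is a probability distribution
print(O.shape, Y_hat.shape)         # torch.Size([5, 3]) torch.Size([5, 3])
print(Y_hat.sum(dim=1))             # each row sums to 1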

For sample $i$, the function that measures the difference between two probability distributions is the cross-entropy:

  • $H\left(\boldsymbol{y}^{(i)}, \boldsymbol{\hat{y}}^{(i)}\right) = -\sum_{j=1}^q y_j^{(i)} \log \hat{y}_j^{(i)}$
  • Real label probability distribution: $\boldsymbol{y}^{(i)} = \{y_1^{(i)}, y_2^{(i)}, \dots, y_j^{(i)}, \dots\} \in \mathbb{R}^{3}$
  • Predicted probability distribution: $\boldsymbol{\hat{y}}^{(i)} = \{\hat{y}_1^{(i)}, \dots, \hat{y}_j^{(i)}, \dots\}$

If the true label is a one-hot (0-1) distribution,

  • $\boldsymbol{y}^{(i)} = \{0, 0, \dots, 1, 0, \dots\}$ (only the element at index $y^{(i)}$, the true class, equals 1)
  • $H\left(\boldsymbol{y}^{(i)}, \boldsymbol{\hat{y}}^{(i)}\right) = -\log \hat{y}_{y^{(i)}}^{(i)}$

When the number of training samples is $n$, the loss function is defined as:

  • $\ell(\boldsymbol{\Theta}) = -\frac{1}{n} \sum_{i=1}^n \log \hat{y}_{y^{(i)}}^{(i)}$
  • Minimizing $\ell(\boldsymbol{\Theta})$ is equivalent to maximizing $\exp(-n\ell(\boldsymbol{\Theta})) = \prod_{i=1}^n \hat{y}_{y^{(i)}}^{(i)}$
    (the joint predicted probability of all labels in the training dataset); see the one-line derivation below.
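The equivalence follows directly from the definition of $\ell(\boldsymbol{\Theta})$ and the monotonicity of $\exp$:

$$\exp\bigl(-n\,\ell(\boldsymbol{\Theta})\bigr) = \exp\left(\sum_{i=1}^n \log \hat{y}_{y^{(i)}}^{(i)}\right) = \prod_{i=1}^n \hat{y}_{y^{(i)}}^{(i)},$$

so a smaller loss corresponds to a larger joint likelihood of the observed labels.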

Fashion-MNIST training set acquisition

# import needed package
%matplotlib inline
from IPython import display
import matplotlib.pyplot as plt

import torch
import torchvision
import torchvision.transforms as transforms
import time

import sys
sys.path.append("/home/kesci/input") #Catalog? Loading d2l 
import d2lzh1981 as d2l
# download trainset and testset
mnist_train = torchvision.datasets.FashionMNIST(root='/home/kesci/input/FashionMNIST2065', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.FashionMNIST(root='/home/kesci/input/FashionMNIST2065', train=False, download=True, transform=transforms.ToTensor())
# API explanation
mnist_train = torchvision.datasets.FashionMNIST(
	root = '~/...', 
	train = True, 
	transform = None,
	target_transform = None, 
	download = False
	)
## root (string) - the root directory of the dataset, which holds the processed/training.pt and processed/test.pt files.
## train (bool, optional) - if set to True, create the dataset from training.pt, otherwise from test.pt.
## download (bool, optional) - if set to True, download the data from the Internet and place it in the root directory. If the data already exists there, it is not downloaded again.
## transform (callable, optional) - a function or transform that takes a PIL image and returns the transformed data, e.g. transforms.RandomCrop.
## target_transform (callable, optional) - a function or transform that takes a target and transforms it.
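A quick sanity check on the loaded datasets (assuming the download above succeeded; the printed sizes are the standard Fashion-MNIST split sizes):

print(len(mnist_train), len(mnist_test))   # 60000 10000
feature, label = mnist_train[0]            # indexing returns an (image, label) pair
print(feature.shape, label)                # torch.Size([1, 28, 28]), i.e. C x H x W after ToTensor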

Softmax regression from scratch

import torch
import torchvision
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
  • Cross-entropy loss function
def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))

The torch.Tensor.gather() method: x.gather(dim, index) picks elements along dimension dim (dim=1 means along each row) at the positions given by index.

y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
# predicted label probability distributions for sample 1 and sample 2
y = torch.LongTensor([0, 2])
# Sample 1 real label: 0, sample 2 real label: 2

y_hat.gather(1, y.view(-1, 1))
# Returns the predicted probability of the true label of each sample: sample 1: 0.1, sample 2: 0.5
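The expected results of these calls (the gather call returns the probability assigned to each true label, and cross_entropy is its negative log):

y_hat.gather(1, y.view(-1, 1))   # tensor([[0.1000],
                                 #         [0.5000]])
cross_entropy(y_hat, y)          # tensor([[2.3026],   # -log(0.1)
                                 #         [0.6931]])  # -log(0.5)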
  • Get training and test sets
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')
  • Softmax operation and regression model
def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    # print("X size is ", X_exp.size())
    # print("partition size is ", partition, partition.size())
    return X_exp / partition  # The broadcast mechanism is applied here

def net(X):
    return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)
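The model above uses num_inputs, W and b, which are initialized elsewhere in the tutorial. A minimal sketch of that initialization, following the conventions used here (784 = 28 × 28 flattened pixels, 10 output classes):

num_inputs = 784                     # 28 x 28 pixels flattened into one vector
num_outputs = 10                     # 10 Fashion-MNIST classes

W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)), dtype=torch.float)
b = torch.zeros(num_outputs, dtype=torch.float)

# Gradients are needed so that SGD can update the parameters
W.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)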
  • Training model
def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, params=None, lr=None, optimizer=None):
...
# Default optimizer: mini-batch stochastic gradient descent, d2l.sgd(params, lr, batch_size)

# Usage: loss = cross_entropy, params = [W, b]
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, batch_size, [W, b], lr)
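num_epochs and lr are hyperparameters set elsewhere in the tutorial (for instance num_epochs, lr = 5, 0.1). A minimal sketch of what a mini-batch SGD update like d2l.sgd performs, assuming the standard d2l-style implementation:

def sgd(params, lr, batch_size):
    # Move each parameter in the direction of the negative gradient,
    # averaging the accumulated gradient over the mini-batch
    for param in params:
        param.data -= lr * param.grad / batch_size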