PyTorch Deep Convolutional Generative Adversarial Network (DCGAN) Tutorial

To read the illustrated tutorial, go to http://studyai.com/pytorch-1.4/beginner/dcgan_faces_tutorial.html

This tutorial gives an introduction to DCGANs through an example. We will train a generative adversarial network (GAN) to generate new celebrities after showing it pictures of many real celebrities. Most of the code here comes from the DCGAN implementation in pytorch/examples, and this document will explain the implementation in detail and shed light on how and why this model works. But don't worry, no prior knowledge of GANs is required, although a first-timer may need to spend some time reasoning about what is actually happening under the hood. Also, for the sake of time it will help to have a GPU, or two. Let's start from the beginning.

Generative Adversarial Networks

What is a GAN?

GANs are a framework for teaching a DL model to capture the training data's distribution so we can generate new data from that same distribution. GANs were invented by Ian Goodfellow in 2014 and first described in the paper Generative Adversarial Nets. They are made of two distinct models: a generator and a discriminator. The job of the generator is to produce "fake" images that look like the training images. The job of the discriminator is to look at an image and output whether it is a real training image or a fake image from the generator. During training, the generator constantly tries to outsmart the discriminator by generating better and better fakes, while the discriminator works to become a better detective and correctly classify the real and fake images. The equilibrium of this game is when the generator is producing perfect fakes that look as if they came directly from the training data, and the discriminator is left to always guess at 50% confidence whether the generator's output is real or fake.

Now, let's define some notation to be used throughout the tutorial, starting with the discriminator. Let $x$ be data representing an image. $D(x)$ is the discriminator network which outputs the (scalar) probability that $x$ came from the training data rather than the generator. Here, since we are dealing with images, the input to $D(x)$ is an image of CHW size 3x64x64. Intuitively, $D(x)$ should be high when $x$ comes from the training data and low when $x$ comes from the generator. $D(x)$ can also be thought of as a traditional binary classifier.

For the generator's notation, let $z$ be a latent space vector sampled from a standard normal distribution. $G(z)$ represents the generator function which maps the latent vector $z$ to data-space. The goal of $G$ is to estimate the distribution that the training data comes from ($p_{data}$), so it can generate fake samples from its estimated distribution ($p_g$).

So, $D(G(z))$ is the probability (scalar) that the output of the generator $G$ is a real image. As described in Goodfellow's paper, $D$ and $G$ play a minimax game in which $D$ tries to maximize the probability that it correctly classifies reals and fakes ($\log D(x)$), and $G$ tries to minimize the probability that $D$ will predict its outputs are fake ($\log(1-D(G(z)))$). The GAN loss function is

$$\min_G \max_D V(D,G) = \mathbb{E}_{x\sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_z(z)}\big[\log(1-D(G(z)))\big]$$

In theory, the solution to this minimax game is $p_g = p_{data}$, at which point the discriminator can do no better than guess randomly whether its input is real or fake. However, the convergence theory of GANs is still being actively researched, and in reality models do not always train to this point.
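For some intuition about why random guessing is the equilibrium, recall from Goodfellow's paper that for a fixed generator the inner maximization has a closed-form solution:

$$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

so when $p_g = p_{data}$, the optimal discriminator outputs $D^*_G(x) = 1/2$ for every input, which is exactly the 50% confidence described above.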

What is a DCGAN?

DCGAN is a direct extension of the GAN described above, except that it explicitly uses convolutional and convolutional-transpose layers in the discriminator and generator, respectively. It was first described by Radford et al. in the paper Unsupervised Representation Learning With Deep Convolutional Generative Adversarial Networks. The discriminator is made up of strided convolution layers, batch norm layers, and LeakyReLU activations. The input is a 3x64x64 image and the output is a scalar probability that the input is from the real data distribution. The generator is composed of convolutional-transpose layers, batch norm layers, and ReLU activations. The input is a latent vector $z$ drawn from a standard normal distribution, and the output is a 3x64x64 RGB image. The strided conv-transpose layers allow the latent vector to be transformed into a volume with the same shape as an image. The authors also give some tips about how to set up the optimizers, how to calculate the loss functions, and how to initialize the model weights, all of which will be explained in the coming sections.
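As a quick, standalone sanity check (my own sketch, not part of the tutorial code): with dilation 1 and no output padding, the output spatial size of a ConvTranspose2d layer is $H_{out} = (H_{in}-1)\cdot \text{stride} - 2\cdot \text{padding} + \text{kernel\_size}$, so the 4x4 kernel with stride 2 and padding 1 used throughout this tutorial exactly doubles the spatial size:

import torch
import torch.nn as nn

# One conv-transpose layer with the kernel/stride/padding used in this tutorial
up = nn.ConvTranspose2d(in_channels=512, out_channels=256,
                        kernel_size=4, stride=2, padding=1, bias=False)
x = torch.randn(1, 512, 4, 4)   # e.g. the (ngf*8) x 4 x 4 intermediate state
print(up(x).shape)              # torch.Size([1, 256, 8, 8]): (4-1)*2 - 2*1 + 4 = 8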

from __future__ import print_function
#%matplotlib inline
import argparse
import os
import random
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.optim as optim
import torch.utils.data
import torchvision.datasets as dset
import torchvision.transforms as transforms
import torchvision.utils as vutils
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML

# Set random seed for reproducibility
manualSeed = 999
#manualSeed = random.randint(1, 10000) # use if you want new results
print("Random Seed: ", manualSeed)
random.seed(manualSeed)
torch.manual_seed(manualSeed)

Inputs

Let's define some inputs first:

dataroot - the path to the root of the dataset folder. We will talk more about the dataset in the next section.
workers - the number of worker threads used to load the data with the DataLoader.
batch_size - the batch size used in training. The DCGAN paper uses a batch size of 128.
image_size - the spatial size of the images used for training. This implementation defaults to 64x64. If another size is desired, the structures of D and G must be changed. See here for more details.
nc - number of color channels in the input images. For color images this is 3.
nz - length of the latent vector.
ngf - relates to the depth of feature maps carried through the generator.
ndf - sets the depth of feature maps propagated through the discriminator.
num_epochs - number of training epochs to run. Training for longer will probably lead to better results but will also take much longer.
lr - learning rate for training. As recommended in the DCGAN paper, this number should be 0.0002.
beta1 - beta1 hyperparameter for the Adam optimizers. As recommended in the DCGAN paper, this number should be 0.5.
ngpu - number of GPUs available. If this is 0, the code will run in CPU mode. If you have multiple GPUs, they can speed up computation.
# Root directory for dataset
dataroot = "data/celeba"

# Number of workers for dataloader
workers = 2

# Batch size during training
batch_size = 128

# Spatial size of training images. All images will be resized to this
#   size using a transformer.
image_size = 64

# Number of channels in the training images. For color images this is 3
nc = 3

# Size of z latent vector (i.e. size of generator input)
nz = 100

# Size of feature maps in generator
ngf = 64

# Size of feature maps in discriminator
ndf = 64

# Number of training epochs
num_epochs = 5

# Learning rate for optimizers
lr = 0.0002

# Beta1 hyperparam for Adam optimizers
beta1 = 0.5

# Number of GPUs available. Use 0 for CPU mode.
ngpu = 1

Data

In this tutorial we will use the Celeb-A Faces dataset, which can be downloaded at the linked site or from Google Drive. The dataset will download as a file named img_align_celeba.zip. Once downloaded, create a directory named celeba and extract the zip file into that directory. Then, set the dataroot input for this notebook to the celeba directory you just created. The resulting directory structure should be:

/path/to/celeba
    -> img_align_celeba
        -> 188242.jpg
        -> 173822.jpg
        -> 284702.jpg
        -> 537394.jpg
           ...

This is an important step because we will be using the ImageFolder dataset class, which requires there to be subdirectories in the dataset's root folder. Now, we can create the dataset, create the dataloader, set the device to run on, and finally visualize some of the training data.

# We can use an image folder dataset the way we have it setup.
# Create the dataset
dataset = dset.ImageFolder(root=dataroot,
                           transform=transforms.Compose([
                               transforms.Resize(image_size),
                               transforms.CenterCrop(image_size),
                               transforms.ToTensor(),
                               transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                           ]))
# Create the dataloader
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         shuffle=True, num_workers=workers)

# Decide which device we want to run on
device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")

# Plot some training images
real_batch = next(iter(dataloader))
plt.figure(figsize=(8,8))
plt.axis("off")
plt.title("Training Images")
plt.imshow(np.transpose(vutils.make_grid(real_batch[0].to(device)[:64], padding=2, normalize=True).cpu(),(1,2,0)))

Implementation

With our input parameters set and the dataset prepared, we can now get into the implementation. We will start with the weight initialization strategy, then discuss the generator, discriminator, loss functions, and training loop in detail.

Weight Initialization

From the DCGAN paper, the authors specify that all model weights should be randomly initialized from a normal distribution with mean=0, stdev=0.02. The weights_init function takes an initialized model as input and reinitializes all convolutional, convolutional-transpose, and batch normalization layers to meet this criterion. This function is applied to the models immediately after initialization.

# custom weights initialization called on netG and netD
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0)

Generator

The generator, $G$, is designed to map the latent space vector ($z$) to data-space. Since our data are images, converting $z$ to data-space means ultimately creating an RGB image with the same size as the training images (i.e. 3x64x64). In practice, this is accomplished through a series of strided two-dimensional convolutional-transpose layers, each paired with a 2d batch norm layer and a ReLU activation. The output of the generator is fed through a tanh function, squashing its values to the range $[-1,1]$. It is worth noting the placement of the batch norm functions after the conv-transpose layers, as this is a critical contribution of the DCGAN paper. These layers help with the flow of gradients during training. The structure of the generator from the DCGAN paper is shown in the figure dcgan_generator.

Notice how the inputs we set in the inputs section (nz, ngf, and nc) influence the generator architecture in code. nz is the length of the z input vector, ngf relates to the size of the feature maps propagated through the generator, and nc is the number of channels in the output image (set to 3 for RGB images). Below is the code for the generator.

# Generator Code

class Generator(nn.Module):
    def __init__(self, ngpu):
        super(Generator, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d( nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf*8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf*4) x 8 x 8
            nn.ConvTranspose2d( ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            nn.ConvTranspose2d( ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d( ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

    def forward(self, input):
        return self.main(input)

Now we can instantiate the generator and apply the weights_init function. Check out the printed model to see how the generator object is structured.

# Create generator object
netG = Generator(ngpu).to(device)

# Handle multi-gpu if desired
if (device.type == 'cuda') and (ngpu > 1):
    netG = nn.DataParallel(netG, list(range(ngpu)))

# Apply the weights_init function to randomly initialize all weights
#   to mean=0, stdev=0.02
netG.apply(weights_init)

# Print the model
print(netG)

Discriminator

As mentioned above, the discriminator, $D$, is a binary classification network that takes an image as input and outputs a scalar probability that the input image is real (as opposed to fake). Here, $D$ takes a 3x64x64 input image, processes it through a series of Conv2d, BatchNorm2d, and LeakyReLU layers, and outputs the final probability through a sigmoid activation function. This architecture can be extended with more layers if necessary, but there is significance to the use of the strided convolution, BatchNorm, and LeakyReLU. The DCGAN paper mentions it is good practice to use strided convolution rather than pooling to downsample, because it lets the network learn its own pooling function. Also, the batch norm and leaky relu functions promote healthy gradient flow, which is critical for the learning process of both $G$ and $D$.

Discriminator Code

class Discriminator(nn.Module):
    def __init__(self, ngpu):
        super(Discriminator, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)

Now, as with the generator, we can create the discriminator, apply the weights_init function, and print the model's structure.

# Create Discriminator
netD = Discriminator(ngpu).to(device)

# Handle multi-gpu if desired
if (device.type == 'cuda') and (ngpu > 1):
    netD = nn.DataParallel(netD, list(range(ngpu)))

# Apply the weights_init function to randomly initialize all weights
#   to mean=0, stdev=0.02
netD.apply(weights_init)

# Print the model
print(netD)

Loss Functions and Optimizers

With $D$ and $G$ set up, we can specify how they learn through the loss functions and optimizers. We will use the Binary Cross Entropy loss (BCELoss) function, which is defined in PyTorch as:

$$\ell(x,y) = L = \{l_1,\dots,l_N\}^\top, \qquad l_n = -\big[y_n \cdot \log x_n + (1-y_n)\cdot \log(1-x_n)\big]$$

Notice how this function provides the calculation of both log components in the objective function (i.e. $\log(D(x))$ and $\log(1-D(G(z)))$). We can specify which part of the BCE equation to use with the $y$ input. This is accomplished in the training loop, which is covered later, but it is important to understand how we can choose which component we wish to calculate just by changing $y$ (i.e. the ground-truth labels).
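As a quick standalone illustration of this label trick (my own example, not from the tutorial), note how the target tensor selects which logarithm term BCELoss computes:

import torch
import torch.nn as nn

criterion = nn.BCELoss()
output = torch.tensor([0.9, 0.8])   # hypothetical discriminator outputs D(.)

# y = 1 selects the -log(x) term: the loss is -log(D(.))
loss_real_target = criterion(output, torch.ones_like(output))

# y = 0 selects the -log(1 - x) term: the loss is -log(1 - D(.))
loss_fake_target = criterion(output, torch.zeros_like(output))

print(loss_real_target.item(), loss_fake_target.item())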

Next, we define our real label as 1 and the fake label as 0. These labels will be used when calculating the losses of $D$ and $G$, and this is also the convention used in the original GAN paper. Finally, we set up two separate optimizers, one for $D$ and one for $G$. As specified in the DCGAN paper, both are Adam optimizers with learning rate 0.0002 and beta1 = 0.5. To keep track of the generator's learning progression, we will generate a fixed batch of latent vectors drawn from a Gaussian distribution (i.e. fixed_noise). In the training loop, we will periodically input this fixed_noise into $G$, and over the iterations we will see images form out of the noise.

# Initialize BCELoss function
criterion = nn.BCELoss()

# Create a batch of latent vectors that we will use to visualize
#   the progression of the generator
fixed_noise = torch.randn(64, nz, 1, 1, device=device)

# Establish convention for real and fake labels during training
real_label = 1
fake_label = 0

# Set Adam optimizers for G and D
optimizerD = optim.Adam(netD.parameters(), lr=lr, betas=(beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr=lr, betas=(beta1, 0.999))

Training

Finally, now that we have all of the parts of the GAN framework defined, we can train it. Be mindful that training GANs is somewhat of an art form, as incorrect hyperparameter settings lead to mode collapse with little explanation of what went wrong. Here, we will closely follow Algorithm 1 from Goodfellow's paper, while abiding by some of the best practices shown in ganhacks. Namely, we will "construct different mini-batches for real and fake" images, and also adjust G's objective function to maximize $\log D(G(z))$. Training is split up into two main parts: Part 1 updates the discriminator and Part 2 updates the generator.

**Part 1 - Train the Discriminator**

Recall, the goal of training the discriminator is to maximize the probability of correctly classifying a given input as real or fake. In terms of Goodfellow, we wish to "update the discriminator by ascending its stochastic gradient". Practically, we want to maximize $\log(D(x)) + \log(1-D(G(z)))$. Due to the separate mini-batch suggestion from ganhacks, we will calculate this in two steps. First, we construct a batch of real samples from the training set, forward pass through $D$, calculate the loss ($\log(D(x))$), then calculate the gradients in a backward pass. Second, we construct a batch of fake samples with the current generator, forward pass this batch through $D$, calculate the loss ($\log(1-D(G(z)))$), and accumulate the gradients with another backward pass. Now, with the gradients accumulated from both the all-real and all-fake batches, we call a step of the discriminator's optimizer.

**Part 2 - Train the Generator**

As stated in the original paper, we want to train the generator by minimizing $\log(1-D(G(z)))$ in an effort to generate better fakes. As mentioned, Goodfellow showed this does not provide sufficient gradients, especially early in the learning process. As a fix, we instead wish to maximize $\log(D(G(z)))$. In the code we accomplish this by: classifying the generator output from Part 1 with the discriminator, computing G's loss using real labels as the ground truth, computing G's gradients in a backward pass, and finally updating G's parameters with an optimizer step. It may seem counter-intuitive to use the real labels as the ground-truth labels for the loss function, but this allows us to use the $\log(x)$ part of BCELoss (rather than the $\log(1-x)$ part), which is exactly what we want.
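To see why this fix helps, consider the gradients with respect to the discriminator's pre-sigmoid logit $\ell$ (a short derivation of my own, consistent with the argument in Goodfellow's paper). Early in training, $D$ confidently rejects fakes, so $\sigma(\ell) = D(G(z)) \approx 0$:

$$\frac{\partial}{\partial \ell}\log\big(1-\sigma(\ell)\big) = -\sigma(\ell) \approx 0, \qquad \frac{\partial}{\partial \ell}\log \sigma(\ell) = 1-\sigma(\ell) \approx 1$$

The original objective's gradient vanishes exactly when the generator most needs a learning signal, while the maximized $\log(D(G(z)))$ objective keeps it near 1.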

Finally, we will do some statistic reporting, and at the end of each epoch we will push our fixed_noise batch through the generator to visually track the progress of G's training. The training statistics reported are:

Loss_D - discriminator loss, calculated as the sum of losses on the all-real and all-fake batches ($\log(D(x)) + \log(1-D(G(z)))$).
Loss_G - generator loss, calculated as $\log(D(G(z)))$.
D(x) - the average output (across the batch) of the discriminator for the all-real batch. This should start close to 1, then theoretically converge to 0.5 as G gets better. Think about why this is.
D(G(z)) - the average discriminator output for the all-fake batch. The first number printed is before D is updated and the second is after D is updated. These numbers should start near 0, then converge to 0.5 as the generator gets better. Think about why this is.

Note: this step might take a while, depending on how many epochs you run and whether you removed some data from the dataset.

# Training Loop

# Lists to keep track of progress
img_list = []
G_losses = []
D_losses = []
iters = 0

print("Starting Training Loop...")
# For each epoch
for epoch in range(num_epochs):
    # For each batch in the dataloader
    for i, data in enumerate(dataloader, 0):

        ############################
        # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
        ###########################
        ## Train with all-real batch
        netD.zero_grad()
        # Format batch
        real_cpu = data[0].to(device)
        b_size = real_cpu.size(0)
        # Use float labels so they are compatible with BCELoss (newer PyTorch
        #   versions make torch.full with an int fill value an integer tensor)
        label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
        # Forward pass real batch through D
        output = netD(real_cpu).view(-1)
        # Calculate loss on all-real batch
        errD_real = criterion(output, label)
        # Calculate gradients for D in backward pass
        errD_real.backward()
        D_x = output.mean().item()

        ## Train with all-fake batch
        # Generate batch of latent vectors
        noise = torch.randn(b_size, nz, 1, 1, device=device)
        # Generate fake image batch with G
        fake = netG(noise)
        label.fill_(fake_label)
        # Classify all fake batch with D
        output = netD(fake.detach()).view(-1)
        # Calculate D's loss on the all-fake batch
        errD_fake = criterion(output, label)
        # Calculate the gradients for this batch
        errD_fake.backward()
        D_G_z1 = output.mean().item()
        # Add the gradients from the all-real and all-fake batches
        errD = errD_real + errD_fake
        # Update D
        optimizerD.step()

        ############################
        # (2) Update G network: maximize log(D(G(z)))
        ###########################
        netG.zero_grad()
        label.fill_(real_label)  # fake labels are real for generator cost
        # Since we just updated D, perform another forward pass of all-fake batch through D
        output = netD(fake).view(-1)
        # Calculate G's loss based on this output
        errG = criterion(output, label)
        # Calculate gradients for G
        errG.backward()
        D_G_z2 = output.mean().item()
        # Update G
        optimizerG.step()

        # Output training stats
        if i % 50 == 0:
            print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
                  % (epoch, num_epochs, i, len(dataloader),
                     errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))

        # Save Losses for plotting later
        G_losses.append(errG.item())
        D_losses.append(errD.item())

        # Check how the generator is doing by saving G's output on fixed_noise
        if (iters % 500 == 0) or ((epoch == num_epochs-1) and (i == len(dataloader)-1)):
            with torch.no_grad():
                fake = netG(fixed_noise).detach().cpu()
            img_list.append(vutils.make_grid(fake, padding=2, normalize=True))

        iters += 1

Results

Finally, let's check out how we did. Here, we will look at three different results. First, we will see how D and G's losses changed during training. Second, we will visualize G's output on the fixed_noise batch over the course of training. And third, we will look at a batch of real data next to a batch of fake data from G.

Loss versus training iteration

Below is a plot of D & G's losses versus training iterations.

plt.figure(figsize=(10,5))
plt.title("Generator and Discriminator Loss During Training")
plt.plot(G_losses,label="G")
plt.plot(D_losses,label="D")
plt.xlabel("iterations")
plt.ylabel("Loss")
plt.legend()
plt.show()

Visualization of G's progress

Remember how we saved the generator's output on the fixed_noise batch periodically during training. Now, we can visualize G's training progression with an animation. Press the play button to start the animation.

#%%capture
fig = plt.figure(figsize=(8,8))
plt.axis("off")
ims = [[plt.imshow(np.transpose(i,(1,2,0)), animated=True)] for i in img_list]
ani = animation.ArtistAnimation(fig, ims, interval=1000, repeat_delay=1000, blit=True)

HTML(ani.to_jshtml())

Real images vs. fake images

Finally, let's take a look at some real images and fake images side by side.

# Grab a batch of real images from the dataloader
real_batch = next(iter(dataloader))

# Plot the real images
plt.figure(figsize=(15,15))
plt.subplot(1,2,1)
plt.axis("off")
plt.title("Real Images")
plt.imshow(np.transpose(vutils.make_grid(real_batch[0].to(device)[:64], padding=5, normalize=True).cpu(),(1,2,0)))

# Plot the fake images from the last epoch
plt.subplot(1,2,2)
plt.axis("off")
plt.title("Fake Images")
plt.imshow(np.transpose(img_list[-1],(1,2,0)))
plt.show()

Where to go next

Our journey has come to an end, but there are several places you could go from here. You could:

Train for longer to see how good the results get
Modify this model to take a different dataset, and possibly change the image size and the model architecture (a starting-point sketch is shown below)
Check out some other cool GAN projects here
Create GANs that generate music
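As a starting point for the dataset-swap suggestion above, here is a minimal sketch of my own (the dataset choice, its path, and the single-channel normalization are assumptions, not part of the tutorial) showing how one might point the same pipeline at grayscale MNIST; nc would also need to be set to 1 where the Generator and Discriminator are defined:

import torch.utils.data
import torchvision.datasets as dset
import torchvision.transforms as transforms

# Hypothetical swap: grayscale MNIST instead of CelebA (assumes nc = 1 above)
dataset = dset.MNIST(root="data/mnist", download=True,
                     transform=transforms.Compose([
                         transforms.Resize(64),                  # keep the 64x64 spatial size
                         transforms.ToTensor(),
                         transforms.Normalize((0.5,), (0.5,)),   # one channel, one mean/std
                     ]))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128,
                                         shuffle=True, num_workers=2)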
