Hands-on deep learning: language models and the dataset

A language model can be used to evaluate whether a text sequence is reasonable, that is, to compute the probability of the sequence $P(w_1, w_2, \ldots, w_T)$. Among such models, statistical language models based on Markov chains are widely used in natural language processing. This post introduces the underlying theory and the form of the input dataset.

Language model

  1. Language model
    Given a sequence $(w_1, w_2, \ldots, w_T)$, the probability of its occurrence is
    $$\begin{aligned} P(w_1,w_2,\ldots,w_T) &= \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1}) \\ &= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_T \mid w_1 w_2 \cdots w_{T-1}) \end{aligned}$$
    For a specific corpus, these (conditional) probabilities can be estimated by the relative frequency of the corresponding word or word combination in the training dataset.
  2. (n−1)-th order Markov chain
    The model above, which conditions each word on its full history, suffers from two problems: an overly large parameter space and data sparsity. The former means that each new combination such as $(w_1)$, $(w_1, w_2)$, ... introduces new parameters, so the number of parameters grows very quickly; the latter means that long prefixes $w_1, w_2, \ldots, w_{T-1}$ rarely appear in a specific training set, so their frequencies are very low or zero. To alleviate this, we introduce the Markov assumption.
    A model based on a Markov chain of order n−1 assumes that the current word depends only on the previous n−1 words; this gives the n-gram model. For example, when n = 2, $P(w_3 \mid w_1, w_2) = P(w_3 \mid w_2)$. A small sketch of estimating such probabilities by relative frequency follows this list.
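
To make the relative-frequency estimation concrete, here is a minimal sketch of estimating bigram (n = 2) probabilities from a toy token list. The token list and the helper name bigram_probs are made up for illustration; they are not part of the dataset used below.

from collections import Counter

def bigram_probs(tokens):
    # Estimate P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1}) by relative frequency
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens[:-1], tokens[1:]))
    return {pair: cnt / unigram[pair[0]] for pair, cnt in bigram.items()}

tokens = ['i', 'want', 'a', 'helicopter', 'i', 'want', 'to', 'fly']
print(bigram_probs(tokens)[('i', 'want')])  # 1.0: in this toy corpus 'want' always follows 'i'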

Input data set

  1. Read data set
with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
    corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[: 40])
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[: 10000]
"""
63282
 Want a helicopter
 Want to fly to the universe with you
 Want to melt with you
 Melting in the universe
 Every day, every day
"""
  2. Build character index
idx_to_char = list(set(corpus_chars))  # De-duplicate to get the index-to-character mapping
char_to_idx = {char: i for i, char in enumerate(idx_to_char)} # Character to index mapping
vocab_size = len(char_to_idx)
print(vocab_size)

corpus_indices = [char_to_idx[char] for char in corpus_chars]  # Turn each character into an index to get a sequence of indexes
sample = corpus_indices[: 20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)
"""
1027
chars: Want to have a helicopter want to fly with you to the universe want and
indices: [1022, 648, 1025, 366, 208, 792, 199, 1022, 648, 641, 607, 625, 26, 155, 130, 5, 199, 1022, 648, 641]
"""

Note that the tokenization, mapping, and conversion steps used above were already encapsulated as utility functions in the previous chapter on text preprocessing; a sketch of such a wrapper is given below.
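
For reference, the reading and indexing steps could be collected into a single loader along those lines. This is a sketch only; the function name load_data_jay_lyrics is an assumption, not a guaranteed API from the previous chapter.

def load_data_jay_lyrics(path='/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt'):
    # Sketch of a loader wrapping the steps above (hypothetical name)
    with open(path) as f:
        corpus_chars = f.read()
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')[:10000]
    idx_to_char = list(set(corpus_chars))
    char_to_idx = {char: i for i, char in enumerate(idx_to_char)}
    corpus_indices = [char_to_idx[char] for char in corpus_chars]
    return corpus_indices, char_to_idx, idx_to_char, len(char_to_idx)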

Sampling of time series data

During training, we read a random minibatch of samples and labels at each step. A sample of time-series data usually consists of several consecutive characters, with the number of characters determined by the number of time steps. However, for a sequence of length T and num_steps n, there are T − n legal samples whose starting positions differ by only one, so they overlap heavily; efficient sampling methods are therefore needed. Two are introduced here: random sampling and adjacent sampling. For example, a sequence of length 30 with num_steps = 6 admits 24 overlapping samples but only 4 non-overlapping ones, as the code below illustrates.

  1. Random sampling
import torch
import random
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Minus 1 because for a sequence of length L, X can contain at most the first L - 1 characters (Y is X shifted one step to the right)
    num_examples = (len(corpus_indices) - 1) // num_steps  # Number of non-overlapping samples, obtained by integer division
    example_indices = [i * num_steps for i in range(num_examples)]  # Starting index of each sample in corpus_indices
    print(example_indices)
    random.shuffle(example_indices)  # The key to random sampling: shuffle the sample starting indices
    print(example_indices)

    def _data(i):
        # Return the subsequence of num_steps characters starting at position i
        return corpus_indices[i: i + num_steps]
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    for i in range(0, num_examples, batch_size):
        print(num_examples,batch_size,i)
        # Each time, select batch_size random samples
        batch_indices = example_indices[i: i + batch_size]  # Starting indices of the samples in the current batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)

my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
"""
[0, 6, 12, 18]
[0, 18, 6, 12]
4 2 0
X:  tensor([[ 0,  1,  2,  3,  4,  5],
        [18, 19, 20, 21, 22, 23]]) 
Y: tensor([[ 1,  2,  3,  4,  5,  6],
        [19, 20, 21, 22, 23, 24]]) 

4 2 2
X:  tensor([[ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]]) 
Y: tensor([[ 7,  8,  9, 10, 11, 12],
        [13, 14, 15, 16, 17, 18]]) 
"""
  2. Adjacent sampling
def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # Trim the length to a multiple of batch_size
    corpus_indices = corpus_indices[: corpus_len]  # Keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    print(indices)
    indices = indices.view(batch_size, -1)  # Reshape into (batch_size, -1): the key to adjacent sampling
    print(indices)
    batch_num = (indices.shape[1] - 1) // num_steps
    print(batch_num)
    for i in range(batch_num):
        i = i * num_steps
        print(i)
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

my_seq = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=2):
    print('X: ', X, '\nY:', Y, '\n')

"""
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])
2
0
X:  tensor([[0, 1],
        [5, 6]]) 
Y: tensor([[1, 2],
        [6, 7]]) 

2
X:  tensor([[2, 3],
        [7, 8]]) 
Y: tensor([[3, 4],
        [8, 9]])
"""

A few closing words

As usual, a few questions to think about:

  1. What is an n-gram language model and what are its drawbacks? What is a Markov chain?
  2. What is the purpose of sampling time-series data? What is the difference between random sampling and adjacent sampling? What does the yield keyword do?
