A language model can be used to evaluate whether a text sequence is reasonable, that is, to compute the probability of the sequence $P(w_1, w_2, \ldots, w_T)$. Among such models, the statistical language model based on the Markov chain is widely used in natural language processing. This article introduces the theory of the Markov chain and the form of the input data set.

## Language model

- Language model

Given a sequence $(w_1, w_2, \ldots, w_T)$, the probability of its occurrence is

$$
\begin{aligned}
P(w_1,w_2,\ldots,w_T)&=\prod_{t=1}^{T}{P(w_t|w_1,w_2,\ldots,w_{t-1})}\\
&=P(w_1)P(w_2|w_1)P(w_3|w_1w_2)\cdots P(w_T|w_1w_2\cdots w_{T-1})
\end{aligned}
$$
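Each factor in this product can be estimated from counts. A minimal sketch on a hypothetical toy corpus (the corpus and word choices here are illustrative, not from the original text):

```python
from collections import Counter

# Hypothetical toy corpus; any tokenized training text works the same way.
corpus = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

# P(w) estimated as the relative frequency of w in the training data.
p_the = unigram_counts["the"] / total

# P(w2 | w1) estimated as count(w1, w2) / count(w1).
p_cat_given_the = bigram_counts[("the", "cat")] / unigram_counts["the"]

print(p_the)            # 3/9 ≈ 0.333
print(p_cat_given_the)  # 2/3 ≈ 0.667
```

Longer contexts such as $P(w_3|w_1w_2)$ would be estimated the same way from trigram counts, which already hints at the sparsity problem discussed next.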

For a specific corpus, each of these probabilities can be estimated as the relative frequency of the corresponding word (or word sequence) in the training data set.

- n-th order Markov chain

The model described above is called the n-gram model, and it has two problems: an overly large parameter space and data sparsity. The former means that every new context, such as $w_1$ or $(w_1, w_2)$, introduces its own set of parameters, so the number of parameters grows very quickly with the context length. The latter means that for a long context $w_1, w_2, \ldots, w_{T-1}$ it is hard to find matching occurrences in a specific training set, so the counts will be very low or zero. To alleviate these problems, we use the Markov assumption.

The (n-1)-th order Markov chain model is based on the assumption that the occurrence of the current word depends only on the previous n-1 words. For example, when $n=2$, $P(w_3|w_1,w_2)=P(w_3|w_2)$.
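Under this assumption a bigram ($n=2$) model only needs pairwise counts. A minimal sketch, where the toy corpus and the `<s>` start token are assumptions introduced for illustration:

```python
from collections import Counter

# "<s>" is a hypothetical start token so the first word also has a context.
corpus = "<s> i like tea <s> i like coffee".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    """P(w1,...,wT) ≈ ∏ P(w_t | w_{t-1}) under the first-order Markov assumption."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram[(prev, cur)] / unigram[prev]
    return p

print(bigram_prob("i like tea"))  # 1.0 * 1.0 * 0.5 = 0.5
```

Note how the full-history conditionals from the chain rule have been replaced by single-word contexts, which is exactly what keeps the parameter space and the counts manageable.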

## Input data set

- Read data set

```python
with open('/home/kesci/input/jaychou_lyrics4703/jaychou_lyrics.txt') as f:
    corpus_chars = f.read()
print(len(corpus_chars))
print(corpus_chars[:40])
corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[:10000]
"""
63282
Want a helicopter Want to fly to the universe with you Want to melt with you Melting in the universe Every day, every day
"""
```

- Build character index

```python
idx_to_char = list(set(corpus_chars))  # deduplicate to get the index-to-character mapping
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}  # character-to-index mapping
vocab_size = len(char_to_idx)
print(vocab_size)
corpus_indices = [char_to_idx[char] for char in corpus_chars]  # map each character to its index to get an index sequence
sample = corpus_indices[:20]
print('chars:', ''.join([idx_to_char[idx] for idx in sample]))
print('indices:', sample)
"""
1027
chars: Want to have a helicopter want to fly with you to the universe want
indices: [1022, 648, 1025, 366, 208, 792, 199, 1022, 648, 641, 607, 625, 26, 155, 130, 5, 199, 1022, 648, 641]
"""
```

Note that the tokenization, mapping, and conversion utility functions were already encapsulated in the previous chapter on text preprocessing.

## Sampling of time series data

During training, we read a random minibatch of samples and labels each time. A sample of time-series data usually consists of consecutive characters, whose number is determined by the time-step setting. For a sequence of length T with n time steps, there are T - n legal samples, and these samples overlap heavily, so an efficient sampling method is needed. Here are two: random sampling and adjacent sampling.
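The counting argument above can be checked on a tiny toy sequence (an illustration only, separate from the sampling functions that follow):

```python
# For a sequence of length T and time step n, every start position
# 0 .. T-n-1 yields a legal (X, Y) pair where Y is X shifted by one,
# and neighbouring windows overlap in n - 1 positions.
seq = list(range(8))  # T = 8
n = 3                 # number of time steps
samples = [(seq[i:i + n], seq[i + 1:i + n + 1]) for i in range(len(seq) - n)]
print(len(samples))   # T - n = 5
for X, Y in samples:
    print(X, Y)
```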

- Random sampling

```python
import torch
import random

def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Minus 1 because the label Y is X shifted one position to the right,
    # so X can use at most the first len - 1 characters
    num_examples = (len(corpus_indices) - 1) // num_steps  # number of non-overlapping samples, rounded down
    example_indices = [i * num_steps for i in range(num_examples)]  # index of the first character of each sample in corpus_indices
    print(example_indices)
    random.shuffle(example_indices)  # the key to random sampling is here
    print(example_indices)

    def _data(i):
        # return the sequence of num_steps characters starting at i
        return corpus_indices[i: i + num_steps]

    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(0, num_examples, batch_size):
        print(num_examples, batch_size, i)
        # pick batch_size random samples each time
        batch_indices = example_indices[i: i + batch_size]  # first-character indices of the samples in the current batch
        X = [_data(j) for j in batch_indices]
        Y = [_data(j + 1) for j in batch_indices]
        yield torch.tensor(X, device=device), torch.tensor(Y, device=device)

my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
"""
[0, 6, 12, 18]
[0, 18, 6, 12]
4 2 0
X:  tensor([[ 0,  1,  2,  3,  4,  5],
        [18, 19, 20, 21, 22, 23]])
Y: tensor([[ 1,  2,  3,  4,  5,  6],
        [19, 20, 21, 22, 23, 24]])

4 2 2
X:  tensor([[ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]])
Y: tensor([[ 7,  8,  9, 10, 11, 12],
        [13, 14, 15, 16, 17, 18]])
"""
```

- Adjacent sampling

```python
def data_iter_consecutive(corpus_indices, batch_size, num_steps, device=None):
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    corpus_len = len(corpus_indices) // batch_size * batch_size  # length after truncating to a multiple of batch_size
    corpus_indices = corpus_indices[: corpus_len]  # keep only the first corpus_len characters
    indices = torch.tensor(corpus_indices, device=device)
    print(indices)
    indices = indices.view(batch_size, -1)  # reshape to (batch_size, -1); the key to adjacent sampling
    print(indices)
    batch_num = (indices.shape[1] - 1) // num_steps
    print(batch_num)
    for i in range(batch_num):
        i = i * num_steps
        print(i)
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

my_seq = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=2):
    print('X: ', X, '\nY:', Y, '\n')
"""
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])
2
0
X:  tensor([[0, 1],
        [5, 6]])
Y: tensor([[1, 2],
        [6, 7]])

2
X:  tensor([[2, 3],
        [7, 8]])
Y: tensor([[3, 4],
        [8, 9]])
"""
```

## Some words

As usual, a few questions to review:

- What is the n-gram language model and what are its disadvantages? What is a Markov chain?
- Why do we need special sampling methods for time-series data? What is the difference between random sampling and adjacent sampling? What does the `yield` keyword do?