[NLP] Text Representation in Practice

In the previous article, we introduced text representation in NLP: https://blog.csdn.net/Prepared/article/details/94864658

That article had no code, so in this post let's put it into practice!

Common Chinese word segmentation tools include the Jieba model and Baidu's LAC model. Here, we use Jieba for Chinese word segmentation.

Dataset: People's Daily articles from May 1946. Dataset address: https://github.com/fangj/rmrb/tree/master/example/may 1946

Jieba model:

Jieba aims to be the best Python Chinese word segmentation component and has the following three features:

  • Supports three segmentation modes: exact mode, full mode, and search engine mode (see the sketch after this list)
  • Supports traditional Chinese word segmentation
  • Supports custom dictionaries
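
A minimal sketch of the three modes (assumes jieba is installed, e.g. pip install jieba; the sample sentence is only an illustration):

import jieba

sentence = "我来到北京清华大学"  # sample sentence, for illustration only

# Exact mode: the default, most accurate segmentation, suited for text analysis
print("[exact mode]: " + "/".join(jieba.lcut(sentence)))

# Full mode: lists every possible word, fast but redundant
print("[full mode]: " + "/".join(jieba.lcut(sentence, cut_all=True)))

# Search engine mode: re-segments long words, useful for building search indexes
print("[search engine mode]: " + "/".join(jieba.lcut_for_search(sentence)))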

Step 1: read all the files to build the corpus

import os
import re
import jieba

def getCorpus(self, rootDir):
    corpus = []
    # Filter out letters, digits and punctuation (ASCII and full-width); customize the characters as needed
    r1 = u'[a-zA-Z0-9!"#$%&\'()*+,\\-./:：;<=>?@，。？★、…【】《》“”‘’！\\[\\]^_`{|}~]+'
    for file in os.listdir(rootDir):
        path = os.path.join(rootDir, file)
        if os.path.isfile(path):
            # Print file address
            # print(os.path.abspath(path))
            # np.fromfile is mainly for reading binary arrays, so plain open/read is used instead
            # filecontext = np.fromfile(os.path.abspath(path))
            with open(os.path.abspath(path), "r", encoding='utf-8') as f:
                filecontext = f.read()
                # Strip special symbols, line breaks and spaces before segmentation
                filecontext = re.sub(r1, '', filecontext)
                filecontext = filecontext.replace("\n", '')
                filecontext = filecontext.replace(" ", '')
                # Segment with jieba in search engine mode
                seg_list = jieba.lcut_for_search(filecontext)
                corpus += seg_list
                # print("[search engine mode]: " + "/".join(seg_list))
        elif os.path.isdir(path):
            # Recurse into subdirectories and merge their words into the corpus
            corpus += self.getCorpus(path)
    return corpus

1. Loop over the files in the folder;

2. Read each file to get its content (the commented-out np.fromfile variant is meant for binary arrays, so plain open/read is used);

3. Remove special symbols so that only the Chinese content remains;

4. Segment the text with Jieba; you can try the different segmentation modes (a usage sketch follows this list).
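
A minimal usage sketch (the class name TraversalFun comes from the code above, but its constructor signature and the local dataset path are assumptions, not from the original post):

# Hypothetical usage; constructor arguments and the path are placeholders
tra = TraversalFun()
corpus = tra.getCorpus("./data/rmrb/1946-05")
print("Corpus size:", len(corpus))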

Step 2: calculate information entropy

Formula of information entropy: H(X) = -∑ᵢ p(xᵢ) · log₂ p(xᵢ), where p(xᵢ) is the frequency of word xᵢ divided by the total number of words in the corpus.
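
As a tiny worked example (an illustration, not from the dataset): for a corpus of four tokens ['a', 'a', 'b', 'c'], p(a) = 0.5 and p(b) = p(c) = 0.25, so H = -(0.5·log₂0.5 + 0.25·log₂0.25 + 0.25·log₂0.25) = 0.5 + 0.5 + 0.5 = 1.5 bits.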

# Construct a dictionary, count the frequency of each word, and calculate the information entropy
from math import log

def calc_tf(corpus):
    # Count the frequency of each word
    word_freq_dict = dict()
    for word in corpus:
        if word not in word_freq_dict:
            word_freq_dict[word] = 0
        word_freq_dict[word] += 1
    # Sort the words by number of occurrences, highest first
    word_freq_dict = sorted(word_freq_dict.items(), key=lambda x: x[1], reverse=True)
    # Word frequency (TF) probabilities
    word_tf = dict()
    # Information entropy
    shannoEnt = 0.0
    # Traverse the words from high to low frequency
    for word, freq in word_freq_dict:
        # Calculate p(xi)
        prob = freq / len(corpus)
        word_tf[word] = prob
        # Accumulate -p(xi) * log2(p(xi))
        shannoEnt -= prob * log(prob, 2)
    return word_tf, shannoEnt

Approach:

1. Loop over the corpus and count each word's frequency: if the word is not yet in the dictionary, initialize its count, then add 1 for every occurrence.

2. Calculate the information entropy: loop over the word probabilities and accumulate the entropy according to the formula above (a quick check on a toy corpus follows this list).
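
A quick sanity check on a toy corpus (an illustration matching the worked example above, not part of the original post):

toy_corpus = ['a', 'a', 'b', 'c']
word_tf, entropy = calc_tf(toy_corpus)
print(word_tf)   # {'a': 0.5, 'b': 0.25, 'c': 0.25}
print(entropy)   # 1.5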

Result:

Corpus size: 163039

Information entropy: 14.047837802885464

Give it a try yourself!

The road ahead is long; keep pushing forward. [WeChat official account] Prepared

Jieba model: https://github.com/fxsjy/jieba

Source address: https://github.com/zhongsb/NLP-learning.git
