Chinese text classification practice

This blog systematically introduces the process and related algorithms of Chinese text classification. Starting from the background of text mining and centering on text classification algorithms, it walks through a Chinese text classification project and the related knowledge: Chinese word segmentation, the vector space model, the TF-IDF method, several typical text classification algorithms, and evaluation metrics.

This article mainly includes:

  • Naive Bayes algorithm
  • KNN nearest-neighbor algorithm

2.1 concept of text mining and text classification

In short, text mining is the process of extracting previously unknown and potentially useful knowledge from large amounts of text data, that is, the process of finding knowledge in unstructured text. The main fields of text mining are:

  1. Search and information retrieval: storage and retrieval of text documents, including search engines and keyword search.
  2. Text clustering: using clustering methods to group words, fragments, paragraphs or documents.
  3. Text classification: assigning fragments, paragraphs or documents to predefined categories, based on classification models trained on labeled samples with data-mining methods.
  4. Web mining: mining data and text on the Internet, with special attention to the scale and interconnection of the web.
  5. Information extraction: identifying facts and relationships in unstructured text; extracting structured data from unstructured or semi-structured text.
  6. Natural language processing: treating language as a meaningful, rule-governed symbol system, and handling the low-level tasks of parsing and understanding it.
  7. Concept extraction: grouping words and phrases into semantically similar groups.

In the analysis of machine learning data sources, the most common knowledge-discovery task is to assign data objects or events to predetermined categories and then process them according to category; this is the basic task of a classification system. To achieve it, we first need a set of categories, and then a collection of texts corresponding to those categories to form the training set. The training set contains both the classified text files and their category labels. Today, automatic text classification is widely used in text retrieval, spam filtering, web page hierarchical directories, automatic metadata generation, topic detection and many other applications.

At present there are two main approaches to text classification: pattern-based systems and classification models. A pattern-based system, also known as an expert system, encodes classification knowledge as rules such as regular expressions. A classification model, i.e. machine learning, is a generalized induction process that builds a classifier by training on a set of pre-classified examples. Because the number of documents is growing exponentially, the trend is shifting toward machine learning, that is, automatic classification.

2.2 text classification project

The technology and workflow of Chinese text classification mainly include the following steps (don't worry if some of them are unfamiliar now; each will be explained in detail later):

  1. Preprocessing: remove noise from the text (e.g. HTML tags), convert text formats, detect sentence boundaries, etc.
  2. Chinese word segmentation: segment the text with a Chinese word segmenter and remove stop words.
  3. Construction of the word vector space: count word frequencies and generate the word vector space of the text.
  4. Weighting - TF-IDF method: use TF-IDF to find and extract feature words that reflect the document's topic.
  5. Classifier: train a classifier with a classification algorithm.
  6. Evaluation of classification results: analyze the classifier's test results.

2.2.1 text preprocessing

The task of text preprocessing is to transform unstructured and semi-structured text into structured form, that is, vector space model. Document preprocessing includes the following steps:

1. Select the range of text to process

For a long document, we need to decide whether to use the whole document or to cut it into sections, paragraphs or sentences. The appropriate scope depends on the goal of the text mining task: for classification or clustering, the whole document is often treated as the processing unit; for sentiment analysis, automatic document summarization or information retrieval, paragraphs or sections may be more appropriate.

2. Establish a classified text corpus.

Text corpora are generally divided into two categories:

1) training corpus

A training corpus is a collection of text resources that have already been classified. At present, good Chinese classification corpora include Tan Songbo's corpus and Sogou's news classification corpus; both can be found with a web search.

2) test set corpus

The test corpus is the text to be classified. It can be a held-out part of the training set, or text from external sources. External sources are relatively unconstrained; in practice, real projects usually need to classify new, unseen text.

There are many ways to obtain classified text resources, for example through companies, libraries, or even second-hand marketplaces such as Taobao's Xianyu. Of course, the best way is through the web: typically web text is obtained in batches and downloaded with a web crawler, which is a mature technique.

3. Text format conversion

Whatever processing method is adopted, texts in different formats, such as web pages, PDFs and image files, must first be converted into plain text.

Take web page text as an example. Whether our task is classification, clustering or information extraction, the basic job is to find ways to extract knowledge from the text. Some parts of the text, such as HTML <table></table> content, are generally structured and therefore useless for a machine-learning classification system, but very valuable for a pattern-based system. If such a system is involved in the classification, the tables should be kept or extracted as auxiliary classification evidence when the HTML tags are removed.

After filtering out these meaningful tags, we remove the remaining HTML tags and convert the text to semi-structured text in TXT or XML format. To improve performance, Python programs generally use the lxml library to strip HTML tags. lxml is an XML processing library written in C, with much higher performance than the re regular-expression module, and it is well suited to this kind of format conversion on massive collections of web documents.

Python code (install lxml first: python -m pip install lxml):

from lxml import html

# Path to the saved web page (file name kept from the original article)
path = "C:\\Users\\Administrator\\Desktop\\Baidu once, you will know.html"

content = open(path, "rb").read()
page = html.document_fromstring(content)  # parse the HTML document
text = page.text_content()                # extract the text, dropping all tags
print(text)

4. Sentence boundary detection: marking the ends of sentences

Sentence boundary detection is the process of decomposing a document into individual sentences. For Chinese text, punctuation marks such as "。", "？" and "！" are used as sentence delimiters. However, with the prevalence of English, the Western period "." is also used to end sentences, and it is easily confused with abbreviations; breaking a sentence at every "." would introduce errors. In such cases, heuristic rules or statistical classification techniques can be used to identify most sentence boundaries correctly.
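As an illustration of the rule-based approach, here is a minimal sentence splitter (a sketch only; it assumes Python 3.7+ so that re.split accepts a zero-width pattern). It cuts after Chinese and Western end-of-sentence punctuation and makes no attempt to handle abbreviations, which is exactly where heuristic rules or a statistical model would be needed:

import re

def split_sentences(text):
    # Cut right after each end-of-sentence mark, keeping the mark
    # attached to its sentence. The Western "." is included here, so
    # abbreviations such as "Mr." will be split incorrectly.
    parts = re.split(r"(?<=[。！？!?.])", text)
    return [p.strip() for p in parts if p.strip()]

if __name__ == "__main__":
    sample = "今天天气不错。你吃饭了吗？This is a test. Mr. Smith said hello!"
    for s in split_sentences(sample):
        print(s)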

2.2.2 introduction to Chinese word segmentation

Chinese word segmentation is the task of cutting a sequence of Chinese characters into individual words. In English, spaces serve as natural delimiters between words; in Chinese, characters, sentences and paragraphs can be delimited by obvious marks, but there is no formal delimiter for individual words. Chinese word segmentation is therefore a major issue for Chinese text classification and one of the core problems of Chinese natural language processing.

Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy has a large impact on all downstream modules. Across the whole field, the structured representation of texts or sentences is the core task of language processing. At present, structured text representations fall roughly into four categories: the word vector space model, topic models, tree representations from dependency parsing, and graph representations such as RDF. All four representations are built on top of word segmentation.

Here we use jieba for word segmentation. jieba is small and efficient, a word segmentation system developed specifically in Python; it uses few resources and is more than adequate for non-specialized documents.

Install jieba: python -m pip install jieba

Python code to test jieba:

import jieba

# Precise mode (cut_all=False)
seg_list = jieba.cut("Xiaoming graduated from Tsinghua University in 2019", cut_all=False)
print("Default Mode: ", " ".join(seg_list))

# cut_all=False is the default, so this is equivalent
seg_list = jieba.cut("Xiaoming graduated from Tsinghua University in 2019")
print("Default mode: ", " ".join(seg_list))

# Full mode (cut_all=True)
seg_list = jieba.cut("Xiaoming graduated from Tsinghua University in 2019", cut_all=True)
print("Full Mode: ", " ".join(seg_list))

# Search engine mode
seg_list = jieba.cut_for_search("Xiaoming graduated from computer graduate of Chinese Academy of Sciences, and later went to work in Huawei")
print("Search mode: ", "/ ".join(seg_list))


The following code uses jieba to segment the classification corpus.


import sys
import os
import jieba

def savefile(savepath,content):
    fp = open(savepath,"wb")
    fp.write(content)
    fp.close()

def readfile(path):
    fp = open(path,"rb")
    content = fp.read()
    fp.close()
    return content

#Main procedure: segment every document in the corpus
#Path to the unsegmented corpus
corpus_path = "C:\\Users\\Administrator\\Desktop\\train_corpus_small\\"
#Path where the segmented corpus will be saved
seg_path = "C:\\Users\\Administrator\\Desktop\\train_corpus_seg\\"

#Get all subdirectories under corpus path
catelist = os.listdir(corpus_path)
print("Corpus subdirectory:",catelist)

#Get all files in each directory
for mydir in catelist:
    class_path = corpus_path+mydir+"\\"# Path of the category subdirectory
    seg_dir = seg_path+mydir+"\\"# Category directory in the segmented corpus
    if not os.path.exists(seg_dir):
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)# Files under the category directory
    for file_path in file_list:# Traverse the files in the category directory
        fullname = class_path + file_path# Full path of the file
        content = readfile(fullname).strip()# Read the file contents
        content = content.replace("\r\n".encode(),"".encode()).strip()# Remove line breaks
        content_seg = jieba.cut(content)# Segment the document content
        # Save the processed file to the segmented-corpus directory
        savefile(seg_dir+file_path," ".join(content_seg).encode())

print("End of segmentation in Chinese Corpus!!!")


Of course, in practical applications, to make it easier to generate the vector space model later, the segmented text is converted into a text-vector object and persisted. Here we use the Bunch data structure from the scikit-learn library. On Windows, scikit-learn can be installed with: python -m pip install scikit-learn

# -*- coding: utf-8 -*-

import sys  
import os 
import jieba
import pickle
from sklearn.datasets.base import Bunch

# Save to file
def savefile(savepath,content):
    fp = open(savepath,"wb")
    fp.write(content)
    fp.close()

# Read file
def readfile(path):
    fp = open(path,"rb")
    content = fp.read()
    fp.close()
    return content

# The Bunch class provides a key-value object with the fields:
# target_name: list of all category names
# label: classification label of each file
# filenames: file paths
# contents: the space-joined word list of each segmented document
bunch = Bunch(target_name=[],label=[],filenames=[],contents=[])

wordbag_path = "C:\\Users\\Administrator\\Desktop\\train_set.dat"# Path where the training-set word bag is saved
seg_path = "C:\\Users\\Administrator\\Desktop\\train_corpus_seg\\"# Segmented corpus path

catelist = os.listdir(seg_path)  # All subdirectories under seg_path
bunch.target_name.extend(catelist)
# Get all files in each directory
for mydir in catelist:
    class_path = seg_path+mydir+"/"    # Path of the category subdirectory
    file_list = os.listdir(class_path)    # All files under class_path
    for file_path in file_list:           # Traverse the files under the category directory
        fullname = class_path + file_path   # Full path of the file
        bunch.label.append(mydir)
        bunch.filenames.append(fullname)
        bunch.contents.append(readfile(fullname).strip())     # Read the file contents
      
#Object persistence                                                                                              
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch,file_obj)                      
file_obj.close()

print("End of building text object!!!")

2.2.3 introduction to the scikit-learn library


scikit-learn is a Python library for machine learning, built on SciPy.

1. Module classification: scikit-learn's functionality is organized into modules for classification, regression, clustering, dimensionality reduction, model selection and preprocessing.

2. Main features

  • Simple and efficient tools for data mining and data analysis
  • Accessible to everybody and reusable in various contexts
  • Built on NumPy, SciPy and matplotlib
  • Open source and commercially usable under the BSD license

http://scikit-learn.org provides many tutorial resources and algorithm source code and is a good place to learn; for example, the site gives the derivation of the naive Bayes formulas at https://scikit-learn.org/stable/modules/naive_bayes.html

You can also download the source code for the entire project from https://github.com/scikit-learn/scikit-learn.

2.2.4 vector space model

Vector space model is the basis of many related technologies, such as recommendation system, search engine and so on.

The vector space model represents a text as a vector, where each dimension corresponds to a term appearing in the text. Generally, every distinct string in the training set is treated as a dimension, including ordinary words, special terms, phrases and other pattern strings such as email addresses and URLs. Most text mining systems store text in this vector space representation because it is easy to feed to machine learning algorithms. The disadvantage is that for large-scale text classification it leads to an extremely high-dimensional space; the vector can easily reach millions of dimensions. Therefore, to save storage space and improve efficiency, certain words are automatically filtered out before classification. These filtered words are called stop words: common words with vague meaning, as well as modal particles, that contribute nothing as classification features.

Code to read the stop word list:

import sys
import os
import jieba
import pickle
def readfile(path):
    fp = open(path,"rb")
    content = fp.read()
    fp.close()
    return content.decode("utf-8")


stopword_path = "C:\\Users\\Administrator\\Desktop\\hlt_stop_words.txt"

stpwrdlist = readfile(stopword_path).splitlines()

print(stpwrdlist)

2.2.5 weight strategy: TF-IDF method

In the machine-learning foundations above we mentioned the vector space model, i.e. the bag-of-words model, which turns the words and pattern strings in a text into numbers and the whole text collection into a word vector space of equal dimensions. For example, suppose we have three texts:

Text 1: my dog ate my homework

Text 2: my cat ate the sandwich

Text 3: a dolphin ate the homework

Across the three texts there are 9 distinct words: a(1), ate(3), cat(1), dolphin(1), dog(1), homework(2), my(3), sandwich(1), the(2); the numbers in parentheses are the total frequencies. Intuitively, each text's word vector can be written in binary form (1 if the word occurs, 0 otherwise), with the dimensions ordered as above:

Text 1: 0,1,0,0,1,1,1,0,0

Text 2: 0,1,1,0,0,0,1,1,1

Text 3: 1,1,0,1,0,1,0,0,1

You can also use count vectors, where each entry is the word's frequency in that text:

Text 1: 0,1,0,0,1,1,2,0,0

Text 2: 0,1,1,0,0,0,1,1,1

Text 3: 1,1,0,1,0,1,0,0,1

Then the vectors are normalized, for example by dividing each count by the total number of words in the text so that the components of each vector sum to 1.
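The binary vectors, count vectors and simple normalization above can be reproduced with scikit-learn's CountVectorizer. This is only an illustrative sketch; the explicit vocabulary and token_pattern arguments are there solely so that the column order matches the hand-built example and the one-letter word "a" is not dropped by the default tokenizer:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["my dog ate my homework",
         "my cat ate the sandwich",
         "a dolphin ate the homework"]

# Fix the column order to match the hand-built example; the default
# token pattern drops one-letter words, so override it to keep "a".
vocab = ["a", "ate", "cat", "dolphin", "dog", "homework", "my", "sandwich", "the"]
cv = CountVectorizer(vocabulary=vocab, token_pattern=r"(?u)\b\w+\b")

counts = cv.fit_transform(texts).toarray()
print("Count vectors:\n", counts)                 # e.g. text 1 -> [0 1 0 0 1 1 2 0 0]
print("Binary vectors:\n", (counts > 0).astype(int))

# One simple normalization: divide each row by its total token count
norm = counts / counts.sum(axis=1, keepdims=True)
print("Normalized (rows sum to 1):\n", np.round(norm, 3))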

Here is another question: how do we reflect word frequency information in the word bag?

1. TF-IDF weight strategy

TF-IDF stands for term frequency-inverse document frequency. The idea is that if a word or phrase appears frequently in one article but rarely in other articles, it has good category-discriminating power and is suitable as a classification feature. For instance, in the example above the frequency of "my" in text 1 is 2, since it appears there twice, and it also appears in text 2. Raw term frequency would therefore give "my" a high weight, yet "my" occurs in almost every document and says little about the topic. The inverse document frequency uses a word's document frequency to offset the effect of raw term frequency on the weight, so that such words end up with a lower weight.

Term frequency (TF) refers to how often a given word appears in a document. The calculation formula is:

tf(i, j) = n(i, j) / Σ_k n(k, j)

where the numerator n(i, j) is the number of times word i appears in document j, and the denominator is the total number of word occurrences in that document.

Inverse document frequency (IDF) is a measure of a word's general importance. The calculation formula is:

idf(i) = log( |D| / |{j : word i appears in document j}| )

where |D| is the total number of documents in the corpus and |{j : word i appears in document j}| is the number of documents containing the word. If the word does not appear in the corpus at all, this denominator would be zero, so in practice

|{j : word i appears in document j}| + 1

is used as the denominator. The TF-IDF weight is then the product of TF and IDF.
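To make the formulas concrete, here is a small hand-rolled computation on the three example texts (a sketch for illustration only; the scikit-learn implementation used below applies a slightly different smoothed IDF and normalization):

import math

docs = [["my", "dog", "ate", "my", "homework"],
        ["my", "cat", "ate", "the", "sandwich"],
        ["a", "dolphin", "ate", "the", "homework"]]

def tf(term, doc):
    # term count divided by the total number of words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (total documents / documents containing the term);
    # the +1 in the denominator guards against division by zero
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + df))

for term in ("my", "dog"):
    for i, doc in enumerate(docs, start=1):
        weight = tf(term, doc) * idf(term, docs)
        print(f"tf-idf({term!r}, text {i}) = {weight:.3f}")

With this toy corpus the common word "my" gets weight 0 in every text, while "dog" keeps a positive weight only in text 1, which is exactly the behavior described above.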

python code implementation:

import sys
import os
from sklearn.datasets.base import Bunch#Introduce Bunch class
import pickle #Introducing persistence classes
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

#read Bunch object
def readbunchobj(path):
    file_obj = open(path,"rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

#Write in Bunch object
def writebunchobj(path,bunchobj):
    file_obj = open(path,"wb")
    pickle.dump(bunchobj,file_obj)
    file_obj.close()

#Generate the TF-IDF word bag from the training set
#Load the Bunch object of segmented word vectors;
#it was generated and persisted by the code in the previous part
path = "C:\\Users\\Administrator\\Desktop\\train_set.dat"
bunch = readbunchobj(path)

# read file
def readfile(path):
    fp = open(path,"rb")
    content = fp.read()
    fp.close()
    return content

# Read stoplist
stopword_path = "C:\\Users\\Administrator\\Desktop\\hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

#Construct the TF-IDF word vector space object
tfidfspace = Bunch(target_name=bunch.target_name,label=bunch.label,
                   filenames = bunch.filenames,tdm=[],vocabulary={})

#Initialize the vector space model with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stpwrdlst,sublinear_tf=True,max_df=0.5)
transformer = TfidfTransformer()#Computes the TF-IDF weights (not used further here)
#Convert the texts to the TF-IDF matrix and keep the vocabulary separately
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_
print(tfidfspace.vocabulary)
#Persist the word bag
space_path = "C:\\Users\\Administrator\\Desktop\\tfdifspace.dat"
writebunchobj(space_path,tfidfspace)
print("OK")

2.2.6 using the naive Bayes classification module

At present, the most commonly used classification methods are the KNN nearest-neighbor algorithm, the naive Bayes algorithm and the support vector machine. KNN is simple in principle and reasonably accurate, but slow; naive Bayes gives the best results here, with high accuracy; the advantage of the support vector machine is that it supports linearly non-separable data, with moderate accuracy.

In this section we use scikit-learn's naive Bayes algorithm for text classification. The test set is randomly drawn from the documents of the training corpus, 10 documents per category, with documents under 1 KB filtered out. The training steps are: word segmentation first, then generation of the word-vector files, and finally the word-vector (TF-IDF) model. When building the test vectors we load the training-set word bag, map the test word vectors onto the dictionary of the training-set word bag, and generate the test vector space model.

The multinomial naive Bayes algorithm is then used to classify the test texts and report the classification accuracy.

1. Create and persist the test-set word bag

# -*- coding: utf-8 -*-
import sys
import os
from sklearn.datasets.base import Bunch#Introduce Bunch class
import pickle #Introducing persistence classes
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

#read Bunch object
def readbunchobj(path):
    file_obj = open(path,"rb")
    bunch = pickle.load(file_obj,encoding="utf-8")
    file_obj.close()
    return bunch

#Write in Bunch object
def writebunchobj(path,bunchobj):
    file_obj = open(path,"wb")
    pickle.dump(bunchobj,file_obj)
    file_obj.close()

# read file
def readfile(path):
    fp = open(path,"rb")
    content = fp.read()
    fp.close()

    return content

#Word vector after word segmentation Bunch object
path = "C:\\Users\\Administrator\\Desktop\\data\\test_word_bag\\test_set.dat"
bunch = readbunchobj(path)

# Read stop word list
stopword_path = "C:\\Users\\Administrator\\Desktop\\data\\train_word_bag\\hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

#Construct the TF-IDF word vector space object
tfidfspace = Bunch(target_name=bunch.target_name,label=bunch.label,
                   filenames = bunch.filenames,tdm=[],vocabulary={})

#Build test set TF-IDF vector space
testspace = Bunch(target_name=bunch.target_name,label=bunch.label,filenames=bunch.filenames,tdm=[],vocabulary={})
#Load the training-set word bag
trainbunch = readbunchobj("C:\\Users\\Administrator\\Desktop\\data\\test_word_bag\\tfdifspace.dat")
#Initialize the vector space model with TfidfVectorizer, reusing the training vocabulary
vectorizer = TfidfVectorizer(stop_words=stpwrdlst,sublinear_tf=True,max_df=0.5
                             ,vocabulary=trainbunch.vocabulary)
transformer = TfidfTransformer()#Computes the TF-IDF weights (not used further here)
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

#Creating the persistence of word bag
space_path = "C:\\Users\\Administrator\\Desktop\\data\\test_word_bag\\testspace.dat"#Word vector space saving path
writebunchobj(space_path,testspace)
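One detail worth noting about the script above: fit_transform re-estimates the IDF statistics on the test documents even though the vocabulary comes from the training bag. If the test vectors should be weighted with the training-set IDF values instead, a common alternative is to keep the fitted TfidfVectorizer from the training step and call transform on the test contents. A self-contained sketch of the difference, using made-up toy texts:

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["machine learning text classification",
               "text mining and word segmentation"]
test_texts = ["text classification with machine learning"]

# Fit on the training texts: this learns both the vocabulary and the IDF weights
vectorizer = TfidfVectorizer()
train_tdm = vectorizer.fit_transform(train_texts)

# Reusing only the vocabulary (as the script above does) refits IDF on the test set
refit = TfidfVectorizer(vocabulary=vectorizer.vocabulary_)
tdm_refit = refit.fit_transform(test_texts)

# Calling transform() on the already-fitted vectorizer keeps the training-set IDF
tdm_keep = vectorizer.transform(test_texts)

print(tdm_refit.toarray())
print(tdm_keep.toarray())   # generally different weights for the same documents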

2. Apply the multinomial naive Bayes algorithm to classify the test texts.

from sklearn.naive_bayes import MultinomialNB#Import polynomial Bayesian algorithm package
import pickle

#read Bunch object
def readbunchobj(path):
    file_obj = open(path,"rb")
    bunch = pickle.load(file_obj,encoding="utf-8")
    file_obj.close()
    return bunch

#Import training set vector space
trainpath = r"C:\Users\Administrator\Desktop\data\test_word_bag\tfdifspace.dat"
train_set = readbunchobj(trainpath)

#Import test set vector space
testpath = r"C:\Users\Administrator\Desktop\data\test_word_bag\testspace.dat"
test_set = readbunchobj(testpath)

#Application of naive Bayes algorithm
#alpha: 0.001; alpha is the additive smoothing parameter, and the smaller it is, the higher the reported accuracy here
clf = MultinomialNB(alpha= 0.001).fit(train_set.tdm,train_set.label)

#Forecast classification results
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel,file_name,expct_cate in zip(test_set.label,test_set.filenames,predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name,":Actual category:",flabel,"-->Prediction category:",expct_cate)

print("error rate",float(rate)*100/float(total),"%")

Output:

For example, the misclassified file 3143.txt has actual category education but predicted category art, and so on for the other errors.

2.2.7 evaluation of classification results

There are three basic evaluation metrics for classification algorithms in machine learning: recall, precision and the F1 score.

1. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the document library; it measures how completely the retrieval system finds relevant documents.

Recall = number of relevant documents retrieved by the system / total number of relevant documents in the system

2. Precision = number of relevant documents retrieved by the system / total number of documents retrieved by the system

The key difference between precision and recall is the denominator: precision divides by the number of documents retrieved, while recall divides by the number of relevant documents. The F1 score combines the two: F1 = 2 * precision * recall / (precision + recall).

python code

from sklearn.naive_bayes import MultinomialNB#Import polynomial Bayesian algorithm package
import pickle
from sklearn import metrics
from sklearn.metrics import precision_score # precision metric from sklearn.metrics

#read Bunch object
def readbunchobj(path):
    file_obj = open(path,"rb")
    bunch = pickle.load(file_obj,encoding="utf-8")
    file_obj.close()
    return bunch

# Define a function reporting precision, recall and F1
def metrics_result(actual,predict):
    print("precision:",metrics.precision_score(actual,predict,average='macro'))
    print("recall:", metrics.recall_score(actual, predict,average='macro'))
    print("f1-score:", metrics.f1_score(actual, predict,average='macro'))

#Import training set vector space
trainpath = r"C:\Users\Administrator\Desktop\data\test_word_bag\tfdifspace.dat"
train_set = readbunchobj(trainpath)

#Import test set vector space
testpath = r"C:\Users\Administrator\Desktop\data\test_word_bag\testspace.dat"
test_set = readbunchobj(testpath)

#Application of naive Bayes algorithm
#alpha: 0.001; alpha is the additive smoothing parameter, and the smaller it is, the higher the reported accuracy here
clf = MultinomialNB(alpha= 0.001).fit(train_set.tdm,train_set.label)

#Forecast classification results
predicted = clf.predict(test_set.tdm)

metrics_result(test_set.label,predicted)

Another way to understand precision and recall:

For example, a pond contains 1400 carp, 300 shrimp and 300 turtles, and the goal is to catch carp. We cast a big net and catch 700 carp, 200 shrimp and 100 turtles. Taking carp as the correct catch, the metrics are as follows:

Precision = 700 / (700 + 200 + 100) = 70%

Recall = 700 / 1400 = 50%

F1 value = 70% * 50% * 2 / (70% + 50%) = 58.3%

Let's see how these metrics change if we catch every carp, shrimp and turtle in the pond:

Precision = 1400 / (1400 + 300 + 300) = 70%

Recall = 1400 / 1400 = 100%

F1 value = 70% * 100% * 2 / (70% + 100%) = 82.35%
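These numbers are easy to verify with a few lines of Python (a throwaway sketch, separate from the classifier code above):

def precision_recall_f1(target_caught, total_caught, target_total):
    p = target_caught / total_caught      # precision: share of the catch that is carp
    r = target_caught / target_total      # recall: share of all carp that were caught
    f1 = 2 * p * r / (p + r)              # harmonic mean of precision and recall
    return p, r, f1

# Big net: 700 carp among 1000 animals caught, out of 1400 carp in the pond
print(precision_recall_f1(700, 1000, 1400))    # (0.7, 0.5, 0.5833...)

# Catch everything: 1400 carp among 2000 animals caught
print(precision_recall_f1(1400, 2000, 1400))   # (0.7, 1.0, 0.8235...)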


As you can see, precision is the proportion of target items among everything that was caught; recall is the proportion of the target items that were actually caught; and the F1 value is a composite metric that combines the two into a single overall measure.

Of course, we would like both the precision and the recall of retrieval results to be as high as possible, but in practice the two can conflict. In extreme cases, if we return only one result and it is correct, precision is 100% but recall is very low; if we return all documents, recall is 100% but precision is very low. So in different settings we must decide whether high precision or high recall matters more; in experimental work, a precision-recall curve can be drawn to help the analysis.
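For a binary problem with predicted scores, scikit-learn can compute such a curve directly with precision_recall_curve; here is a minimal sketch with made-up labels and scores:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up ground-truth labels and classifier scores, purely for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.35, 0.4, 0.8, 0.2, 0.65, 0.9, 0.5, 0.7, 0.3])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f}  recall={r:.2f}")

# With matplotlib installed, plt.plot(recall, precision) would draw the curve.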

