[machine learning] naive Bayesian text classification

Naive Bayes

1. Introduction to the naive Bayes algorithm
Naive Bayes is a classification algorithm based on Bayes' theorem. It estimates the prior probability of each class from the training data, uses Bayes' theorem to compute the posterior probability of each class given the observed features, and then assigns the sample to the class with the largest posterior probability.
The naive Bayes algorithm can be described as follows:
For a sample data set

D = {d_1, d_2, ..., d_n}

the feature attributes of each sample are

X = {x_1, x_2, ..., x_d}

and the class variable is

Y = {y_1, y_2, ..., y_m}

so D can be divided into m categories. Bayes' theorem can then be written as

P(Y|X) = P(X|Y) P(Y) / P(X)

where P(Y|X) is called the posterior probability, P(Y) the prior probability, and P(X|Y) the conditional probability (likelihood).
The conditional independence assumption can be formalized as

P(X|Y = y_k) = P(x_1|Y = y_k) P(x_2|Y = y_k) ... P(x_d|Y = y_k)
By computing each of these per-feature conditional probabilities we obtain the likelihood P(X|Y), and combining it with the prior P(Y) gives the posterior probability used for the classification decision.

For example, suppose we want to guess the type of a fruit. If we observe the features yellow, long and curved, we can judge that it is probably a banana. Although the events yellow, long and curved may in fact depend on each other, the naive Bayes model assumes they are independent given the class, and that is exactly where its "naivety" lies. The word corresponds to the plain English meaning of "naive": every feature in the data is treated as independently distributed, as sketched below.
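A small numeric sketch of this reasoning, with made-up probabilities (every number below is hypothetical, chosen only to show how the posterior comparison works):

# Toy naive Bayes for the fruit example; priors and likelihoods are invented for illustration
priors = {'banana': 0.5, 'orange': 0.3, 'other': 0.2}
likelihoods = {
    'banana': {'yellow': 0.8, 'long': 0.7, 'curved': 0.9},
    'orange': {'yellow': 0.7, 'long': 0.05, 'curved': 0.05},
    'other':  {'yellow': 0.3, 'long': 0.2, 'curved': 0.2},
}
observed = ['yellow', 'long', 'curved']
# The posterior is proportional to prior * product of feature likelihoods;
# the evidence P(X) is the same for every class, so it can be dropped.
scores = {}
for fruit, prior in priors.items():
    score = prior
    for feature in observed:
        score *= likelihoods[fruit][feature]
    scores[fruit] = score
print(max(scores, key=scores.get))  # prints 'banana'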
2. Naive Bayes for text classification
The experiment performs a text classification task on a Sogou news data set. The data covers the following categories: automobile, finance, technology, health, sports, education, culture, military, entertainment and fashion. Below is an excerpt of the Sogou news data:

IPython notebook:

import pandas as pd
import jieba
import numpy
#pip install jieba
#read in data
df_news = pd.read_csv('./data/val.txt',sep='\t',names=['category','theme','URL','content'],encoding='utf-8')
df_news = df_news.dropna()
df_news.head()
#Segment each article into words with the jieba tokenizer
content = df_news.content.values.tolist()
print (content[1000])
content_S = []
for line in content:
    current_segment = jieba.lcut(line)
    if len(current_segment) > 1 and current_segment != ['\r\n']: #skip lines that are only a newline character
        content_S.append(current_segment)
#Store the segmented articles in a DataFrame
content_S[1000]
df_content=pd.DataFrame({'content_S':content_S})
df_content.head()
#Load the stop word list
stopwords=pd.read_csv("stopwords.txt",index_col=False,sep="\t",quoting=3,names=['stopword'], encoding='utf-8')
stopwords.head(20)
def drop_stopwords(contents,stopwords):
    contents_clean = []
    all_words = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
            all_words.append(str(word))
        contents_clean.append(line_clean)
    return contents_clean,all_words
    #print (contents_clean)
        
contents = df_content.content_S.values.tolist()    
stopwords = stopwords.stopword.values.tolist()
contents_clean,all_words = drop_stopwords(contents,stopwords)

df_content=pd.DataFrame({'contents_clean':contents_clean})
df_content.head()

df_all_words=pd.DataFrame({'all_words':all_words})
df_all_words.head()

#Count word frequencies; named aggregation replaces the dict-style agg removed in newer pandas
words_count = df_all_words.groupby(by=['all_words'])['all_words'].agg(count=numpy.size)
words_count = words_count.reset_index().sort_values(by=["count"], ascending=False)
words_count.head()

from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

wordcloud=WordCloud(font_path="./data/simhei.ttf",background_color="white",max_font_size=80)
word_frequence = {x[0]:x[1] for x in words_count.head(100).values}
wordcloud=wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
#Extract keywords with TF-IDF
import jieba.analyse
index = 2400
print (df_news['content'][index])
content_S_str = "".join(content_S[index])  
print ("  ".join(jieba.analyse.extract_tags(content_S_str, topK=5, withWeight=False)))
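jieba.analyse.extract_tags ranks the words of an article by their TF-IDF weight (jieba ships with a pre-computed IDF table). In the standard formulation, the weight of a term t in document d, for a corpus of N documents, is

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the number of times t occurs in d and df(t) is the number of documents containing t, so words that are frequent in this article but rare across the corpus score highest.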
#LDA topic model (input format: a list of token lists, i.e. the whole corpus already segmented)
from gensim import corpora, models, similarities
import gensim
#Build the word-to-id mapping (the bag-of-words dictionary)
dictionary = corpora.Dictionary(contents_clean)
corpus = [dictionary.doc2bow(sentence) for sentence in contents_clean]
#As with K-means, the number of topics has to be chosen by hand
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
for topic in lda.print_topics(num_topics=20, num_words=5):
    print (topic[1])
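To inspect which topics a single article is assigned to, gensim can return the per-document topic distribution; a small sketch (document index 0 is an arbitrary choice):

#List of (topic_id, probability) pairs for the first document
print(lda.get_document_topics(corpus[0]))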
#Attach the labels to build the training data
df_train=pd.DataFrame({'contents_clean':contents_clean,'label':df_news['category']})
df_train.tail()
df_train.label.unique()
label_mapping = {"automobile": 1, "finance": 2, "technology": 3, "health": 4, "sports": 5, "education": 6, "culture": 7, "military": 8, "entertainment": 9, "fashion": 0}
df_train['label'] = df_train['label'].map(label_mapping)
df_train.head()
#Split the data set into training and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df_train['contents_clean'].values, df_train['label'].values, random_state=1)
words = []
for line_index in range(len(x_train)):
    try:
        words.append(' '.join(x_train[line_index]))
    except Exception:
        print (line_index)
words[0]
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.get_feature_names_out())
print(cv_fit.toarray())
print(cv_fit.toarray().sum(axis=0))
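To make the bag-of-words matrix easier to read, the counts can be put into a DataFrame with the vocabulary as column names (a small sketch reusing cv and cv_fit from the cell above):

#Rows are documents, columns are vocabulary terms, values are raw counts
print(pd.DataFrame(cv_fit.toarray(), columns=cv.get_feature_names_out()))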
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer(ngram_range=(1,4))
cv_fit=cv.fit_transform(texts)

print(cv.get_feature_names_out())
print(cv_fit.toarray())
print(cv_fit.toarray().sum(axis=0))
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer='word', max_features=4000,  lowercase = False)
vec.fit(words)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(words), y_train)

test_words = []
for line_index in range(len(x_test)):
    try:
        test_words.append(' '.join(x_test[line_index]))
    except Exception:
        print (line_index)
test_words[0]
classifier.score(vec.transform(test_words), y_test)
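MultinomialNB applies Laplace smoothing (alpha=1.0 by default) so that a word never seen in some class does not zero out the whole posterior; the smoothing strength is a tunable hyperparameter. A small sketch, where alpha=0.1 is an arbitrary value rather than a recommendation:

classifier_low_alpha = MultinomialNB(alpha=0.1)  #weaker smoothing than the default alpha=1.0
classifier_low_alpha.fit(vec.transform(words), y_train)
print(classifier_low_alpha.score(vec.transform(test_words), y_test))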
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', max_features=4000,  lowercase = False)
vectorizer.fit(words)
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vectorizer.transform(words), y_train)
classifier.score(vectorizer.transform(test_words), y_test)
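As a sanity check, the TF-IDF classifier can be applied to a brand-new piece of text. The sentence below is made up (the corpus is Chinese, so the example sentence is too), and the preprocessing simply reuses jieba, the stop word list and label_mapping defined above:

#Map numeric predictions back to category names
inverse_mapping = {v: k for k, v in label_mapping.items()}
new_text = "这款新车的油耗很低，操控性能也十分出色"  #hypothetical automobile-related sentence
new_tokens = [w for w in jieba.lcut(new_text) if w not in stopwords]
new_features = vectorizer.transform([' '.join(new_tokens)])
print(inverse_mapping[classifier.predict(new_features)[0]])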

I have already run the notebook; the final accuracy is the score printed by the last cell above.

Import the related libraries and run it yourself!
