[NLP] keyword co-occurrence / attribute co-occurrence matrix


[co-occurrence] A co-occurrence matrix simply uses the number of times two words appear together as the value of each cell. The first row and the first column hold the words from the word list, and the diagonal is generally set to 0, that is, a word is not counted against itself. If M is the matrix of counts without the label row and column (0-based), then M[i][j] is the number of times the (i+1)-th word and the (j+1)-th word appear together in the document set, and M[i][j] = M[j][i].
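
As a small illustration of this definition, here is a minimal sketch that counts co-occurrences over a toy document set. The names `docs`, `keywords` and `cooccurrence` are made up for this example and do not come from the reference code.

```python
import numpy as np

def cooccurrence(docs, keywords):
    """Return an n x n matrix M where M[i][j] is the number of documents
    that contain both keywords[i] and keywords[j]; the diagonal stays 0."""
    n = len(keywords)
    M = np.zeros((n, n), dtype=int)
    for doc in docs:
        present = set(doc)
        for i in range(n):
            for j in range(i + 1, n):
                if keywords[i] in present and keywords[j] in present:
                    M[i, j] += 1
                    M[j, i] += 1  # keep the matrix symmetric
    return M

# Toy data, invented for illustration
docs = [["nlp", "python", "matrix"], ["nlp", "matrix"], ["python"]]
keywords = ["nlp", "python", "matrix"]
M = cooccurrence(docs, keywords)
```

Here "nlp" and "matrix" share two documents, so `M[0][2]` and `M[2][0]` are both 2, while the diagonal stays at 0.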

1. Build keyword matrix

Reference code found online:
Thanks to "Python building keyword co-occurrence matrix"
There are two inputs:
data: can be understood as a collection of documents of the form [[], [], []], where each inner list represents one article
Keyword list: a list of keywords; the co-occurrence matrix is also initialized from it
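
The helper functions of the reference code are not reproduced in this post. The sketch below is only a guess at what they might do, inferred from how they are called here: the bodies are assumptions, the real reference code may differ, and `format_data` is left out because its behavior is not described.

```python
def build_matirx(set_key_list):
    """Create an empty (n+1) x (n+1) grid; row 0 and column 0 are for labels."""
    edge = len(set_key_list) + 1
    return [['' for _ in range(edge)] for _ in range(edge)]

def init_matrix(set_key_list, matrix):
    """Write the keywords into the first row and the first column."""
    for i, key in enumerate(set_key_list, start=1):
        matrix[0][i] = key
        matrix[i][0] = key
    return matrix

def count_matrix(matrix, formated_data):
    """Fill each cell with the number of documents containing both keywords."""
    for row in range(1, len(matrix)):
        for col in range(1, len(matrix)):
            if row == col:
                matrix[col][row] = '0'  # a keyword is not counted with itself
            else:
                a, b = matrix[0][row], matrix[col][0]
                hits = sum(1 for doc in formated_data if a in doc and b in doc)
                matrix[col][row] = str(hits)
    return matrix

# Toy inputs, invented for illustration
keywords = ['nlp', 'python', 'matrix']
documents = [['nlp', 'python'], ['nlp', 'matrix', 'python'], ['matrix']]
result = count_matrix(init_matrix(keywords, build_matirx(keywords)), documents)
```

Counts are stored as strings so that every cell, labels included, can be written with the same `%s` format when the matrix is saved.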
I read the required data directly from a local database and write the resulting matrix to a CSV file.
Only main() was changed for this; everything else is the same as the reference code:

import numpy as np

def mainkeyword():
    output_path = r'Casually naming.csv'
    # data = reader.readxls_col(keyword_path)[0]
    # Raw data, read from the local database
    data = dbdata()
    # Keyword list, also read from the database
    set_key_list = dbkey()
    # set_key_list = get_set_key(data)
    formated_data = format_data(data)
    matrix = build_matirx(set_key_list)
    matrix = init_matrix(set_key_list, matrix)
    result_matrix = count_matrix(matrix, formated_data)

    # One '%s' per column, joined by commas, so each row becomes one CSV line
    np.savetxt(output_path, result_matrix, fmt=('%s,'*len(matrix))[:-1])


The results are as follows. Many words probably appear only once, so the matrix as a whole is sparse.
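
To put a number on how sparse such a matrix is, one could measure the share of zero cells outside the diagonal. This is a minimal sketch; `zero_fraction` and the `example` matrix are hypothetical and only assume the labelled layout used in this post (keywords in the first row and column, counts stored as strings).

```python
import numpy as np

def zero_fraction(labelled_matrix):
    """Share of zero cells among the off-diagonal counts of a labelled matrix."""
    # Drop the label row and label column, then convert the counts to int
    counts = np.array([[int(v) for v in row[1:]] for row in labelled_matrix[1:]])
    n = counts.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)  # mask that ignores the diagonal
    return float((counts[off_diagonal] == 0).mean())

# Invented example in the same layout as result_matrix
example = [
    ["", "a", "b", "c"],
    ["a", "0", "1", "0"],
    ["b", "1", "0", "0"],
    ["c", "0", "0", "0"],
]
frac = zero_fraction(example)
```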

2. Construct attribute co-occurrence matrix

While I was helping with a project, the other party brought up the concept of an attribute: several words can be grouped under one attribute. A co-occurrence matrix built on attributes therefore has far fewer zero cells and higher co-occurrence counts.
The keyword list in the code can be turned into a dictionary of attributes, in which each attribute maps to the keywords that belong to it.
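
For illustration, such an attribute dictionary might look like the one below. The attribute names, keywords and the helper `attr_pair_count` are invented for this sketch; the counting rule mirrors the nested loops of `count_matrix_attr`, where a document adds one count for every keyword pair it contains.

```python
# Hypothetical attribute dictionary: attribute name -> keywords belonging to it
attr_dict = {
    "price": ["cheap", "expensive", "discount"],
    "quality": ["durable", "fragile"],
}

def attr_pair_count(doc, attr_a, attr_b, attr_dict):
    """Count keyword pairs (one keyword from each attribute) found in doc."""
    return sum(
        1
        for w1 in attr_dict[attr_a]
        for w2 in attr_dict[attr_b]
        if w1 in doc and w2 in doc
    )

# Toy documents, invented for illustration
docs = [["cheap", "durable"], ["discount", "cheap", "fragile"], ["expensive"]]
total = sum(attr_pair_count(d, "price", "quality", attr_dict) for d in docs)
```

Note that the second document contributes two counts (cheap/fragile and discount/fragile), which is one reason attribute counts come out higher than plain keyword counts.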

Compared with the keyword code above, only the function that counts the co-occurrences and main() need to be changed:

def count_matrix_attr(matrix, formated_data):
    '''Count the co-occurrences of each pair of attributes.'''
    # Attribute dictionary: attribute -> keywords belonging to it
    zd = dbattr()[0]
    # Index 0 holds the labels ([0][0] is empty), so both loops start at 1
    for row in range(1, len(matrix)):
        for col in range(1, len(matrix)):
            if matrix[0][row] == matrix[col][0]:
                # Same attribute on both axes: the matrix diagonal is 0
                matrix[col][row] = str(0)
            else:
                counter = 0
                # Check every document against every keyword pair formed
                # from the two attributes; a document adds one count per
                # matching pair, so it can contribute more than once
                for ech in formated_data:
                    for w1 in zd[matrix[0][row]]:
                        for w2 in zd[matrix[col][0]]:
                            if w1 in ech and w2 in ech:
                                counter += 1
                matrix[col][row] = str(counter)
    return matrix

def main():
    output_path = r'Casually naming+1.csv'
    # data = reader.readxls_col(keyword_path)[0]
    # Raw data
    data = dbdata()
    # Attribute list, used as the row and column labels
    attr_list = dbattr()[1]
    # set_key_list = get_set_key(data)
    formated_data = format_data(data)
    matrix = build_matirx_attr(attr_list)
    matrix = init_matrix_attr(attr_list, matrix)
    result_matrix = count_matrix_attr(matrix, formated_data)

    np.savetxt(output_path, result_matrix, fmt=('%s,'*len(matrix))[:-1], encoding='utf-8')

The results are as follows

------------------------------2020-02-11 By EchoZhang---------------


Tags: Attribute Python Database encoding

Posted on Tue, 11 Feb 2020 04:29:13 -0800 by infini