[Co-occurrence] A co-occurrence matrix is simply a matrix built from how often two words appear together. The first row and the first column of the matrix hold all the words in the word list, and the diagonal is generally set to 0, that is, a word is not counted as co-occurring with itself. If the matrix is M, then M[i][j] is the number of times the (i+1)-th word and the (j+1)-th word appear together in the document set, and M[i][j] = M[j][i].
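To make the definition concrete, here is a minimal sketch in plain Python. The function name `cooccurrence_matrix` and the toy documents are my own assumptions for illustration, not part of the reference code, and the sketch counts each pair at most once per document.

```python
from itertools import combinations


def cooccurrence_matrix(documents, keywords):
    """Return M where M[i][j] counts documents containing keywords[i] and keywords[j]."""
    index = {word: i for i, word in enumerate(keywords)}
    size = len(keywords)
    M = [[0] * size for _ in range(size)]
    for doc in documents:
        # A set ensures a word repeated inside one document is counted once.
        present = sorted({index[w] for w in doc if w in index})
        for i, j in combinations(present, 2):
            M[i][j] += 1
            M[j][i] += 1  # mirror the update so that M[i][j] == M[j][i]
    return M  # the diagonal is never updated, so it stays 0


# Hypothetical toy data, for illustration only.
documents = [["python", "matrix", "csv"], ["python", "csv"], ["matrix"]]
keywords = ["python", "matrix", "csv"]
M = cooccurrence_matrix(documents, keywords)
for row in M:
    print(row)
```

Here "python" and "csv" appear together in two documents, so M[0][2] and M[2][0] are both 2.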
Reference code found online, with thanks:
"Building a keyword co-occurrence matrix in Python"
There are two inputs:
data: can be understood as a collection of documents, [[...], [...], [...]]; each inner list represents one article
keyword list: a list of keywords; the co-occurrence matrix is also initialized from it
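The reference helpers are not reproduced in this post, so the following is only a sketch of the shape of the two inputs and of the labelled matrix they lead to. The function name `make_labeled_matrix` and the sample values are hypothetical.

```python
def make_labeled_matrix(keyword_list):
    """Build an (n+1) x (n+1) grid whose first row and first column hold the keywords."""
    size = len(keyword_list) + 1
    matrix = [["" for _ in range(size)] for _ in range(size)]
    matrix[0][1:] = keyword_list              # header row
    for i, word in enumerate(keyword_list, start=1):
        matrix[i][0] = word                   # header column
    return matrix                             # matrix[0][0] stays empty


# Hypothetical inputs: one inner list per article, plus the keyword list.
data = [["python", "matrix"], ["python", "csv"]]
keyword_list = ["python", "matrix", "csv"]

matrix = make_labeled_matrix(keyword_list)
for row in matrix:
    print(row)
```

The counts are then filled into the cells from index 1 onwards, which is why the counting loops start at 1.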
I read the required data directly from a local database and finally write the matrix out as a CSV file.
Only main() was changed for this; everything else is the same as the reference code.
    import numpy as np

    def mainkeyword():
        output_path = r'Casually naming.csv'
        # data = reader.readxls_col(keyword_path)
        # Raw data
        data = dbdata()
        # Keyword list
        set_key_list = dbkey()
        # set_key_list = get_set_key(data)
        formated_data = format_data(data)
        matrix = build_matirx(set_key_list)
        matrix = init_matrix(set_key_list, matrix)
        result_matrix = count_matrix(matrix, formated_data)
        # One '%s' per column, separated by commas
        np.savetxt(output_path, result_matrix, fmt=('%s,' * len(matrix))[:-1])

dbkey() and dbdata() return the keyword list and the raw data read from the local database.
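The fmt argument passed to np.savetxt above is a little cryptic, so here is a small sketch of what it evaluates to. When fmt contains one specifier per column, np.savetxt uses it as the format of a whole row; the sketch imitates that with plain string formatting. The `rows` values are made up for illustration.

```python
# Hypothetical 3 x 3 labelled matrix, already converted to strings.
rows = [["", "python", "csv"],
        ["python", "0", "2"],
        ["csv", "2", "0"]]

# '%s,' repeated once per column, then the trailing comma is sliced off.
fmt = ('%s,' * len(rows))[:-1]

# Each row is formatted as a whole, giving one CSV line per matrix row.
lines = [fmt % tuple(row) for row in rows]
print(fmt)
print("\n".join(lines))
```

This only works because the matrix is square, so len(matrix) equals the number of columns.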
The result: many words probably appear only once, so the matrix as a whole is sparse.
While helping with a project, the other party brought up the concept of an attribute: several words can be grouped under one attribute. Building an attribute co-occurrence matrix instead can greatly reduce the number of zero entries in the matrix and increase the co-occurrence counts.
The keyword list in the code can be replaced by a dictionary of attributes, in which each attribute name maps to the keywords that belong to it.
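As an illustration of that shape, here is a sketch with made-up attribute names and keywords; the real dictionary comes from dbattr(), whose contents are not shown in this post.

```python
# Hypothetical attribute dictionary: attribute name -> keywords grouped under it.
attr_dict = {
    "language": ["python", "java"],
    "format": ["csv", "xls"],
}

# The attribute names become the labels of the matrix.
attr_list = list(attr_dict)

# Reverse lookup, handy for checking which attribute a keyword belongs to.
keyword_to_attr = {kw: attr for attr, kws in attr_dict.items() for kw in kws}

print(attr_list)
print(keyword_to_attr)
```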
Compared with the keyword code above, only the function that counts co-occurrences in the matrix and main() need to change:
    def count_matrix_attr(matrix, formated_data):
        '''Count the co-occurrences of each pair of attributes.'''
        zd = dbattr()  # attribute dictionary: attribute -> list of keywords
        for row in range(1, len(matrix)):
            # Walk along the first row of the matrix, skipping index 0
            for col in range(1, len(matrix)):
                # Walk down the first column of the matrix, skipping index 0.
                # This skips matrix[0][0], which is empty and not an attribute.
                if matrix[0][row] == matrix[col][0]:
                    # Same attribute on row and column: the count is 0,
                    # so the diagonal of the matrix is 0
                    matrix[col][row] = str(0)
                else:
                    counter = 0  # reset the counter
                    for ech in formated_data:
                        # For each formatted document, pair every keyword of the
                        # row attribute with every keyword of the column attribute
                        # and check whether both occur in the document
                        for w1 in zd[matrix[0][row]]:
                            for w2 in zd[matrix[col][0]]:
                                if w1 in ech and w2 in ech:
                                    counter += 1
                    matrix[col][row] = str(counter)
        return matrix
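One detail of the loop above is worth spelling out: it adds 1 for every keyword pair found together, so a single document can contribute more than once to the same attribute pair. The sketch below isolates that rule on toy data and compares it with an alternative that counts each document at most once. Both function names and all sample values are assumptions made for illustration.

```python
def count_pairs(doc_list, words_a, words_b):
    """Rule used above: +1 per keyword pair found together in a document."""
    return sum(1 for doc in doc_list for w1 in words_a for w2 in words_b
               if w1 in doc and w2 in doc)


def count_documents(doc_list, words_a, words_b):
    """Alternative: +1 per document containing at least one keyword of each attribute."""
    return sum(1 for doc in doc_list
               if any(w in doc for w in words_a) and any(w in doc for w in words_b))


# Hypothetical attribute dictionary and documents.
attr_dict = {"language": ["python", "java"], "format": ["csv", "xls"]}
docs = [["python", "java", "csv"], ["python", "xls"], ["java"]]

pairs = count_pairs(docs, attr_dict["language"], attr_dict["format"])
per_doc = count_documents(docs, attr_dict["language"], attr_dict["format"])
print(pairs, per_doc)
```

The first document contains both "python" and "java" next to "csv", so it contributes 2 under the pair rule but only 1 under the per-document rule. Which of the two is appropriate depends on how the counts will be interpreted.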
    def main():
        output_path = r'Casually naming+1.csv'
        # data = reader.readxls_col(keyword_path)
        # Raw data
        data = dbdata()
        # dbkey()
        # dbdata()
        # Attribute list
        attr_list = dbattr()
        # set_key_list = get_set_key(data)
        formated_data = format_data(data)
        matrix = build_matirx_attr(attr_list)
        matrix = init_matrix_attr(attr_list, matrix)
        result_matrix = count_matrix_attr(matrix, formated_data)
        np.savetxt(output_path, result_matrix, fmt=('%s,' * len(matrix))[:-1], encoding='utf-8')
The results are as follows
------------------------------2020-02-11 By EchoZhang---------------