PageRank - Principle and Code Analysis

Preface

Because recently to prepare for graduation design, the paper design to extract the key words in sentences, to use the TextRank algorithm. It was improved by PageRank.

source

Used in search engines to rank the importance of web pages.
The founder of the book, Mr. Page and Brin, referred to the academic method of evaluating the importance of papers. The more times they quoted, the more important they were.
PageRank's ideas are similar:

• If a page is linked by many other pages, it is important and has high weight.
• If a page with high PageRank value links to a page, the weight of the page is also increased.

Model

Web pages are abstracted into nodes. For example, four pages A,B,C,D. Pictured For A, A points to B, C and D. Because they are divided into three points, each branch can only calculate 1/3 of the weight.
The input format is in json format, as follows:

{"A":["B","C","D"], "B":["A","C"], "C":["D"], "D":["A","B"]}

Write the above figure in the form of transition matrix M:
M=[A→AB→AC→AD→AA→BB→BC→BD→BA→CB→CC→CD→CA→DB→DC→DD→D] M = \begin{bmatrix} A\rightarrow A & B\rightarrow A & C\rightarrow A &D\rightarrow A \\ A\rightarrow B & B\rightarrow B & C\rightarrow B &D\rightarrow B \\ A\rightarrow C & B\rightarrow C & C\rightarrow C &D\rightarrow C \\ A\rightarrow D & B\rightarrow D & C\rightarrow D &D\rightarrow D \\ \end{bmatrix} M=⎣⎢⎢⎡​A→AA→BA→CA→D​B→AB→BB→CB→D​C→AC→BC→CC→D​D→AD→BD→CD→D​⎦⎥⎥⎤​
The corresponding example M is:

[[0.         0.5        0.         0.5       ]
[0.33333333 0.         0.         0.5       ]
[0.33333333 0.5        0.         0.        ]
[0.33333333 0.         1.         0.        ]]

Build M matrix code:

def to_matrix(d):
var = get_all_variables(d)
n = len(var)
# {'A': 0, 'B': 1, 'C': 2, 'D': 3}
char2id = {var[i]:i for i in range(n)}
M = np.zeros((n,n))
# Construct M-matrix, first statistic, then average
#[A->A  B->A  C->A  D->A]
#[A-> B-> B C-> B D-> B], etc.
# Statistics
for ch in var: #For example, A
for to in d[ch]: #["B","C","D"]
y = char2id[ch]
x = char2id[to]
M[x][y] += 1
'''M:
[[0. 1. 0. 1.]
[1. 0. 0. 1.]
[1. 1. 0. 0.]
[1. 0. 1. 0.]]
'''
#Average column dimensions
for j in range(n):
sum1 = 0
for i in range(n):
sum1 += M[i][j]
if sum1>0:
for i in range(n):
M[i][j] /= sum1
'''M:
[[0.         0.5        0.         0.5       ]
[0.33333333 0.         0.         0.5       ]
[0.33333333 0.5        0.         0.        ]
[0.33333333 0.         1.         0.        ]]
'''
return np.matrix(M)

Calculation Initial weight is 1/N
As can be seen from the figure above, matrix multiplication can be used to calculate quickly.

The experimental results are as follows: Code implementation:

import json
import numpy as np
def to_matrix(d):
var = get_all_variables(d)
n = len(var)
# {'A': 0, 'B': 1, 'C': 2, 'D': 3}
char2id = {var[i]:i for i in range(n)}
M = np.zeros((n,n))
# Construct M-matrix, first statistic, then average
#[A->A  B->A  C->A  D->A]
#[A-> B-> B C-> B D-> B], etc.
# Statistics
for ch in var: #For example, A
for to in d[ch]: #["B","C","D"]
y = char2id[ch]
x = char2id[to]
M[x][y] += 1
'''M:
[[0. 1. 0. 1.]
[1. 0. 0. 1.]
[1. 1. 0. 0.]
[1. 0. 1. 0.]]
'''
#Average column dimensions
for j in range(n):
sum1 = 0
for i in range(n):
sum1 += M[i][j]
if sum1>0:
for i in range(n):
M[i][j] /= sum1
'''M:
[[0.         0.5        0.         0.5       ]
[0.33333333 0.         0.         0.5       ]
[0.33333333 0.5        0.         0.        ]
[0.33333333 0.         1.         0.        ]]
'''
return np.matrix(M)

def diedai(M, p):#iteration
#Matrix multiplication uses matmul, not multily
#(4,4) * (4,1)
return np.matmul(M,p)

def get_all_variables(d):
s = set()
#Add all elements to the collection
for k,v in d.items():
for t in v:
var = list(s)
var.sort() #['A', 'B', 'C', 'D']
return var

def page_rank(d,m):
#Iteration m times
var = get_all_variables(d)
M = to_matrix(d)
tmp = 1.0/len(var)
p = np.ones((len(var),1))*tmp
p = np.matrix(p) #matrix can be directly p[0,0] in this way
print("iteration\tPR(A)\tPR(B)\tPR(C)\tPR(D)")
print("{}\t{:.4f}\t{:.4f}\t{:.4f}\t".format(0,p[0,0],p[1,0],p[2,0],p[3,0]))
for i in range(m):
p = diedai(M,p) #Keep decimal numbers in:.4f mode
print("{}\t{:.4f}\t{:.4f}\t{:.4f}\t".format(i+1, p[0, 0], p[1, 0], p[2, 0], p[3, 0]))
'''
m=1 At that time, p:
[[0.25      ]
[0.20833333]
[0.20833333]
[0.33333333]]
'''
return p

if __name__=="__main__":
# str1 = input().strip()
str1 = '''{"A":["B","C","D"], "B":["A","C"], "C":["D"], "D":["A","B"]}'''