PageRank - Principle and Code Analysis

Preface

I have recently been preparing my graduation project, which involves extracting keywords from sentences using the TextRank algorithm. TextRank is adapted from PageRank, so I start with PageRank here.

Origin

PageRank is used in search engines to rank the importance of web pages.
Its founders, Page and Brin, borrowed the academic method of evaluating a paper's importance by its citations: the more often a paper is cited, the more important it is.
PageRank's idea is similar:

  • If a page is linked to by many other pages, it is important and gets a high weight.
  • If a page with a high PageRank value links to another page, that page's weight also increases.

Model

Web pages are abstracted into nodes. For example, consider four pages A, B, C and D (the illustrating figure is omitted here).

A points to B, C and D. Because A's weight is split across its three out-links, each link carries only 1/3 of the weight.
The input is given in JSON format, as follows:

{"A":["B","C","D"], "B":["A","C"], "C":["D"], "D":["A","B"]}
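For instance, the adjacency list can be loaded and inspected like this (a minimal sketch; the `out_degree` dict is my own addition for illustration):

```python
import json

s = '{"A":["B","C","D"], "B":["A","C"], "C":["D"], "D":["A","B"]}'
d = json.loads(s)
# d["A"] lists the pages A links to; A has out-degree 3,
# so each of its out-links will carry weight 1/3
out_degree = {page: len(targets) for page, targets in d.items()}
```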

Write the above figure in the form of transition matrix M:
M = \begin{bmatrix} A\rightarrow A & B\rightarrow A & C\rightarrow A & D\rightarrow A \\ A\rightarrow B & B\rightarrow B & C\rightarrow B & D\rightarrow B \\ A\rightarrow C & B\rightarrow C & C\rightarrow C & D\rightarrow C \\ A\rightarrow D & B\rightarrow D & C\rightarrow D & D\rightarrow D \end{bmatrix}
The corresponding example M is:

    [[0.         0.5        0.         0.5       ]
     [0.33333333 0.         0.         0.5       ]
     [0.33333333 0.5        0.         0.        ]
     [0.33333333 0.         1.         0.        ]]
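A quick sanity check (my own addition, not from the original post): since every page distributes its full weight over its out-links, each column of M should sum to 1.

```python
import numpy as np

# The transition matrix from the example above
M = np.array([[0.0, 0.5, 0.0, 0.5],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0],
              [1/3, 0.0, 1.0, 0.0]])

# Each column is a probability distribution over the linked-to pages
assert np.allclose(M.sum(axis=0), 1.0)
```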

Code to build the matrix M:

def to_matrix(d):
    var = get_all_variables(d)
    n = len(var)
    # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
    char2id = {var[i]:i for i in range(n)}
    M = np.zeros((n,n))
    # Construct M: first count the links, then normalize each column
    # [A->A  B->A  C->A  D->A]
    # [A->B  B->B  C->B  D->B], etc.
    # Count the links
    for ch in var: #For example, A
        for to in d[ch]: #["B","C","D"]
            y = char2id[ch]
            x = char2id[to]
            M[x][y] += 1
    '''M:
    [[0. 1. 0. 1.]
     [1. 0. 0. 1.]
     [1. 1. 0. 0.]
     [1. 0. 1. 0.]]
    '''
    # Normalize each column so it sums to 1
    for j in range(n):
        sum1 = 0
        for i in range(n):
            sum1 += M[i][j]
        if sum1>0:
            for i in range(n):
                M[i][j] /= sum1
    '''M:
    [[0.         0.5        0.         0.5       ]
     [0.33333333 0.         0.         0.5       ]
     [0.33333333 0.5        0.         0.        ]
     [0.33333333 0.         1.         0.        ]]
    '''
    return np.matrix(M)
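As an aside, the count-then-normalize loops above can also be written with NumPy broadcasting. This is a sketch of the same normalization step, not the original author's code:

```python
import numpy as np

# Raw link counts from the example (the intermediate matrix above)
counts = np.array([[0., 1., 0., 1.],
                   [1., 0., 0., 1.],
                   [1., 1., 0., 0.],
                   [1., 0., 1., 0.]])

col_sums = counts.sum(axis=0)      # out-degree of each page
col_sums[col_sums == 0] = 1.0      # guard against pages with no out-links
M = counts / col_sums              # broadcasting divides each column by its sum
```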

Calculation


The initial weight of each page is 1/N. Each iteration is a matrix multiplication, p_{t+1} = M p_t, so the computation is fast (the illustrating figure and result table are omitted here); the code below prints the PR values after each iteration.

Code implementation:

import json
import numpy as np
def to_matrix(d):
    var = get_all_variables(d)
    n = len(var)
    # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
    char2id = {var[i]:i for i in range(n)}
    M = np.zeros((n,n))
    # Construct M: first count the links, then normalize each column
    # [A->A  B->A  C->A  D->A]
    # [A->B  B->B  C->B  D->B], etc.
    # Count the links
    for ch in var: #For example, A
        for to in d[ch]: #["B","C","D"]
            y = char2id[ch]
            x = char2id[to]
            M[x][y] += 1
    '''M:
    [[0. 1. 0. 1.]
     [1. 0. 0. 1.]
     [1. 1. 0. 0.]
     [1. 0. 1. 0.]]
    '''
    # Normalize each column so it sums to 1
    for j in range(n):
        sum1 = 0
        for i in range(n):
            sum1 += M[i][j]
        if sum1>0:
            for i in range(n):
                M[i][j] /= sum1
    '''M:
    [[0.         0.5        0.         0.5       ]
     [0.33333333 0.         0.         0.5       ]
     [0.33333333 0.5        0.         0.        ]
     [0.33333333 0.         1.         0.        ]]
    '''
    return np.matrix(M)

def diedai(M, p):  # one iteration ("diedai" means "iterate")
    # Use matmul for matrix multiplication, not element-wise multiply
    # (4,4) * (4,1)
    return np.matmul(M, p)

def get_all_variables(d):
    s = set()
    # Add every page name (both sources and targets) to the set
    for k,v in d.items():
        s.add(k)
        for t in v:
            s.add(t)
    var = list(s)
    var.sort() #['A', 'B', 'C', 'D']
    return var

def page_rank(d, m):
    # Run m iterations
    var = get_all_variables(d)
    M = to_matrix(d)
    tmp = 1.0/len(var)
    p = np.ones((len(var),1))*tmp
    p = np.matrix(p)  # np.matrix supports p[0,0]-style indexing
    print("iteration\tPR(A)\tPR(B)\tPR(C)\tPR(D)")
    print("{}\t{:.4f}\t{:.4f}\t{:.4f}\t{:.4f}".format(0, p[0,0], p[1,0], p[2,0], p[3,0]))
    for i in range(m):
        p = diedai(M, p)
        # Print with four decimal places
        print("{}\t{:.4f}\t{:.4f}\t{:.4f}\t{:.4f}".format(i+1, p[0, 0], p[1, 0], p[2, 0], p[3, 0]))
    '''
    After one iteration (m=1), p is:
    [[0.25      ]
     [0.20833333]
     [0.20833333]
     [0.33333333]]
     '''
    return p


if __name__=="__main__":
    # str1 = input().strip()
    str1 = '''{"A":["B","C","D"], "B":["A","C"], "C":["D"], "D":["A","B"]}'''
    d = json.loads(str1)
    res = page_rank(d,10)
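Instead of a fixed number of iterations, one can also iterate until the weights stop changing. A minimal sketch under that idea (the helper name and tolerance are my own, not from the original post):

```python
import numpy as np

def page_rank_converged(M, tol=1e-6, max_iter=100):
    # Iterate p = M @ p until the total change drops below tol
    n = M.shape[0]
    p = np.full((n, 1), 1.0 / n)   # initial weight 1/N per page
    for _ in range(max_iter):
        p_next = M @ p
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

M = np.array([[0.0, 0.5, 0.0, 0.5],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0],
              [1/3, 0.0, 1.0, 0.0]])
p = page_rank_converged(M)
```

Because each column of M sums to 1, the total weight stays at 1 across iterations; the vector converges to the stationary distribution of the link graph.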

Supplement

The above is the simplest form of PageRank. In practice a smoothing (damping) factor d is applied, usually d = 0.85:

PR(A) = (1-d) + d\left(\frac{PR(T_1)}{L(T_1)} + \dots + \frac{PR(T_n)}{L(T_n)}\right)

where T_1, ..., T_n are the pages that link to A, and L(T_i) is the number of out-links of page T_i.
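Assuming M is the column-normalized transition matrix from earlier, the damped update can be sketched as follows. Note that with this form of the formula the ranks sum to N rather than 1 (the function name and iteration count are my own choices):

```python
import numpy as np

def damped_page_rank(M, d=0.85, n_iter=50):
    # PR = (1 - d) + d * (M @ PR), applied element-wise to the rank vector
    n = M.shape[0]
    pr = np.ones((n, 1))            # start every page at rank 1
    for _ in range(n_iter):
        pr = (1 - d) * np.ones((n, 1)) + d * (M @ pr)
    return pr

M = np.array([[0.0, 0.5, 0.0, 0.5],
              [1/3, 0.0, 0.0, 0.5],
              [1/3, 0.5, 0.0, 0.0],
              [1/3, 0.0, 1.0, 0.0]])
pr = damped_page_rank(M)
```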

Reference resources: PageRank Notes


Posted on Mon, 07 Oct 2019 04:37:36 -0700 by agulaid