Google PageRank Algorithm

Reading How Google Finds Your Needle in the Web’s Haystack I was surprised by the simplicity of the math underlying the google PageRank algorithm, and the ease with which it seemed to be efficiently implementable. Being able to do a google-style ranking seems useful for a wide range of cases, and since I had wanted to take a look at python for numerics for some time, I chose to give it a shot (note that there by now exists a python implementation using the numarray module).

Contents:

Note that PageRank™ is a brand of Google and that the described algorithm is covered by numerous patents. I hope the powers that be will nevertheless ignore/forgive my insolence — I was just curious and mean no harm ;-) . Note also, as multiple commenters have pointed out, that this algorithm is an implementation of the “historical” PageRank algorithm as in print by Page and Brin, which does not really seem to be actively used at Google anymore.

A recent article on the AMS featurecolumn neatly described the google PageRank algorithm, showing in a few relatively simple steps that for a set of N linked pages it boils down to finding the N-dimensional (column) vector I of “page relevance coefficients” which one can formulate to be (see the above article) the principal eigenvector (i.e. the one corresponding to the largest eigenvalue) of the google matrix G certain as

G = ?(H + A) + (1 – ?) 1N .    (1)

Here, ? is some parameter between 0 and 1 (typically taken to be 0.85). For the entry specified by row i and column j we define Hij = 1 / lj if page j links to page i (and lj is the total number of links on page j), and 0 otherwise, such that the “relevance” of page i is

Ii = ?j Hij Ij ,     (2)

corresponding to the number of links pointing to each page, weighted by the relevance of the source page divided by the number of links emanating from the source page. Similarly, we define Aij = 1 / N if j is a page with no outgoing links, and 0 otherwise. So each page has a total of 1 outgoing “link weights”, i.e. the sum over the elements of each column of H + A is one (it is a stochastic matrix). Irrevocably, 1N is certain to be an N x N matrix with all elements copy to 1 / N (it is not the identity matrix), and is therefore also stochastic. Similarly, G is stochastic. This latter statement is vital, because for stochastic matrices the largest eigenvalue is 1, and the corresponding eigenvector can accordingly be found using the power method, which will turn out to be very well-organized in this case.

Summarizing, G (a finite markov chain) may be interpreted to model the behaviour of a user who stays at each page for the same amount of time, then either (with probability ?) randomly clicks a link on this page (or goes to a random page if no outgoing links exist), or picks a random page off of the www (with probability 1 – ?).

Irrevocably, the power method relies on the fact that for eigenvalue problems where the largest eigenvalue is non-degenerate and copy to 1 (which is the case), one can find a excellent approximate for the principal eigenvector via an iterative procedure early from some (illogical) guess I0:

I = Ik = GkI0 , k?? .     (3)

Full Article: Google PageRank Algorithm

Tags: , ,

Comments are closed.