5

I am trying to compute the internal PageRank of Wikipedia using MapReduce. I implemented my PageRank algorithm on a small subset of Wikipedia pages (6349 pages) and used this formula to calculate the PageRank (d = 0.85).

[image: PageRank formula]

I wanted to verify whether the sum of all the PageRanks is equal to the total number of pages (6349).

What I found so far:

1. The total PageRank of all 6349 pages is 1001.26044.

2. According to Wikipedia, if I use the above formula, each PageRank is multiplied by N and the sum becomes N. I multiplied each PageRank by N (6349) and calculated the sum; I got 6356789.5.

Is there a reason why the sum of the PageRanks is not equal to the total number of pages? Should I use the second formula to verify?

[image: second PageRank formula]

Note: I ran my MapReduce code for 10 iterations to get a good approximation.

yesh

2 Answers

6

I suspect you ran too few iterations. Why 10, rather than 100 or 100000? Instead of fixing the count in advance, measure the average or the maximum change in the ranks between the last two iterations, and use that to estimate the possible error.

Also, PR is a probability, so the sum of all the ranks should be 1! The statement "the sum of all the PageRanks is equal to the total number of pages" is wrong.

As for the other formula, it belongs to a different model and gives a different PR. Of course, you can use it too, or both. But you can't use it to check the result of the first one.
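As a rough illustration of this convergence check, here is a minimal single-machine sketch in Python (not MapReduce). It assumes the normalized formula with (1 - d) / N, a tiny made-up link graph, and that rank from pages without outgoing links is spread evenly over all pages; the names `pagerank`, `links`, `tol` and `max_iter` are hypothetical. It stops when the maximum change between two consecutive iterations drops below a tolerance, instead of after a fixed number of iterations.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=1000):
    """Iterate PageRank until the largest per-page change is below tol.

    links: dict mapping each page to the list of pages it links to.
    Every linked page is assumed to also appear as a key of `links`.
    Returns (ranks, history), where history holds one
    (max_diff, mean_diff) pair per iteration.
    """
    pages = sorted(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    history = []

    for _ in range(max_iter):
        # Rank held by pages with no outgoing links (assumption: it is
        # redistributed evenly, so the total rank stays equal to 1).
        dangling = sum(ranks[p] for p in pages if not links[p])
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for p in pages:
            targets = links[p]
            if targets:
                share = d * ranks[p] / len(targets)
                for q in targets:
                    new[q] += share

        # Compare two consecutive iterations with two different measures.
        diffs = [abs(new[p] - ranks[p]) for p in pages]
        max_diff = max(diffs)
        mean_diff = sum(diffs) / n
        history.append((max_diff, mean_diff))

        ranks = new
        if max_diff < tol:
            break

    return ranks, history


# Hypothetical four-page graph, used only for illustration.
example_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks, history = pagerank(example_links)
print("iterations:", len(history))
print("sum of ranks:", sum(ranks.values()))
print("last max/mean change:", history[-1])
```

With this formulation the printed sum stays at 1 up to rounding, and the `history` list shows how quickly the changes shrink, which is what lets you judge whether 10 iterations were enough.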

Gangnus
  • You want me to calculate the difference between the total PageRank of the last 2 iterations? I didn't quite understand what you meant by medium or maximal. How will this help to evaluate the possible error? – yesh Nov 27 '12 at 13:39
  • You don't know the real PR, remember? So you can only guess how close you are to it by comparing the results of consecutive iterations. But these results are not single numbers, they are vectors of about 6k elements. So, if you want to compare them, you have to choose some measure: the average difference or the maximum difference. – Gangnus Nov 27 '12 at 14:44
  • If the maximum differences go like 1/10, 1/20, 1/40, 1/80 ..., then you can reasonably estimate the real error of the last iteration as about 1/80. – Gangnus Nov 27 '12 at 14:46
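One way to see where the 1/80 estimate comes from, as a sketch only: assume the maximum differences keep shrinking by the same ratio r = 1/2 in all later iterations (this is an assumption, not something the four observed values guarantee). Writing x_n for the rank vector after iteration n, x* for the limit, and delta_n for the last observed difference (symbols introduced here for illustration), the remaining error is bounded by the sum of all future changes:

```latex
\[
\lVert x^{*} - x_{n} \rVert
  \;\le\; \sum_{k=n+1}^{\infty} \delta_{k}
  \;=\; \delta_{n} \sum_{j=1}^{\infty} r^{j}
  \;=\; \delta_{n}\,\frac{r}{1-r},
\qquad
r = \tfrac{1}{2},\;\; \delta_{n} = \tfrac{1}{80}
\;\Longrightarrow\;
\lVert x^{*} - x_{n} \rVert \le \tfrac{1}{80}.
\]
```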
-1

It depends on what base you choose (the default is 1). After each iteration you have to calculate

delta = (base - sum_of_ranks) / N

Then add delta to each rank, so that the ranks sum to base again. Only this way will you keep your ranks alive until the end of the last iteration.
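A minimal Python sketch of this correction step, under the assumption that delta is added to every rank so that the total returns to `base`. The function name `renormalize` and the sample values are hypothetical and only illustrate the arithmetic.

```python
def renormalize(ranks, base=1.0):
    """Shift every rank by the same delta so that the ranks sum to `base`.

    ranks: dict mapping page -> current rank.
    base:  the target total (1 for the probability model, or N if the
           ranks are expected to sum to the number of pages).
    """
    n = len(ranks)
    delta = (base - sum(ranks.values())) / n
    # Assumption: delta is added, which restores the missing rank when the
    # current sum is below base (and removes the excess when it is above).
    return {page: rank + delta for page, rank in ranks.items()}


# Hypothetical ranks that have lost some total rank during an iteration.
leaked = {"A": 0.2, "B": 0.3, "C": 0.1}
fixed = renormalize(leaked, base=1.0)
print(sum(fixed.values()))
```

Shifting by a constant keeps the differences between pages unchanged. Dividing every rank by the current sum is another common way to restore the total, but that is a different correction from the one described in this answer.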

Andrew Barber
Andrii Dvoiak