6

I am learning about the PageRank algorithm, so apologies for the newbie questions. I understand that the PR value of each page is calculated by summing the contributions from its incoming links.

What bothers me is the statement on Wikipedia that "the PageRank values sum to one".

In the example shown on Wikipedia, if every page has an outbound link, then the probabilities over all pages should sum to one. However, if a page has no outbound links, such as page A in the example, then the sum should not be 1, right?

So does the PageRank algorithm have to assume that every page has at least one outbound link? Could someone elaborate on how PageRank deals with pages that have no incoming or no outbound links? How do the formulas change accordingly? Thanks.

amit
Cassie
  • 4
    @RaymondChen This does not belong on Webmasters at all. It is a question about how to handle an edge case of a well-known algorithm, and is very much related to programming. A programming question does not have to be "how to parse a string in C?"; it can also be about algorithmic concepts that, once understood, can very easily be translated to any programming language. – amit Feb 02 '14 at 07:13
  • 1
    @RaymondChen As the most active member (or at least one of them) of the 'algorithm' tag community on SO, I disagree with you, and I think it is a perfectly fine question for SO. It is not too theoretical, because the question is about a practical issue that arises when building the graph and invoking the PageRank algorithm on it, and, as said, it can very easily be translated to any programming language. – amit Feb 02 '14 at 07:22

2 Answers

15

As PageRank is described in the original article and in the Wikipedia article, it is indeed not defined when out-degree(v) = 0 for some v, since you get P(v,u) = d/n + (1-d)*0/0, which is undefined.
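Written out in full, the transition probability above would look roughly like the following. This is a sketch of the notation the answer appears to use, with d as the random-jump probability; the indicator A(v,u) is a symbol introduced here for illustration and is not in the original.

```latex
P(v,u) = \frac{d}{n} + (1-d)\,\frac{A(v,u)}{\operatorname{out-degree}(v)},
\qquad
A(v,u) =
\begin{cases}
1 & \text{if there is an edge } (v,u),\\
0 & \text{otherwise.}
\end{cases}
```

For a node v with out-degree(v) = 0, A(v,u) = 0 for every u, so the second term becomes 0/0.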

A node that has no outgoing edge is called a dangling node and there are basically 3 common ways to take care of them:

  1. Eliminate such nodes from the graph (and repeat the process iteratively until there are no dangling nodes left).
  2. Consider those pages to link back to the pages that linked to them (i.e., for each edge (u,v), if out-degree(v) = 0, regard (v,u) as an edge).
  3. Link the dangling node to all pages (usually including itself), which effectively makes the probability of a random jump from this node 1.
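As a rough illustration of option 3, here is a minimal power-iteration sketch. It assumes d is the random-jump probability (the convention of the formula in this answer); the function name `pagerank`, the default d = 0.15, and the four-page graph are all hypothetical choices made for the example.

```python
def pagerank(links, d=0.15, iterations=100):
    """Power iteration with d as the random-jump probability.

    links maps each page to the list of pages it links to.
    A dangling page (empty list) is treated as linking to every page,
    including itself (option 3).
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank currently held by dangling pages is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: d / n + (1 - d) * dangling / n for p in pages}
        for p in pages:
            for q in links[p]:
                new_rank[q] += (1 - d) * rank[p] / len(links[p])
        rank = new_rank
    return rank


# Hypothetical graph: page A has no outbound links.
graph = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(graph)
total = sum(ranks.values())  # stays at 1 up to floating-point rounding
```

Because the dangling page's rank is redistributed rather than dropped, each iteration hands out d in random jumps plus (1 - d) times the full previous total, so the ranks keep summing to one.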

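As for a page with no incoming links, that shouldn't be an issue, because everything is perfectly well defined. Such a node will have a PageRank of exactly d/n, where d is the random-jump probability as in the formula above (in Wikipedia's notation, where d is the damping factor, this is (1-d)/N). You can only reach the node by a random jump from some other node, and d/n is the probability of landing on it that way.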
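A quick way to check the d/n claim is a small sketch like the one below. It assumes a graph in which every page has at least one outbound link (so no dangling-node handling is needed) and again takes d as the random-jump probability; the function name and the graph are hypothetical.

```python
def pagerank_no_dangling(links, d=0.15, iterations=50):
    """Power iteration that assumes every page has at least one outbound link."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the random-jump share d/n.
        new_rank = {p: d / n for p in pages}
        for p in pages:
            share = (1 - d) * rank[p] / len(links[p])
            for q in links[p]:
                new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical graph: nothing links to D, but D links to A.
graph = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
ranks = pagerank_no_dangling(graph)
rank_d = ranks["D"]  # only the random-jump share ever reaches D
```

Since no link contributes to D, its value never moves away from the d/n term it is given at the start of every iteration.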

Hope that answered your question!

amit
  • http://webcourse.cs.technion.ac.il/236375/Winter2013-2014/ho/WCFiles/lec2-linkAnalysisIntro.pdf slide 28 – Shmoopy Feb 03 '14 at 15:32
  • @Shmoopy Thanks, it is indeed taken from there. I am the Teaching Assistant in this course, so I must say I am familiar with it. – amit Feb 03 '14 at 15:48
2

The PageRank algorithm ranks a page based on the incoming links to that page. The outbound links from that page help determine the PageRank of the other pages to which it links. This process is repeated iteratively to determine PageRank.

In each iteration, value is added to page A's PageRank if there are incoming links from other pages. The value added to page A is the PageRank of page B, which contains the incoming link to page A, divided by the total number of outgoing links on page B.

Therefore, having no outbound links will not affect the PageRank of page A. The impact of having no outbound links is only that page A will not add value to the PageRank of any other pages. By contrast, if there are no incoming links to page B, it will have the baseline (very low) PageRank, because it never gets added value from incoming links.
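The per-iteration update described above can be sketched as follows. This is a simplified form without the random-jump term, matching the description in this answer rather than the full formula; the function name `iterate_once` and the four-page graph are hypothetical.

```python
def iterate_once(links, rank):
    """One update: each page B gives rank[B] / out-degree(B) to every page it links to."""
    new_rank = {page: 0.0 for page in links}
    for page_b, targets in links.items():
        for page_a in targets:
            new_rank[page_a] += rank[page_b] / len(targets)
    return new_rank


# Hypothetical graph: A has no outbound links, D has no incoming links.
graph = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
rank = {page: 0.25 for page in graph}
rank = iterate_once(graph, rank)
total = sum(rank.values())  # about 0.75 after this one step
```

In this sketch the 0.25 that page A held is passed on to nobody, so the total drops from 1 to about 0.75 after one step, and page D, which receives nothing, falls to 0.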

davemb83
  • Thank you very much for the reply. I understand most of the process. The points that confuse me are: (1) What is the PR(D) value at the second iteration here? Page D happens to have no incoming links. (2) PR(A) + PR(B) + PR(C) + PR(D) does not sum to 1 as the statement claims. What did I do wrong here? Thanks, – Cassie Feb 02 '14 at 06:27
  • 1
    Sorry I misunderstood the depth of your question. Glad to see it was answered for you! For the sake of completeness, it seems that the Wikipedia article addresses the issue of pages without outbound links, or "sinks." The article states, "When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection" (the 3rd option outlined in amit's reply). As amit also pointed out, the PageRank for page D, which has no incoming links, should be calculable. It would be the "baseline" of randomly getting the page, which I believe is (1-d)/N. – davemb83 Feb 02 '14 at 13:49