6

I come up with an interview question that I would like to know your opinion about that. The questions are say that in designing a web crawler:

1) what kind of pages will you hit with a DFS versus BFS?

2) how would you avoid getting into infinite loops?

I appreciate if somebody could answer them.

Dominique Fortin
  • 2,212
  • 15
  • 20
Nazgol
  • 171
  • 2
  • 8
  • I think this post answers the questions pretty well: http://stackoverflow.com/questions/4356524/web-cralwer-algorithm-depth – megawac Dec 14 '13 at 03:13
  • But my question is that for what kind of pages we prefer DFS over BFS.right? – Nazgol Dec 14 '13 at 04:03
  • possible duplicate of [Designing a web crawler](http://stackoverflow.com/questions/5834808/designing-a-web-crawler) – Devolus Dec 14 '13 at 08:35

1 Answers1

10

1) what kind of pages will you hit with a DFS versus BFS?

In most situations, I will use BFS algorithm to implement a spider because most valuable info I want to get from web pages doesn't have much link depth, otherwise I think the site is not much valuable to crawl because of the bad design.

If I want to get some specific data from one page and other related data from a few hops and at the same time I want to see the results soon after the spider runs, then I may choose DFS algorithm. Say, I want to get all the tags from stackoverflow. The tag page is here. At the same time, I want to get who answer what questions in the tag. And I want to check whether the spider runs properly. Then I use DFS algorithm to get the data tag-questions-answers soon after the spider runs.

In a word, it depends on the usage scenario.

2) how would you avoid getting into infinite loops?

This question may be simple. Solutions are such as:

  • Use MAX LINK DEPTH.
  • Record urls that you have crawled and before emit a new request, check whether the url has been crawled.

I remember scrapy seems could solve the second question. You could read its source code to search for a better solution.

Community
  • 1
  • 1
flyer
  • 9,280
  • 11
  • 46
  • 62