2

I'd like to know how many public pages there are on a site, for example smashingmagazine.com. Is there a way to count the number of pages?

Gaurav Sharma

3 Answers

3

You can query Google's index using the site: operator, for example:

site:domain-to-query.com

This will return a list of the pages from the site that are currently indexed by Google. Other search engines provide similar functionality, but I don't know the syntax offhand.

Of course, some pages may not be indexed, and the index may contain pages that no longer exist.

duncmc
2

You basically need to crawl the site. The process would be something like:

  • Start at root domain / homepage
  • Look for all links that point within the same domain
  • For each of those links, repeat the steps

The loop terminates when there are no more links left to crawl that point within the same domain. Remember to stay within the site, otherwise you'll start crawling external sites.
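The steps above could be sketched roughly like this in Python, using only the standard library. This is a minimal sketch, not a production crawler: the names `crawl_site`, `fetch_html`, `LinkParser` and the `max_pages` cap are my own inventions for illustration, it only follows `<a href>` links, and it ignores robots.txt, rate limiting and JavaScript-generated links. It also keeps a set of URLs already seen, since without one the loop would revisit pages that link to each other.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Download a page; return "" on errors or non-HTML responses."""
    try:
        with urlopen(url, timeout=10) as response:
            if "html" not in response.headers.get("Content-Type", ""):
                return ""
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl_site(start_url, fetch=fetch_html, max_pages=1000):
    """Breadth-first crawl that stays on start_url's host.

    Returns the set of same-host URLs discovered. max_pages is a rough
    safety cap (hypothetical default): no new page is fetched once the
    set has reached it, so the result can overshoot slightly.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(page))
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            link = urldefrag(urljoin(page, href)).url
            # Only follow links that point within the same host.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Something like `len(crawl_site("https://example.com/"))` would then give an approximate page count. Note that it counts every linked same-host URL, including ones that turn out to be broken or not HTML, so treat the number as an estimate.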

You can also try parsing the sitemap if they provide one.
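If the site does publish a sitemap (often at /sitemap.xml or listed in robots.txt, though that varies per site), counting its entries could look something like the sketch below. It assumes the standard sitemaps.org XML namespace; `parse_sitemap` is just a name I picked, and the XML is passed in as text so you can fetch it however you like.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemap_urls) found in one sitemap document.

    A plain sitemap lists pages under <url><loc>; a sitemap index lists
    further sitemaps under <sitemap><loc>, which need parsing in turn.
    """
    root = ET.fromstring(xml_text)
    pages = [loc.text.strip()
             for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    children = [loc.text.strip()
                for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS) if loc.text]
    return pages, children


example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

pages, children = parse_sitemap(example)
print(len(pages))  # 2
```

Keep in mind a sitemap only lists what the site owner chose to include, so the count may differ from what a crawl finds.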

Tools that might prove useful are JSpider if you're using Java, or Sphider if you're using PHP.

NG.
0

You'll need to scan the markup of each page, starting with your top-level page, looking for any kind of link to other pages, and recursively crawl through those. You'll also need to keep track of what has already been scanned so as not to get caught in an infinite loop.
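One detail that matters for the "keep track of what has been scanned" part: the same page can show up under slightly different URLs (a #fragment, different host casing, a missing trailing slash on the root). A rough way to handle that, assuming these particular normalization rules suit your site, is to normalize each URL before checking it against a visited set. The helper names `normalize` and `should_visit` are made up for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a canonical-ish form for use as a visited-set key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Host names are case-insensitive (this also lowercases any user:pass@ part).
    netloc = parts.netloc.lower()
    # Treat "http://host" and "http://host/" as the same page.
    path = parts.path or "/"
    # Drop the fragment: it only names a position within the same page.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def should_visit(url, visited):
    """Return True the first time a page is seen, False afterwards."""
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True
```

The crawler would call `should_visit(link, visited)` before fetching each link and skip it when the result is False. Whether query strings should count as distinct pages depends on the site, so this sketch leaves them untouched.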

George Johnston