
I need a sitemap to help both people and Google discover the pages. I've tried the WebSphinx application.

I noticed that if I put wikipedia.org as the starting URL, it will not crawl further.

So how do I actually crawl all of Wikipedia? Can anyone give me some guidelines? Do I need to go and find those URLs myself and supply multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

raminmm

1 Answer


Crawling Wikipedia is a bad idea: it is hundreds of TBs of data uncompressed. Instead, work offline with the various dumps Wikipedia provides. Find them here: https://dumps.wikimedia.org/

You can create a sitemap for Wikipedia using the page metadata, external links, interwiki links, and redirects databases, to name a few.
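As a rough sketch of the dump-based approach: the dumps include an "all-titles-in-ns0" file, one article title per line, which is enough to generate sitemap XML without crawling anything. The filename below is illustrative, and the 50,000-URL cap per sitemap file comes from the sitemap protocol.

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Assumption: you downloaded a titles dump such as
# enwiki-latest-all-titles-in-ns0.gz from https://dumps.wikimedia.org/
DUMP = "enwiki-latest-all-titles-in-ns0.gz"
BASE = "https://en.wikipedia.org/wiki/"
LIMIT = 50_000  # the sitemap protocol allows at most 50,000 URLs per file

def read_titles(dump_path):
    """Yield one article title per line from the gzipped dump."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            title = line.strip()
            if title:
                yield title

def write_sitemap(titles, path):
    """Write a single sitemap XML file for the given titles."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for title in titles:
        url = ET.SubElement(urlset, "url")
        # Dump titles already use underscores; quote() handles non-ASCII.
        ET.SubElement(url, "loc").text = BASE + quote(title)
    ET.ElementTree(urlset).write(path, encoding="utf-8",
                                 xml_declaration=True)

if __name__ == "__main__":
    # Split the full title list into LIMIT-sized sitemap files.
    batch, n = [], 0
    for title in read_titles(DUMP):
        batch.append(title)
        if len(batch) == LIMIT:
            write_sitemap(batch, f"sitemap-{n}.xml")
            batch, n = [], n + 1
    if batch:
        write_sitemap(batch, f"sitemap-{n}.xml")
```

For the full site you would then list the generated files in a sitemap index file, since one sitemap cannot hold millions of URLs.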

Shreyas Chavan