Let's say there is a website abc.com and we crawl abc.com for 100 pages as below.

Day 1: create a crawl job in Heritrix by specifying maxDocumentsToDownload as 100.
Day 2: clone the above job in Heritrix and run it.

If the website doesn't change over those two days, will I get the same 100 pages or a different set of 100 pages?

If any more information is required, please let me know.

Thanks, Hareesh

TechyHarry

1 Answer

After you clone the job on the second day, it will download essentially the same set of pages unless the website's pages have been updated. Also, while a job is running, Heritrix tries its best not to crawl the same page twice, because abc.com and abc.com/index might point to the same webp
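To illustrate why an unchanged site tends to give the same pages, here is a toy sketch in Python. It is not Heritrix code: the `SITE` link graph, `canonicalize`, and `crawl` are all made up for illustration, and the trailing-slash rule is only an assumed stand-in for whatever URL normalization a real crawler applies. It models a single-threaded crawl with a first-in, first-out frontier, a set of already-seen URLs, and a page cap. A real multi-threaded crawl can be affected by network timing, so treat this as an intuition aid rather than a guarantee.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on it.
SITE = {
    "abc.com": ["abc.com/a", "abc.com/b", "abc.com/"],
    "abc.com/a": ["abc.com/c", "abc.com"],
    "abc.com/b": ["abc.com/c", "abc.com/d"],
    "abc.com/c": [],
    "abc.com/d": ["abc.com/a"],
}


def canonicalize(url):
    # Assumed toy rule: treat "abc.com/" and "abc.com" as the same page.
    return url.rstrip("/")


def crawl(seed, max_pages):
    """Breadth-first crawl that stops after max_pages downloads."""
    start = canonicalize(seed)
    frontier = deque([start])   # URLs waiting to be fetched, in FIFO order
    seen = {start}              # URLs already queued, so none is fetched twice
    downloaded = []
    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()
        downloaded.append(url)
        for link in SITE.get(url, []):
            link = canonicalize(link)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return downloaded


day1 = crawl("abc.com", max_pages=3)
day2 = crawl("abc.com", max_pages=3)  # the "cloned" job on an unchanged site
print(day1)
print(day1 == day2)
```

With the same seed, the same link graph, and the same processing order, both runs visit the pages in the same order and hit the cap at the same point, so the downloaded sets match.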

Girish Mane
  • Hi Girish, thanks for responding. Is it documented anywhere in the Heritrix documentation that the set of pages crawled won't differ if the website doesn't change? – TechyHarry Feb 08 '16 at 12:20
  • No, I could only tell based on observations. – Girish Mane Feb 10 '16 at 05:27