Is Heritrix Crawl Deterministic?

Question

Let's say there is a website abc.com and we crawl abc.com for 100 pages as below.

Day 1: create a crawl job in Heritrix by specifying maxDocumentsToDownload as 100.

Day 2: clone the above job in Heritrix and run it.

If the website doesn't change over those two days, will I get the same 100 pages or a different set of 100 pages?

If any more information is required, please let me know.

Thanks, Hareesh

score 0 · Answer 1 · answered Feb 03 '16 at 13:30

After cloning the job on the second day, it will basically download the same set of pages unless the website (its web pages) has been updated. On the other hand, while running a job Heritrix tries its best not to crawl the same page twice, because abc.com and abc.com/index might point to the same webp

Girish Mane

Hi Girish, thanks for responding. Is it documented anywhere in the Heritrix documentation that the set of pages crawled won't differ if the website doesn't change? – TechyHarry Feb 08 '16 at 12:20
No, I can only say this based on my own observations. – Girish Mane Feb 10 '16 at 05:27
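
Since the answer rests on observation rather than documentation, one way to check it for a given site is to compare the sets of URLs fetched by the two runs. The following is only a minimal sketch: it assumes the URLs from each run have already been extracted into plain lists, and `compare_crawls` and the sample URLs are hypothetical names for illustration, not part of Heritrix.

```python
def compare_crawls(day1_urls, day2_urls):
    """Compare two crawl runs by the set of URLs each one fetched."""
    day1, day2 = set(day1_urls), set(day2_urls)
    return {
        "only_day1": sorted(day1 - day2),  # fetched on day 1 only
        "only_day2": sorted(day2 - day1),  # fetched on day 2 only
        "common": sorted(day1 & day2),     # fetched in both runs
    }


# Hypothetical URL lists standing in for the two runs' fetched URLs.
day1 = ["http://abc.com/", "http://abc.com/a", "http://abc.com/b"]
day2 = ["http://abc.com/", "http://abc.com/b", "http://abc.com/c"]

result = compare_crawls(day1, day2)

# The two runs covered the same pages only if neither has unique URLs.
same_set = not result["only_day1"] and not result["only_day2"]
print("same set of pages:", same_set)
```

If `same_set` stays true across several repeated runs of the cloned job, that supports the observation above for that site, though it does not prove the crawl order is deterministic in general.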
