I'm trying to implement a crawler using crawler4j. It runs fine as long as:

  1. I run only one copy of it.
  2. I run it continuously, without restarting it.

If I restart the crawler, the URLs collected are no longer unique. This is because the crawler locks the root data folder (which stores the intermediate crawl data and is passed as an argument), and on restart it deletes the contents of that folder.
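
For context, my controller is set up roughly like this (a simplified sketch: the seed URL, folder argument, and MyCrawler class stand in for my actual code):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            // Root data folder, passed as a program argument; crawler4j locks
            // it while a crawl is running and clears it on a fresh start.
            config.setCrawlStorageFolder(args[0]);

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("https://www.example.com/");
            // MyCrawler is my WebCrawler subclass (placeholder name here)
            controller.start(MyCrawler.class, 1);
        }
    }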

Is it possible to:

  1. Prevent the root data folder from being locked, so I can run multiple copies of the crawler at once?
  2. Keep the contents of the root data folder from being deleted on restart, so I can resume the crawl after stopping it?

1 Answer

You can try altering the crawler configuration with:

    crawlConfig.setResumableCrawling(true);

in your controller class.

You can also follow this link and read the section on resumable crawling.
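
For example, assuming a controller along the lines of the one in the question (pageFetcher and robotstxtServer created as before), the setting goes on the CrawlConfig before you build the CrawlController:

    CrawlConfig crawlConfig = new CrawlConfig();
    crawlConfig.setCrawlStorageFolder(args[0]);
    // Keep the frontier and DocID databases in the storage folder between
    // runs instead of wiping them, so a restarted crawl continues where it
    // stopped and does not collect already-seen URLs again.
    crawlConfig.setResumableCrawling(true);

    CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);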
