I'm trying to implement a crawler using crawler4j. It runs fine as long as:

  1. I run only one copy of it.
  2. I run it continuously, without restarting it.

If I restart the crawler, the URLs collected are no longer unique. This is because the crawler locks the root data folder (which stores the intermediate crawl data and is passed as an argument), and on restart it deletes the contents of that folder.
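
For context, my controller is set up roughly like this (a simplified sketch: the seed URL, folder argument, and MyCrawler class stand in for my actual code):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            // Root data folder, passed as a program argument; crawler4j locks
            // it while a crawl is running and clears it on a fresh start.
            config.setCrawlStorageFolder(args[0]);

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("https://www.example.com/");
            // MyCrawler is my WebCrawler subclass (placeholder name here)
            controller.start(MyCrawler.class, 1);
        }
    }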

Is it possible to:

  1. Prevent the root data folder from being locked, so I can run multiple copies of the crawler at once?
  2. Keep the contents of the root data folder from being deleted on restart, so I can resume the crawl after stopping it?

1 Answer

You can try altering the crawler configuration with:

    crawlConfig.setResumableCrawling(true);

in your controller class.

You can also follow this link and read the section on resumable crawling.
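
For example, assuming a controller along the lines of the one in the question (pageFetcher and robotstxtServer created as before), the setting goes on the CrawlConfig before you build the CrawlController:

    CrawlConfig crawlConfig = new CrawlConfig();
    crawlConfig.setCrawlStorageFolder(args[0]);
    // Keep the frontier and DocID databases in the storage folder between
    // runs instead of wiping them, so a restarted crawl continues where it
    // stopped and does not collect already-seen URLs again.
    crawlConfig.setResumableCrawling(true);

    CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);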
