I'm trying to implement a crawler using crawler4j. It runs fine as long as:
- I run only one copy of it.
- I run it continuously, without restarting it.
If I restart the crawler, the collected URLs are no longer unique. This is because the crawler locks the root data folder (the folder that stores the intermediate crawler data and is passed as an argument) and, on restart, deletes its contents.
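For reference, this is roughly how I start the crawler (a minimal sketch of the standard crawler4j setup; `MyCrawler` and the seed URL are placeholders):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerLauncher {

    // Placeholder crawler: my real WebCrawler subclass collects the visited URLs
    public static class MyCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            System.out.println("Visited: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        // Root data folder for crawler4j's intermediate data, passed as an argument
        String crawlStorageFolder = args[0];

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/");

        // The folder above gets locked while this runs, and its contents
        // are cleared when the crawler is started again.
        controller.start(MyCrawler.class, /* numberOfCrawlers = */ 1);
    }
}
```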
Is it possible to:
- Prevent the root data folder from being locked, so I can run multiple copies of the crawler at once?
- Keep the contents of the root data folder from being deleted on restart, so I can resume the crawl after stopping it?