
Hello everyone, I am making a web application that crawls lots of pages from a specific website. I started my crawler4j job with unlimited depth and page count, but it suddenly stopped because of an internet connection failure. Now I want to resume crawling that website without fetching the URLs I visited before, given that I know the depth of the last crawled pages.

Note: I want a way that does not require checking every fetched URL against my stored URLs, because I don't want to send too many requests to this site.

**Thanks** ☺

Ahmed Sakr

1 Answer


You can use resumable crawling with crawler4j by enabling this feature in the crawl configuration:

crawlConfig.setResumableCrawling(true);

See the crawler4j documentation for details.
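
As a minimal sketch of how this fits together (the storage path, seed URL, and the MyCrawler class name are assumptions for illustration, not from the original post), the important part is to point the controller at the same crawl storage folder as the interrupted run so the saved crawl state can be reloaded:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class ResumableCrawlExample {

        // Minimal crawler, standing in for your existing WebCrawler subclass.
        public static class MyCrawler extends WebCrawler {
            @Override
            public void visit(Page page) {
                System.out.println("Visited: " + page.getWebURL().getURL());
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig crawlConfig = new CrawlConfig();

            // Must be the SAME folder used by the interrupted run,
            // because the intermediate crawl data is persisted there.
            crawlConfig.setCrawlStorageFolder("/data/crawl/root"); // assumed path

            // Enable resumable crawling so previously visited URLs are not re-fetched.
            crawlConfig.setResumableCrawling(true);

            PageFetcher pageFetcher = new PageFetcher(crawlConfig);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);

            // Seeds already processed in the previous run are skipped.
            controller.addSeed("https://www.example.com/"); // assumed seed

            controller.start(MyCrawler.class, 4);
        }
    }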

rzo1
  • Great, but how does this method work? What logic does it use? – Ahmed Sakr Dec 12 '18 at 00:09
  • If it is enabled, it uses the internal Berkeley DB to store intermediate crawl data (the frontier and the docid server) in the location you specified by setting the crawl storage folder. – rzo1 Dec 12 '18 at 08:00