
Does anybody have experience with using Crawler4j?

I followed the example from the project page to build my own crawler. The crawler works fine and crawls very fast. The only issue is that I always have a delay of 20–30 seconds. Is there a way to avoid this waiting time?

user3411187
  • You mean processing or waiting time? The only waiting-related setting that I know about is the "[politeness delay](https://code.google.com/p/crawler4j/wiki/Configurations#Politeness)". – Anthony Accioly May 02 '14 at 16:02
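
For reference, a minimal sketch of that setting, assuming the standard CrawlConfig setup from the crawler4j project page (the storage folder path is only a placeholder):

CrawlConfig config = new CrawlConfig();
// Placeholder path for the crawl's intermediate data
config.setCrawlStorageFolder("/tmp/crawler4j");
// Delay, in milliseconds, between requests sent to the same host
config.setPolitenessDelay(200);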

1 Answer


I just checked the crawler4j source code. The CrawlController.start method has a lot of fixed 10-second "pauses" to make sure that the threads are done and ready to be cleaned up.

// Make sure again that none of the threads are alive.
logger.info("It looks like no thread is working, waiting for 10 seconds to make sure...");
sleep(10);

// ... more code ...

logger.info("No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...");
sleep(10);

// ... more code ...

logger.info("Waiting for 10 seconds before final clean up...");
sleep(10);

Also, the main loop only checks every 10 seconds whether the crawling threads are done:

while (true) {
    sleep(10);
    // code to check if some thread is still working
}

protected void sleep(int seconds) {
    try {
        Thread.sleep(seconds * 1000);
    } catch (Exception ignored) {
    }
}

So it may be worth fine-tuning those calls and reducing the sleep time.

A better solution, if you can spare some time, would be to rewrite this method. I would replace the List<Thread> of crawler threads with an ExecutorService; its awaitTermination method would be particularly handy. Unlike the fixed sleep, awaitTermination(10, TimeUnit.SECONDS) returns as soon as all tasks are done, instead of always waiting the full 10 seconds. A rough sketch of that approach is below.
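
This is not crawler4j's actual code, just a self-contained sketch of the idea; the thread count and the dummy tasks are assumptions for illustration:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlShutdownSketch {
    public static void main(String[] args) throws InterruptedException {
        int numberOfCrawlers = 4; // assumed thread count

        ExecutorService pool = Executors.newFixedThreadPool(numberOfCrawlers);
        for (int i = 0; i < numberOfCrawlers; i++) {
            pool.submit(() -> {
                // the crawling work of one crawler thread would go here
            });
        }

        pool.shutdown(); // accept no new tasks; submitted crawls keep running

        // Unlike a fixed sleep(10), this returns as soon as every task has
        // finished, and only waits the full 10 seconds in the worst case.
        while (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            // threads are still working; keep waiting
        }

        // All crawler threads are done; the final clean up can run here.
    }
}

This way the controller finishes as soon as the last crawler thread exits, instead of accumulating several fixed 10-second pauses.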

Anthony Accioly