Just checked the crawler4j source code. The CrawlController.start method has a lot of fixed 10-second "pauses" to make sure that the threads are done and ready to be cleaned up.
// Make sure again that none of the threads are alive.
logger.info("It looks like no thread is working, waiting for 10 seconds to make sure...");
sleep(10);
// ... more code ...
logger.info("No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...");
sleep(10);
// ... more code ...
logger.info("Waiting for 10 seconds before final clean up...");
sleep(10);
Also, the main loop only wakes up every 10 seconds to check whether the crawling threads are done:
while (true) {
    sleep(10);
    // code to check if some thread is still working
}

protected void sleep(int seconds) {
    try {
        Thread.sleep(seconds * 1000);
    } catch (Exception ignored) {
    }
}
So it may be worth fine-tuning those calls and reducing the sleep time.
A better solution, if you can spare some time, would be to rewrite this method. I would replace the List<Thread> threads
with an ExecutorService; its awaitTermination method would be particularly handy. Unlike Thread.sleep, awaitTermination(10, TimeUnit.SECONDS)
returns as soon as all tasks are done (once shutdown() has been called on the executor) and only waits the full 10 seconds if they are not.
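
To illustrate the difference, here is a rough sketch of the ExecutorService approach. It is not crawler4j code: the class name AwaitTerminationSketch, the fakeWorker helper and the runAndWait method are made up for this example, and the workers only simulate crawling by sleeping for a short while.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AwaitTerminationSketch {

    // Hypothetical stand-in for a crawler thread: it just sleeps for a while.
    static Runnable fakeWorker(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so the pool can shut down cleanly.
                Thread.currentThread().interrupt();
            }
        };
    }

    // Starts the workers, waits for them, and returns the time spent waiting in ms.
    static long runAndWait(int workerCount, long workMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            pool.submit(fakeWorker(workMillis));
        }

        // No new tasks are accepted after this; already submitted tasks keep running.
        pool.shutdown();

        long start = System.nanoTime();
        try {
            // Returns as soon as every task has finished, or after 10 seconds at most.
            boolean finished = pool.awaitTermination(10, TimeUnit.SECONDS);
            if (!finished) {
                // Timed out: interrupt whatever is still running.
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long waited = runAndWait(4, 200);
        // Expect a value close to the task duration, well under the 10 second timeout.
        System.out.println("Waited about " + waited + " ms");
    }
}
```

With four workers that each take about 200 ms, the wait ends shortly after the last worker finishes instead of burning a full 10 seconds the way a fixed sleep(10) would. In a real rewrite the crawler tasks would be submitted in place of fakeWorker.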