I am using Crawler4j simply to get the HTML from the crawled pages. It successfully stores the retrieved HTML for my test site of about 50 pages. It uses the shouldVisit and visit methods I implemented, and both run without any problems. The files are also written without any problems. But after all the pages have been visited and stored, the crawler doesn't stop blocking:
System.out.println("Starting Crawl");
controller.start(ExperimentCrawler.class, numberOfCrawlers);
System.out.println("finished crawl");
The second println statement never executes. In my storage destination, the crawler has created a folder called 'frontier' that it holds a lock on (I can't delete it since the crawler is still using it).
Here are the config settings I've given it (though it doesn't seem to matter which settings I use):
config.setCrawlStorageFolder("/data/crawl/root");
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setMaxPagesToFetch(50);
config.setConnectionTimeout(500);
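And this is roughly how that config is wired into the controller (the seed URL and number of crawlers are stand-ins; the config calls just repeat the snippet above):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ExperimentController {

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root");
        config.setMaxDepthOfCrawling(1);
        config.setPolitenessDelay(1000);
        config.setMaxPagesToFetch(50);
        config.setConnectionTimeout(500);

        // Standard crawler4j wiring: the fetcher and robots.txt handler feed the controller
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");  // stand-in seed URL

        int numberOfCrawlers = 1;  // stand-in count
        System.out.println("Starting Crawl");
        // start() blocks until all crawlers finish -- this is the call that never returns
        controller.start(ExperimentCrawler.class, numberOfCrawlers);
        System.out.println("finished crawl");
    }
}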
There is an error that appears about one minute after the crawl finishes:
java.lang.NullPointerException
at com.sleepycat.je.Database.trace(Database.java:1816)
at com.sleepycat.je.Database.sync(Database.java:489)
at edu.uci.ics.crawler4j.frontier.WorkQueues.sync(WorkQueues.java:187)
at edu.uci.ics.crawler4j.frontier.Frontier.sync(Frontier.java:182)
at edu.uci.ics.crawler4j.frontier.Frontier.close(Frontier.java:192)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:232)
at java.lang.Thread.run(Unknown Source)
What could be keeping the crawler from exiting? What is it writing to the 'frontier' folder?