I am using Crawler4j simply to get the HTML from the crawled pages. It successfully stores the retrieved HTML for my test site of about 50 pages. It uses the shouldVisit and visit methods I implemented, and both run without any problems. The files are also written without any problems. But after all the pages have been visited and stored, the crawler doesn't stop blocking:
System.out.println("Starting Crawl");
controller.start(ExperimentCrawler.class, numberOfCrawlers);
System.out.println("finished crawl");
The second println statement never executes. In my storage destination, the crawler has created a folder called 'frontier' that it holds a lock on (I can't delete it since the crawler is still using it).
Here are the config settings I've given it (though it doesn't seem to matter which settings I use):
config.setCrawlStorageFolder("/data/crawl/root");
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setMaxPagesToFetch(50);
config.setConnectionTimeout(500);
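And this is roughly how that config is wired into the controller (the seed URL and number of crawlers are stand-ins; the config calls just repeat the snippet above):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ExperimentController {

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/data/crawl/root");
        config.setMaxDepthOfCrawling(1);
        config.setPolitenessDelay(1000);
        config.setMaxPagesToFetch(50);
        config.setConnectionTimeout(500);

        // Standard crawler4j wiring: the fetcher and robots.txt handler feed the controller
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.example.com/");  // stand-in seed URL

        int numberOfCrawlers = 1;  // stand-in count
        System.out.println("Starting Crawl");
        // start() blocks until all crawlers finish -- this is the call that never returns
        controller.start(ExperimentCrawler.class, numberOfCrawlers);
        System.out.println("finished crawl");
    }
}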
There is an error that appears about one minute after the crawl finishes:
java.lang.NullPointerException
at com.sleepycat.je.Database.trace(Database.java:1816)
at com.sleepycat.je.Database.sync(Database.java:489)
at edu.uci.ics.crawler4j.frontier.WorkQueues.sync(WorkQueues.java:187)
at edu.uci.ics.crawler4j.frontier.Frontier.sync(Frontier.java:182)
at edu.uci.ics.crawler4j.frontier.Frontier.close(Frontier.java:192)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:232)
at java.lang.Thread.run(Unknown Source)
What could be keeping the crawler from exiting? What is it writing to the 'frontier' folder?