Given this simple code:
CrawlConfig config = new CrawlConfig();
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setResumableCrawling(false);
config.setIncludeBinaryContentInCrawling(false);
config.setCrawlStorageFolder(Config.get(Config.CRAWLER_SHARED_DIR) + "test/");
config.setShutdownOnEmptyQueue(false); // crawler should idle, not shut down, when the queue runs empty

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://localhost/test");
controller.startNonBlocking(WebCrawler.class, 1);

// Every 5 seconds, feed four new seed URLs to the running crawler.
long counter = 1;
while (Thread.currentThread().isAlive()) {
    System.out.println(config.toString());
    for (int i = 0; i < 4; i++) {
        System.out.println("Adding link");
        controller.addSeed("http://localhost/test" + ++counter + "/");
    }
    try {
        TimeUnit.SECONDS.sleep(5);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
The output of the program is:
18:48:02.411 [main] INFO - Obtained 6791 TLD from packaged file tld-names.txt
18:48:02.441 [main] INFO - Deleted contents of: /home/scraper/test/frontier ( as you have configured resumable crawling to false )
18:48:02.636 [main] INFO - Crawler 1 started
18:48:02.636 [Crawler 1] INFO - Crawler Crawler 1 started!
Adding link
Adding link
Adding link
Adding link
18:48:02.685 [Crawler 1] WARN - Skipping URL: http://localhost/test, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:03.642 [Crawler 1] WARN - Skipping URL: http://localhost/test2/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:04.642 [Crawler 1] WARN - Skipping URL: http://localhost/test3/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:05.643 [Crawler 1] WARN - Skipping URL: http://localhost/test4/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:06.642 [Crawler 1] WARN - Skipping URL: http://localhost/test5/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
Why doesn't crawler4j visit test6, test7, and beyond?
As you can see, the seed and the four links added before them are all requested correctly (the 404s are expected, since the URLs don't exist).
When I set "http://localhost/" as the seed URL (before starting the crawler), it processes up to 13 links and then the same problem occurs.
What I'm trying to achieve is to be able to add URLs to the running crawler from another thread, at runtime.
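In other words, something like the following sketch (the SeedFeeder class and its queue are my own illustration, not part of the crawler4j API; I'm assuming controller.addSeed can safely be called after startNonBlocking, which is exactly what doesn't seem to work):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import edu.uci.ics.crawler4j.crawler.CrawlController;

// Hypothetical feeder: producer threads push URLs into a queue,
// and this runnable hands them to the already-running controller.
public class SeedFeeder implements Runnable {
    private final CrawlController controller;
    private final BlockingQueue<String> urls = new LinkedBlockingQueue<>();

    public SeedFeeder(CrawlController controller) {
        this.controller = controller;
    }

    // Called from any other thread to schedule a URL for crawling.
    public void submit(String url) {
        urls.add(url);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // I expect addSeed to enqueue the URL into the frontier at runtime.
                controller.addSeed(urls.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}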
EDIT: As @Seth suggested, I've taken a thread dump. Crawler 1 appears to stay parked in Frontier.getNextURLs even after new seeds are added, but I still can't work out why:
"Thread-1" #25 prio=5 os_prio=0 tid=0x00007ff32854b800 nid=0x56e3 waiting on condition [0x00007ff2de403000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at edu.uci.ics.crawler4j.crawler.CrawlController.sleep(CrawlController.java:367)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:243)
- locked <0x00000005959baff8> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
"Crawler 1" #24 prio=5 os_prio=0 tid=0x00007ff328544000 nid=0x56e2 in Object.wait() [0x00007ff2de504000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000596afdd28> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:151)
- locked <0x0000000596afdd28> (a java.lang.Object)
at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:259)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None