
Given this simple code:

import java.util.concurrent.TimeUnit;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

// Config is my own helper class for reading application settings.
CrawlConfig config = new CrawlConfig();
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setResumableCrawling(false);
config.setIncludeBinaryContentInCrawling(false);
config.setCrawlStorageFolder(Config.get(Config.CRAWLER_SHARED_DIR) + "test/");
config.setShutdownOnEmptyQueue(false);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://localhost/test");

// Start the crawler in its own thread so this thread stays free to add seeds.
controller.startNonBlocking(WebCrawler.class, 1);

long counter = 1;
while (Thread.currentThread().isAlive()) {
    System.out.println(config.toString());
    for (int i = 0; i < 4; i++) {
        System.out.println("Adding link");
        controller.addSeed("http://localhost/test" + ++counter + "/");
    }

    try {
        TimeUnit.SECONDS.sleep(5);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}

The output of the program is:

18:48:02.411 [main] INFO  - Obtained 6791 TLD from packaged file tld-names.txt
18:48:02.441 [main] INFO  - Deleted contents of: /home/scraper/test/frontier ( as you have configured resumable crawling to false )
18:48:02.636 [main] INFO  - Crawler 1 started
18:48:02.636 [Crawler 1] INFO  - Crawler Crawler 1 started!
Adding link
Adding link
Adding link
Adding link
18:48:02.685 [Crawler 1] WARN  - Skipping URL: http://localhost/test, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:03.642 [Crawler 1] WARN  - Skipping URL: http://localhost/test2/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:04.642 [Crawler 1] WARN  - Skipping URL: http://localhost/test3/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:05.643 [Crawler 1] WARN  - Skipping URL: http://localhost/test4/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
18:48:06.642 [Crawler 1] WARN  - Skipping URL: http://localhost/test5/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link
Adding link

Why doesn't crawler4j visit test6, test7 and the links after them?

As you can see, the 4 links added before them are queued and visited correctly.

When I set "http://localhost/" as the seed URL (before starting the crawler), it processes up to 13 links and then the same problem occurs.

What I'm trying to achieve is a setup where I can add URLs to a running crawler from another thread, at runtime.
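
To make the goal concrete, here is roughly the shape I have in mind. This is only a sketch: seedQueue and the "seed-feeder" thread are my own names (it needs java.util.concurrent.BlockingQueue and LinkedBlockingQueue), and it assumes controller.addSeed() can safely be called while the crawler is running, which is exactly the part that doesn't seem to work:

// Illustrative sketch only: the queue and feeder thread are my own, not crawler4j API.
BlockingQueue<String> seedQueue = new LinkedBlockingQueue<>();

Thread seedFeeder = new Thread(() -> {
    try {
        while (true) {
            String url = seedQueue.take();   // block until another thread submits a URL
            controller.addSeed(url);         // hand it to the already running crawler
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}, "seed-feeder");
seedFeeder.start();

// Any other thread can now submit work at runtime:
seedQueue.offer("http://localhost/test42/");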

@EDIT: Following @Seth's suggestion I've looked at a thread dump (below); the crawler thread is waiting in Frontier.getNextURLs(), but I still can't figure out why it never picks up the new seeds.

"Thread-1" #25 prio=5 os_prio=0 tid=0x00007ff32854b800 nid=0x56e3 waiting on condition [0x00007ff2de403000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at edu.uci.ics.crawler4j.crawler.CrawlController.sleep(CrawlController.java:367)
    at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:243)
    - locked <0x00000005959baff8> (a java.lang.Object)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - None

"Crawler 1" #24 prio=5 os_prio=0 tid=0x00007ff328544000 nid=0x56e2 in Object.wait() [0x00007ff2de504000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x0000000596afdd28> (a java.lang.Object)
    at java.lang.Object.wait(Object.java:502)
    at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:151)
    - locked <0x0000000596afdd28> (a java.lang.Object)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:259)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - None
  • Why do you use a for-loop with a limit of 4? – Seth Feb 04 '16 at 12:42
  • @Seth I want to simulate adding 4 links to crawl from remote source. This number isn't significant here. – Leszek Malinowski Feb 04 '16 at 12:51
  • I noticed that, I'm currently trying to get a feel for your crawler, that's why I asked. – Seth Feb 04 '16 at 12:55
  • @Seth What I'm trying to build is a crawler, running in a separate thread, that crawls only sites added by another thread at runtime. The crawler needs to wait for seeds if it wasn't initialized with any. – Leszek Malinowski Feb 04 '16 at 13:22
  • Ah! Okay, I get it. May I ask why you do it this way? You could just collect links, wait until you have collected 10, invoke the crawling method with those links, and wait again (see the sketch after this comment thread). Or is it a specific task? – Seth Feb 04 '16 at 13:27
  • @Seth I need to process a lot of links, crawl the pages, save what is interesting for me, parse it and return a product object used by my services. Let's say I've got a file with 60 000 links to products, and when you visit such a link you need to crawl to depth 2 to reach the points of interest. There are a lot of files like this and each of them has its own configuration (e.g. max depth). The whole process needs to be parallel, so when I receive a link from a product file I need to add it to the crawl. When the crawler has processed that page, it needs to notify another service about it, etc. – Leszek Malinowski Feb 04 '16 at 13:40
  • Have you seen this yet? A memory leak could very well be the answer to your problem: http://stackoverflow.com/questions/24807637/why-is-crawler4j-hanging-randomly?rq=1 – Seth Feb 04 '16 at 13:44
  • @Seth I've added thread dump to post, but I can't find out how to fix this. – Leszek Malinowski Feb 04 '16 at 14:28
  • Am I right: You have some running crawler threads and you want to add new URLs at runtime? – rzo1 Feb 08 '16 at 08:00
  • @rzo Yes, you're absolutely right. – Leszek Malinowski Feb 09 '16 at 10:31
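
For reference, Seth's batching idea from the comments might look roughly like this. It is only a sketch: collectNextLink() is a hypothetical stand-in for my remote source, and reusing config, pageFetcher and robotstxtServer to build a fresh controller per batch is my assumption, not something I've verified with crawler4j:

// Sketch of the batch-based alternative from the comments: collect 10 links,
// run one blocking crawl for that batch, then go back to collecting.
void crawlInBatches() throws Exception {
    while (true) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < 10) {
            batch.add(collectNextLink());   // hypothetical remote source
        }

        CrawlController batchController =
                new CrawlController(config, pageFetcher, robotstxtServer);
        for (String url : batch) {
            batchController.addSeed(url);
        }
        // start() blocks until this batch has been fully crawled.
        batchController.start(WebCrawler.class, 1);
    }
}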

1 Answer


So I've found the problem. It was the same one described in this pull request