
Why does the following code, built upon crawler4j, only crawl the given seed URLs and not start crawling other links?

public static void main(String[] args)
{
    String crawlStorageFolder = "F:\\crawl";
    int numberOfCrawlers = 7;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setMaxDepthOfCrawling(4);

    /*
     * Instantiate the controller for this crawl.
     */
    PageFetcher pageFetcher = new PageFetcher(config);

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);

    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = null;
    try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    /*
     * For each crawl, you need to add some seed urls. These are the first
     * URLs that are fetched and then the crawler starts following links
     * which are found in these pages
     */
    controller.addSeed("http://edition.cnn.com/2016/05/11/politics/paul-ryan-donald-trump-meeting/index.html");

    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);
}

1 Answer

The official example is limited to the www.ics.uci.edu domain, so the shouldVisit method in the class extending WebCrawler needs to be adapted.

  /**
   * You should implement this function to specify whether the given url
   * should be crawled or not (based on your crawling logic).
   */
  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Ignore the url if it has an extension that matches our defined set of image extensions.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
      return false;
    }

    // Only accept the url if it is in the "www.ics.uci.edu" domain and protocol is "http".
    return href.startsWith("http://www.ics.uci.edu/");
  }
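
If the crawler is meant to follow links beyond that domain, the domain check simply has to be relaxed. Below is a minimal sketch of what such a crawler class could look like when it accepts every http/https link; this is not the asker's actual MyCrawler, and the IMAGE_EXTENSIONS pattern is only assumed to mirror the one from the official example:

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

  // Assumed to mirror the image-extension pattern used in the official example.
  private static final Pattern IMAGE_EXTENSIONS = Pattern.compile(".*\\.(bmp|gif|jpg|png)$");

  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Still skip image URLs.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
      return false;
    }
    // Accept every http/https link instead of restricting to a single domain.
    return href.startsWith("http://") || href.startsWith("https://");
  }

  @Override
  public void visit(Page page) {
    // Minimal placeholder: just log each fetched URL.
    System.out.println("Visited: " + page.getWebURL().getURL());
  }
}

With that change the crawler follows the links it discovers on the seed page; the config.setMaxDepthOfCrawling(4) call from the question still bounds how far it goes from the seed.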