
Why does the following code, built upon crawler4j, only crawl the given seed URLs and not start crawling other links?

public static void main(String[] args)
{
    String crawlStorageFolder = "F:\\crawl";
    int numberOfCrawlers = 7;

    CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder(crawlStorageFolder);
    config.setMaxDepthOfCrawling(4);

    /*
     * Instantiate the controller for this crawl.
     */
    PageFetcher pageFetcher = new PageFetcher(config);

    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false);

    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
    CrawlController controller = null;
    try {
        controller = new CrawlController(config, pageFetcher, robotstxtServer);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    /*
     * For each crawl, you need to add some seed urls. These are the first
     * URLs that are fetched and then the crawler starts following links
     * which are found in these pages
     */
    controller.addSeed("http://edition.cnn.com/2016/05/11/politics/paul-ryan-donald-trump-meeting/index.html");

    /*
     * Start the crawl. This is a blocking operation, meaning that your code
     * will reach the line after this only when crawling is finished.
     */
    controller.start(MyCrawler.class, numberOfCrawlers);
}

1 Answer

The official example is limited to the www.ics.uci.edu domain, so the shouldVisit method in the class extending WebCrawler needs to be adapted.

  /**
   * You should implement this function to specify whether the given url
   * should be crawled or not (based on your crawling logic).
   */
  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Ignore the url if it has an extension that matches our defined set of image extensions.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
      return false;
    }

    // Only accept the url if it is in the "www.ics.uci.edu" domain and protocol is "http".
    return href.startsWith("http://www.ics.uci.edu/");
  }
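
If the crawler is meant to follow links beyond that domain, the domain check simply has to be relaxed. Below is a minimal sketch of what such a crawler class could look like when it accepts every http/https link; this is not the asker's actual MyCrawler, and the IMAGE_EXTENSIONS pattern is only assumed to mirror the one from the official example:

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

  // Assumed to mirror the image-extension pattern used in the official example.
  private static final Pattern IMAGE_EXTENSIONS = Pattern.compile(".*\\.(bmp|gif|jpg|png)$");

  @Override
  public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Still skip image URLs.
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
      return false;
    }
    // Accept every http/https link instead of restricting to a single domain.
    return href.startsWith("http://") || href.startsWith("https://");
  }

  @Override
  public void visit(Page page) {
    // Minimal placeholder: just log each fetched URL.
    System.out.println("Visited: " + page.getWebURL().getURL());
  }
}

With that change the crawler follows the links it discovers on the seed page; the config.setMaxDepthOfCrawling(4) call from the question still bounds how far it goes from the seed.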