
I have been playing around with crawler4j and have successfully had it crawl some pages, but it fails on others. For example, I got it to crawl Reddit successfully with this code:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/home/user/Documents/Misc/Crawler/test";
        int numberOfCrawlers = 1;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed URLs. These are the first
         * URLs that are fetched; the crawler then starts following links
         * found in those pages.
         */
        controller.addSeed("https://www.reddit.com/r/movies");
        controller.addSeed("https://www.reddit.com/r/politics");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

And with this shouldVisit override in MyCrawler.java:

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
           && href.startsWith("https://www.reddit.com/");
}
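(FILTERS isn't shown above; assuming it follows the stock crawler4j example, a minimal sketch of the full MyCrawler.java would be:)

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip common static resources, as in the stock crawler4j example.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("https://www.reddit.com/");
    }

    @Override
    public void visit(Page page) {
        // Called only for pages that were actually fetched and parsed.
        System.out.println("Visited: " + page.getWebURL().getURL());
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            System.out.println("Outgoing links: " + htmlParseData.getOutgoingUrls().size());
        }
    }
}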

However, when I try to crawl http://www.ratemyprofessors.com/, the program just hangs with no output and crawls nothing. I use the same setup as above, with these seeds in Controller.java:

controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");

And in MyCrawler.java:

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
           && href.startsWith("http://www.ratemyprofessors.com/");
}

So I am wondering:

  • Are some servers able to recognize crawlers right away and not allow them to collect data?
  • I noticed that the RateMyProfessors pages are in .jsp format; could this have anything to do with it?
  • Are there any ways in which I could debug this better? The console does not output anything. (One idea is sketched just below.)
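One debugging step along those lines: crawler4j's RobotstxtServer exposes the same allows() check that addSeed runs internally, so each seed can be tested by hand before the crawl starts. A minimal sketch, reusing the robotstxtServer object built in Controller.main above (which already throws Exception):

import edu.uci.ics.crawler4j.url.WebURL;

// Ask the robots.txt handler directly whether a seed would be accepted.
WebURL seed = new WebURL();
seed.setURL("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
if (!robotstxtServer.allows(seed)) {
    System.out.println("robots.txt disallows seed: " + seed.getURL());
}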
theGuy05
  • You must examine the actual network traffic being exchanged. You should learn to use Wireshark, and/or any traffic debugging options provided by crawler4j. If Crawler4j does not provide debugging options to display traffic then you will probably need to trace into the code to see what it's doing. At a minimum, pause the code in your IDE when it's hung so you can determine what it is doing at that point by examining each thread's call stack. As it stands, this question is marginally on-topic as there's not a specific question. – Jim Garrison Dec 13 '15 at 22:19
  • Okay, thanks. I do have some limited experience with Wireshark, but I am trying to teach myself to use crawlers for data collection. I was more or less wondering what the possible reasons are that some URLs are not "crawled". – theGuy05 Dec 13 '15 at 23:03
  • Could be almost anything, even a bug in Crawler4J that is not handling some condition encountered on one of those pages, such as the cookie-notice dialog that pops up on ratemyprofessors.com for new visitors. If you frequent the site you would not see it in your browser, but your crawler would. – Jim Garrison Dec 13 '15 at 23:05

2 Answers


crawler4j respects crawler politeness conventions such as robots.txt. In your case this file is http://www.ratemyprofessors.com/robots.txt.

Inspecting this file reveals that crawling your given seed points is disallowed:

 Disallow: /ShowRatings.jsp 
 Disallow: /campusRatings.jsp 

This theory is supported by the crawler4j log output:

2015-12-15 19:47:18,791 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
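
If you nevertheless want to fetch those pages (and are comfortable ignoring the site's wishes), robots.txt handling can be switched off in crawler4j. A minimal sketch against the Controller code from the question:

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
// Disable robots.txt processing entirely, so the seeds above are no
// longer rejected. Use with care and respect the site's terms of use.
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);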
rzo1

I also have a similar issue, and the error message I get is:

2017-01-18 14:18:21,136 WARN [Crawler 1] e.u.i.c.c.WebCrawler [:412] Unhandled exception while fetching http://people.com/: people.com:80 failed to respond
2017-01-18 14:18:21,140 INFO [Crawler 1] e.u.i.c.c.WebCrawler [:357] Stacktrace: org.apache.http.NoHttpResponseException: people.com:80 failed to respond

But I know for sure people.com responds to browsers.
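
NoHttpResponseException generally means the server accepted the connection and then dropped it without replying; some sites do this for clients whose user agent they do not recognize. One experiment worth trying (a sketch; these are standard CrawlConfig setters, and the browser string here is just an example):

CrawlConfig config = new CrawlConfig();
// Some servers silently drop connections from the default crawler4j
// user agent; identifying as a mainstream browser tests for that.
config.setUserAgentString("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36");
// Give slow servers more time before the client gives up.
config.setConnectionTimeout(30000); // milliseconds
config.setSocketTimeout(30000);     // milliseconds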

a_p