Crawl urls with a certain prefix

Question

I would like to use crawler4j to crawl only URLs that start with a certain prefix.

For example, a URL that starts with http://url1.com/timer/image is valid, e.g. http://url1.com/timer/image/text.php.

This URL is not valid: http://test1.com/timer/image

I tried to implement it like this:

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    String adrs1 = "http://url1.com/timer/image";
    String adrs2 = "http://url2.com/house/image";

    if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))) {
        return false;
    }

    if (filters.matcher(href).matches()) {
        return false;
    }

    for (String crawlDomain : myCrawlDomains) {
        if (href.startsWith(crawlDomain)) {
            return true;
        }
    }

    return false;
}

However, this does not seem to work, because the crawler also visits other URLs.

Any recommendation on what I could do?

I appreciate your answer!

What happens when you provide correct URLs (URLs with the allowed prefix)? Do they work at all? - Bala, Sep 16 '14 at 09:01

Jama A. · Accepted Answer · 2014-09-17T21:33:05.327

Basically, you can keep an array of the prefixes that you want to crawl. Inside your method, just traverse the array and return true only if the URL matches one of the allowed prefixes. That means you don't have to list any domains which you don't want to crawl.

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    // prefixes that you want to crawl
    String[] allowedPrefixes = {"http://url1.com", "http://url2.com"};

    for (String allowedPrefix : allowedPrefixes) {
        if (href.startsWith(allowedPrefix)) {
            return true;
        }
    }

    return false;
}

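Your code is not working because your condition is incorrect:

!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))

A URL cannot start with both prefixes at once, so at least one of the two negations is always true and the method returns false for every URL. Use && instead of || to reject only URLs that match neither prefix.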
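A small standalone check of that condition may help. This is only a sketch: the class name ConditionCheck and the two helper methods are made up for illustration and are not part of crawler4j. It uses the two prefixes from the question and compares the original rejection test with an && variant.

```java
final class ConditionCheck {
    static final String ADRS1 = "http://url1.com/timer/image";
    static final String ADRS2 = "http://url2.com/house/image";

    // Rejection test as written in the question: true means "do not visit".
    static boolean rejectedByOriginal(String href) {
        return !href.startsWith(ADRS1) || !href.startsWith(ADRS2);
    }

    // Variant with &&: reject only when neither prefix matches.
    static boolean rejectedByFixed(String href) {
        return !href.startsWith(ADRS1) && !href.startsWith(ADRS2);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://url1.com/timer/image/text.php",
            "http://url2.com/house/image/a.html",
            "http://test1.com/timer/image"
        };
        for (String href : samples) {
            System.out.println(href
                    + " original=" + rejectedByOriginal(href)
                    + " fixed=" + rejectedByFixed(href));
        }
    }
}
```

With the original test every sample is rejected, including the two that carry an allowed prefix; with && only the test1.com URL is rejected.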

Another reason is that you might not have configured crawlerDomains. They are configured during startup of your application by calling CrawlController#setCustomData(crawler1Domains);

Look at the sample source code of crawler4j; crawlerDomains are set here: MultipleCrawlerController.java#79

score 1 · Answer 2 · answered Sep 18 '14 at 05:21

Take a look at the code below; it may help you.

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    String adrs1 = "http://url1.com/timer/image";
    String adrs2 = "http://url2.com/house/image";
    return !FILTERS.matcher(href).matches() && (href.startsWith(adrs1) || href.startsWith(adrs2));
}
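If you want to try this logic without starting a crawl, something like the following standalone sketch could work. Note the assumptions: the class PrefixFilter and its String-based shouldVisit are hypothetical helpers, not crawler4j's API, and the FILTERS pattern here is a guess at a typical static-resource filter, since the post does not show how FILTERS is defined.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class PrefixFilter {
    // Assumed filter (not from the original post): skip a few static-resource extensions.
    static final Pattern FILTERS = Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|gz)$");

    // The two prefixes from the question.
    static final String[] ALLOWED_PREFIXES = {
        "http://url1.com/timer/image",
        "http://url2.com/house/image"
    };

    // Same decision as the answer above, but on a plain String so it runs without crawler4j.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        if (FILTERS.matcher(href).matches()) {
            return false;
        }
        for (String prefix : ALLOWED_PREFIXES) {
            if (href.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://url1.com/timer/image/text.php")); // true
        System.out.println(shouldVisit("http://test1.com/timer/image"));         // false
        System.out.println(shouldVisit("http://url2.com/house/image/logo.png")); // false (filtered)
    }
}
```

In a real crawler the same body would sit inside shouldVisit(Page page, WebURL url), with url.getURL() as the input string.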
