I would like to use crawler4j to crawl only URLs that start with a certain prefix.
For example, a URL that starts with http://url1.com/timer/image
is valid, e.g. http://url1.com/timer/image/text.php
This URL is not valid: http://test1.com/timer/image
I tried to implement it like this:
    public boolean shouldVisit(Page page, WebURL url) {
        String href = url.getURL().toLowerCase();
        String adrs1 = "http://url1.com/timer/image";
        String adrs2 = "http://url2.com/house/image";
        if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))) {
            return false;
        }
        if (filters.matcher(href).matches()) {
            return false;
        }
        for (String crawlDomain : myCrawlDomains) {
            if (href.startsWith(crawlDomain)) {
                return true;
            }
        }
        return false;
    }
However, this does not seem to work, because the crawler also visits other URLs.
Any recommendations on what I could do?
I appreciate your answer!
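
To make the intended rule concrete, here is a minimal standalone sketch of the prefix check on its own, outside crawler4j. The class name PrefixFilter and the method isAllowed are hypothetical and only for illustration; the two prefixes are the ones from the snippet above. It assumes a URL should be accepted when it starts with at least one of the listed prefixes. For comparison, the condition !(href.startsWith(adrs1)) || !(href.startsWith(adrs2)) is true for any URL that does not start with both prefixes at once, which no URL can do here, so that branch would presumably need && instead of ||, or a loop like the one below.

```java
import java.util.Locale;

public class PrefixFilter {
    // Allowed prefixes taken from the question; the array itself is an assumption.
    private static final String[] ALLOWED_PREFIXES = {
        "http://url1.com/timer/image",
        "http://url2.com/house/image"
    };

    // Returns true when the URL starts with at least one allowed prefix.
    static boolean isAllowed(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        for (String prefix : ALLOWED_PREFIXES) {
            if (href.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://url1.com/timer/image/text.php"));
        System.out.println(isAllowed("http://test1.com/timer/image"));
    }
}
```

Inside shouldVisit, a check of this shape could replace the first if statement, returning false when no prefix matches, with the filters test kept after it.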