
I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not only the full URLs. Is this possible?

Thanks in advance.


1 Answer


In your class that derives from WebCrawler, get the contents of the page and then apply a regular expression:

// Requires: java.util.Map, java.util.HashMap,
// java.util.regex.Pattern, java.util.regex.Matcher
Map<String, String> urlLinkText = new HashMap<String, String>();
String content = new String(page.getContentData(), page.getContentCharset());
// Group 1 captures the href value, group 2 captures the link text
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    urlLinkText.put(matcher.group(1), matcher.group(2));
}

Then store urlLinkText somewhere you can reach it once your crawl is complete. For example, you could make it a private member of your crawler class and add a getter.
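Here is a minimal, self-contained sketch of that regex approach. The HTML snippet is made up for illustration; in a real crawl the content string would come from page.getContentData() and page.getContentCharset() inside your visit method:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextDemo {
    public static void main(String[] args) {
        // Stand-in for the page content a crawler would hand you
        String content = "<html><body>"
                + "<a href=\"http://example.com/a\">First link</a>"
                + "<a class=\"nav\" href=\"http://example.com/b\">Second link</a>"
                + "</body></html>";

        // Map each URL (group 1) to its link text (group 2)
        Map<String, String> urlLinkText = new HashMap<String, String>();
        Pattern pattern = Pattern.compile(
                "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            urlLinkText.put(matcher.group(1), matcher.group(2));
        }

        System.out.println(urlLinkText.get("http://example.com/a"));  // First link
        System.out.println(urlLinkText.get("http://example.com/b"));  // Second link
    }
}
```

Note that a regex like this handles only simple anchor tags (double-quoted href, no nested markup inside the link text); for anything messier, an HTML parser such as jsoup is a more robust choice.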
