I am using crawler4j to crawl a website. When I visit a page, I would like to get the link text of all the links, not only the full URLs. Is this possible?
Thanks in advance.
In your subclass of WebCrawler, get the contents of the page inside visit() and then apply a regular expression.
// Map from each link's URL to its anchor (link) text
Map<String, String> urlLinkText = new HashMap<String, String>();
// Decode the raw page bytes using the charset crawler4j detected
// (getContentCharset() can be null, so you may want a UTF-8 fallback)
String content = new String(page.getContentData(), page.getContentCharset());
// Group 1 captures the href value, group 2 the anchor text
Pattern pattern = Pattern.compile("<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
    urlLinkText.put(matcher.group(1), matcher.group(2));
}
Then stick urlLinkText somewhere you can get to it once your crawl is complete. For example, you could make it a private member of your crawler class and add a getter.
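The extraction step can be tried on its own, outside the crawler. Here is a self-contained sketch that runs the same pattern over a small HTML snippet (the class name, sample URLs, and link text are just illustrative); a LinkedHashMap is used so the links come out in document order:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextDemo {
    public static void main(String[] args) {
        // Stand-in for the decoded page content from crawler4j
        String content = "<html><body>"
            + "<a href=\"https://example.com/a\">First link</a>"
            + "<p>Some text</p>"
            + "<a class=\"nav\" href=\"https://example.com/b\">Second link</a>"
            + "</body></html>";

        // Group 1 captures the href value, group 2 the anchor text
        Pattern pattern = Pattern.compile(
            "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a[^>]*>",
            Pattern.CASE_INSENSITIVE);

        // LinkedHashMap preserves the order in which links appear
        Map<String, String> urlLinkText = new LinkedHashMap<String, String>();
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            urlLinkText.put(matcher.group(1), matcher.group(2));
        }

        for (Map.Entry<String, String> e : urlLinkText.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Bear in mind that a regex like this only handles the common case (double-quoted href, no nested tags inside the anchor); it will miss single-quoted or unquoted attributes.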