
I want to extract full links from an HTML file. By full links I mean absolute links. I am using Tika for this. Here is my code:

URL url = new URL("http://www.domainname.com/");
InputStream input = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler,
        textHandler, toHTMLHandler);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();
parser.parse(input, teeHandler, metadata, parseContext);
System.out.println("title:\n" + metadata.get("title"));
for (Link link : linkHandler.getLinks()) {
    System.out.println(link.getUri());
}

This gives me relative URLs like /index.html or documents/US/economicreport.html, but the absolute URL in the first case is http://domainname.com/index.html.

How can I get every link as a full link, including the domain name, in Java?

Valerij
Alex

2 Answers


If you have stored the base website URL in url, the following should work:

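URL url = new URL("http://www.domainname.com/");
String givenUrl = "/index.html"; // the parsed address, e.g. link.getUri()

String absoluteUrl; // declared outside the if/else so it is usable afterwards
if (givenUrl.startsWith("/")) {
    // Root-relative link: prepend scheme and host only, to avoid a double slash
    absoluteUrl = url.getProtocol() + "://" + url.getAuthority() + givenUrl;
} else {
    // Anything else is assumed to be absolute already
    absoluteUrl = givenUrl;
}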
Ron

Slightly better than the previous, but only slightly, is

URL targetDocumentUrl = new URL("http://www.domainname.com/content.html");
String parsedUrl = link.getUri(); // link is one of linkHandler.getLinks()
URL absoluteLink = new URL(targetDocumentUrl, parsedUrl); // resolve against the document URL

However, it is still not a good solution: it has problems when the HTML document contains a <base href="/"> tag and the link being parsed is relative and starts with "../".

Of course you can get around this in a number of ways, but they involve a bit of work, such as implementing a ContentHandler. For something so basic, I would expect there to be a simple way to do this with the Tika LinkContentHandler.
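
For what it's worth, here is a minimal sketch of the resolution step using only java.net.URI, with no Tika involved. The class name LinkResolver and the method resolveAll are made up for illustration. It assumes you already have the raw href strings (for example from link.getUri()), that you have found the base href value yourself if the page declares one, and that every href is a syntactically valid URI (URI.resolve throws IllegalArgumentException otherwise).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkResolver {

    // Hypothetical helper: resolves each raw href against the page URL,
    // or against the <base href> value when one is supplied (null if absent).
    static List<String> resolveAll(String pageUrl, String baseHref, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        if (baseHref != null && !baseHref.isEmpty()) {
            // The base href may itself be relative to the page URL
            base = base.resolve(baseHref);
        }
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            // Absolute hrefs come back unchanged; relative ones are merged with base
            absolute.add(base.resolve(href).normalize().toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "/index.html",
                "documents/US/economicreport.html",
                "http://other.example/page.html");
        for (String s : resolveAll("http://www.domainname.com/content.html", null, hrefs)) {
            System.out.println(s);
        }
    }
}
```

This does not make the "../" case go away: a link that climbs above the site root may still keep its ".." segments after resolution, so that case would need extra handling on top of this.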

Sully