Retrieving absolute URL from a webpage

Question

I want to extract the full links from an HTML file. By full links I mean absolute links. I used Tika for this purpose. Here is my code:

URL url = new URL("http://www.domainname.com/");
InputStream input = url.openStream();
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler,
        textHandler, toHTMLHandler);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();
parser.parse(input, teeHandler, metadata, parseContext);
System.out.println("title:\n" + metadata.get("title"));
for (Link link : linkHandler.getLinks()) {
    System.out.println(link.getUri());
}

This gives me relative URLs like /index.html or documents/US/economicreport.html, but the absolute URL in this case is http://domainname.com/index.html.

How can I get all the links correctly, meaning the full link including the domain name? How can I do that in Java?

How are the links written in the HTML page? If they are relative there, it's not so strange that you get relative links from the parser as well, is it? — Simon Forsberg, Oct 05 '13 at 10:36
*"Sorry Andrew"* No need for apologies, far better to.. *"I will remember this in future."* ..offer an assurance of future action. :) — Andrew Thompson, Oct 05 '13 at 10:54

score 0 · Answer 1 · answered Oct 05 '13 at 10:47

If you have stored the base website URL in url, the following should work:

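URL url = new URL("http://www.domainname.com/");
String givenUrl = ""; // This is the parsed address

// Declared outside the if-else so the result stays in scope afterwards
String absoluteUrl;
if (givenUrl.startsWith("/")) {
    // url already ends with '/', so drop the leading '/' of the parsed link
    absoluteUrl = url + givenUrl.substring(1);
} else {
    absoluteUrl = givenUrl;
}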

Ron

Your `absoluteUrl` is inaccessible (and therefore will be completely removed by the compiler) outside your if-else statement – Germann Arlington Oct 05 '13 at 11:47

Sully · Answer 2 · 2015-05-29T07:32:01.823

Slightly better than the previous answer, but only slightly, is:

URL targetDocumentUrl = new URL("http://www.domainname.com/content.html");
String parsedUrl = link.getUri();
// Resolve the parsed link against the URL of the document it came from
URL absoluteLink = new URL(targetDocumentUrl, parsedUrl);

However, it is still not a good solution, as it has problems when the HTML document contains the tag <base href="/"> and the link being parsed is relative and starts with "../".

Of course, you can get around this in a number of ways, but they involve a bit of work, such as implementing a ContentHandler. I have to think that for something so basic there must be a simple way to do this with the Tika LinkContentHandler.
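
For what it's worth, the resolution step itself can be sketched with plain java.net.URI instead of Tika. This is only a rough sketch under some assumptions: the class and method names (LinkResolver, effectiveBase, toAbsolute) are made up for illustration, and the base href value is assumed to have been read from the document by some other means, since nothing here relies on LinkContentHandler exposing it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: resolves parsed hrefs with java.net.URI.
public class LinkResolver {

    // Returns the URI that relative links are resolved against: the page URL,
    // or the page's base href (assumed to be obtained separately) resolved
    // against the page URL when one is present.
    public static URI effectiveBase(String pageUrl, String baseHref) {
        URI page = URI.create(pageUrl);
        if (baseHref == null || baseHref.trim().isEmpty()) {
            return page;
        }
        return page.resolve(baseHref.trim());
    }

    // Resolves each href against the base. Hrefs that are already absolute
    // come back unchanged. URI.resolve(String) throws IllegalArgumentException
    // for hrefs that are not valid URIs (for example, ones containing spaces).
    public static List<String> toAbsolute(URI base, List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            result.add(base.resolve(href.trim()).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // The page URL below is an example value, not taken from a real site.
        URI base = effectiveBase("http://www.domainname.com/docs/US/report.html", null);
        List<String> hrefs = Arrays.asList(
                "/index.html", "summary.html", "../index.html", "http://example.org/a.html");
        for (String absolute : toAbsolute(base, hrefs)) {
            System.out.println(absolute);
        }
    }
}
```

In the loop from the question, the strings returned by link.getUri() would be what gets passed in as hrefs. With the example page URL above, "/index.html" resolves to http://www.domainname.com/index.html and "../index.html" to http://www.domainname.com/docs/index.html. The sketch uses a page URL that includes a path, and it does not try to handle a "../" that climbs above the site root.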
