I want to extract the full links from an HTML file. By full links I mean absolute URLs. I am using Apache Tika for this. Here is my code:
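    import java.io.InputStream;
    import java.net.URL;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.sax.Link;
    import org.apache.tika.sax.LinkContentHandler;
    import org.apache.tika.sax.TeeContentHandler;
    import org.apache.tika.sax.ToHTMLContentHandler;
    import org.xml.sax.ContentHandler;

    URL url = new URL("http://www.domainname.com/");
    InputStream input = url.openStream();
    // Collect links, body text, and the HTML in a single parse
    LinkContentHandler linkHandler = new LinkContentHandler();
    ContentHandler textHandler = new BodyContentHandler();
    ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
    TeeContentHandler teeHandler = new TeeContentHandler(linkHandler,
            textHandler, toHTMLHandler);
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    HtmlParser parser = new HtmlParser();
    parser.parse(input, teeHandler, metadata, parseContext);
    System.out.println("title:\n" + metadata.get("title"));
    // Print every link found in the page
    for (Link link : linkHandler.getLinks()) {
        System.out.println(link.getUri());
    }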
This gives me relative URLs such as /index.html or documents/US/economicreport.html, but the absolute URL in the first case should be http://www.domainname.com/index.html.
How can I get every link in its full form, including the domain name, in Java?
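To make the expected result concrete, here is a minimal sketch of the kind of resolution I have in mind. It uses only java.net.URI from the JDK, not Tika, and assumes each value returned by link.getUri() is a syntactically valid relative or absolute reference. The class name LinkResolver and the helper toAbsolute are hypothetical names used only for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkResolver {

    // Hypothetical helper: resolves each href against the URL of the page
    // it was found on. Already-absolute hrefs are returned unchanged.
    // URI.create throws IllegalArgumentException for malformed input
    // (for example an href containing a raw space), which is not handled here.
    static List<String> toAbsolute(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "/index.html",
                "documents/US/economicreport.html",
                "http://example.org/other.html");
        for (String link : toAbsolute("http://www.domainname.com/", hrefs)) {
            System.out.println(link);
        }
        // prints:
        // http://www.domainname.com/index.html
        // http://www.domainname.com/documents/US/economicreport.html
        // http://example.org/other.html
    }
}
```

In the Tika loop above, the same idea would presumably amount to calling url.toURI().resolve(link.getUri()) for each link, but I have not verified how that behaves for every kind of href Tika can return (empty strings, mailto: links, fragments).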