I have just started working on a content extraction project. As a first step, I am trying to get the image URLs from a web page. In some cases, the "src" attribute of an "img" element contains a relative URL, but I need the complete (absolute) URL.

I was looking for a Java library to achieve this and thought Jsoup would be useful. Is there any other library that makes this easy?

Slowcoder
    Not likely. You need to keep a reference to the base path yourself. You can use URL to extract the various elements of the spec to help you. – MadProgrammer Feb 19 '13 at 21:00

1 Answer

If you just need to get the complete URL from a relative one, the solution is simple in Java:

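import java.net.URL;

URL pageUrl = new URL("http://example.com/articles/page.html"); // example value: the URL the HTML was fetched from
String src = "../images/logo.png"; // example value: the src attribute, relative or absolute
URL imgUrl = new URL(pageUrl, src); // http://example.com/images/logo.png; may throw MalformedURLException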

The base URL of the HTML page is usually just the URL you obtained the HTML code from. However, a <base> tag in the document header may specify a different base URL, although it is not used very frequently.
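To make the base URL rule concrete, here is a minimal sketch that resolves a few kinds of src values, with and without a <base> override. The class name RelativeUrlResolver, the resolve helper and all example.com addresses are made up for illustration; only the java.net.URL constructors come from the JDK.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of any library.
final class RelativeUrlResolver {

    /**
     * Resolves src against the page URL, or against the href of a <base> tag
     * when one is given. Pass null for baseHref if the page has no <base> tag.
     */
    static String resolve(String pageUrl, String baseHref, String src) {
        try {
            URL base = new URL(pageUrl);
            if (baseHref != null && !baseHref.isEmpty()) {
                // The <base> href may itself be relative to the page URL.
                base = new URL(base, baseHref);
            }
            // An absolute src simply replaces the base; a relative one is merged with it.
            return new URL(base, src).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve: " + src, e);
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/articles/2013/page.html"; // assumed page URL

        // prints http://example.com/articles/2013/images/a.png
        System.out.println(resolve(page, null, "images/a.png"));
        // prints http://example.com/articles/b.png
        System.out.println(resolve(page, null, "../b.png"));
        // prints http://example.com/c.png
        System.out.println(resolve(page, null, "/c.png"));
        // prints http://cdn.example.org/d.png
        System.out.println(resolve(page, null, "http://cdn.example.org/d.png"));
        // prints http://static.example.com/assets/e.png
        System.out.println(resolve(page, "http://static.example.com/assets/", "e.png"));
    }
}
```

Note that on recent JDKs (20 and later) the URL constructors are deprecated; URI.create(pageUrl).resolve(src) should be a close equivalent there, but check its behaviour on your inputs before switching.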

You may use Jsoup or just a DOM parser to obtain the src attribute values and to find the <base> tag, if there is one.
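As a rough illustration of the extraction step, the sketch below collects the img src values and the first <base> href, then resolves each src. To stay free of dependencies it uses the HTML parser bundled with the JDK (javax.swing.text.html) as a stand-in for Jsoup or a DOM parser; that parser is old and lenient, so treat this as a demonstration rather than a recommendation. The class name ImageSrcExtractor, the method name and the sample HTML are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library.
final class ImageSrcExtractor {

    /** Returns the absolute URLs of all img elements found in html. */
    static List<String> extractImageUrls(String html, String pageUrl) {
        final List<String> srcValues = new ArrayList<>();
        final String[] baseHref = new String[1]; // holds the first <base href>, if any

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        srcValues.add(src.toString());
                    }
                } else if (tag == HTML.Tag.BASE && baseHref[0] == null) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        baseHref[0] = href.toString();
                    }
                }
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);

            // Use the <base> href when present, otherwise the page URL itself.
            URL base = new URL(pageUrl);
            if (baseHref[0] != null) {
                base = new URL(base, baseHref[0]);
            }
            List<String> result = new ArrayList<>();
            for (String src : srcValues) {
                result.add(new URL(base, src).toString());
            }
            return result;
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid URL in document", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample document and page URL are assumptions for the demo.
        String html = "<html><head><title>demo</title></head><body>"
                + "<img src=\"images/a.png\"><img src=\"/b.png\"></body></html>";
        for (String url : extractImageUrls(html, "http://example.com/articles/page.html")) {
            System.out.println(url);
        }
    }
}
```

With Jsoup, as far as I know, the same result takes less code: parse with Jsoup.parse(html, baseUri) and call absUrl("src") on each img element, which resolves the attribute against the document's base URI.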

radkovo