
I scrape a website and find these links on a page:

index.html
bla.html
/index.html
A.com/test.html
http://wwww.B.com/bla.html

If I know the current page is www.A.com/some/path, how can I effectively convert these links into "real URLs"? In each case, the links should translate to:

index.html => http://www.A.com/some/path/index.html
bla.html => http://www.A.com/some/path/bla.html
/index.html => http://www.A.com/index.html
A.com/test.html => http://www.A.com/test.html
http://wwww.B.com/bla.html => http://wwww.B.com/bla.html

What is the most effective way to convert these on-page links to fully qualified URLs?

rockstardev

2 Answers


Use the java.net.URL class:

URL basePath = new URL("http://www.A.com/some/path/"); // note the trailing slash
String relativePath = "index.html";
URL absolute = new URL(basePath, relativePath); // http://www.A.com/some/path/index.html

The two-argument constructor resolves the relative URL against the base. If the "relative" URL is actually absolute, it is returned as-is. The trailing slash on the base matters: without it, "path" is treated as a file rather than a directory, and "index.html" would resolve to http://www.A.com/some/index.html.
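As a quick check of these resolution rules against the links from the question (a self-contained sketch; the class name is mine):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveLinks {
    public static void main(String[] args) throws MalformedURLException {
        // Trailing slash so "path" is treated as a directory.
        URL base = new URL("http://www.A.com/some/path/");

        String[] links = {
            "index.html",                 // relative file
            "bla.html",                   // relative file
            "/index.html",                // site-root-relative
            "http://wwww.B.com/bla.html"  // already absolute, returned as-is
        };

        for (String link : links) {
            System.out.println(link + " => " + new URL(base, link));
        }
    }
}
```

One caveat: a scheme-less link like "A.com/test.html" is treated by this constructor as a relative *path* (resolving to http://www.A.com/some/path/A.com/test.html), not as a host name, so that case from the question needs special handling.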

Brigham

@Brigham's answer is correct but incomplete.

The problem is that the page you scraped the URLs from could include a <base> element in its <head>. This base URL may be significantly different from the URL that you fetched the page from.

For example:

<!DOCTYPE html> 
<html>
  <head>
    <base href="http://www.example.com/">
    ...
  </head>
  <body>
    ...
  </body>
</html>

In the ... sections, any relative URLs will be resolved relative to the base URL rather than the original page URL.


This means that if you want to resolve "scraped" URLs correctly in all cases, you also need to look for any <base> elements as you scrape.
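A minimal sketch of that idea, assuming the page HTML is available as a string (the regex-based detection and the class name are mine; a real scraper should use a proper HTML parser, since a regex will miss unusual markup). Note that the <base> href may itself be relative, so it is resolved against the page URL first:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BaseAwareResolver {
    // Crude match for <base href="...">; sufficient for this sketch only.
    private static final Pattern BASE_TAG =
        Pattern.compile("<base\\s+href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Resolve a scraped link against the page's <base> if one is present,
    // otherwise against the URL the page was fetched from.
    static URL resolve(String pageHtml, URL pageUrl, String link)
            throws MalformedURLException {
        Matcher m = BASE_TAG.matcher(pageHtml);
        URL base = m.find() ? new URL(pageUrl, m.group(1)) : pageUrl;
        return new URL(base, link);
    }

    public static void main(String[] args) throws MalformedURLException {
        String html =
            "<html><head><base href=\"http://www.example.com/\"></head></html>";
        URL pageUrl = new URL("http://www.A.com/some/path/");
        // Resolves against the <base>, not the page URL:
        System.out.println(resolve(html, pageUrl, "index.html"));
    }
}
```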

Stephen C