
I'm trying to build a web crawler in Java, and I'm wondering whether there is a way to get a relative path from an absolute URL, given the URL of the page that contains it. I want to replace every absolute URL in the HTML that points to the same domain with a relative one.

Because the HTTP URLs contain unsafe characters, I was not able to use Java's URI class as described in "How to construct a relative path in Java from two absolute paths (or URLs)?".

I'm using jsoup to parse the HTML. It seems to be able to resolve a relative path into an absolute one, but not the other way round.

For example, take the page at the following URL:

    http://www.example.com/mysite/base.html

The page source of base.html can contain:

    <a href="http://www.example.com/myanothersite/new.html"> Another site of mine </a>

I am trying to cache this base.html and edit it so that it contains this instead:

    <a href="../myanothersite/new.html">Another site of mine</a>
Wee
  • So, you have "http://www.example.com/mysite/whatever" as base and want to have all sites that start with that, relative to it? Or relative to what? – Angelo Fuchs Sep 30 '13 at 11:31
  • Yes. Basically I want to change all absolute urls in that particular html to become relative url using that particular html url as base. – Wee Sep 30 '13 at 11:42
  • Please revisit my guess at what you want to have in your question. – Angelo Fuchs Sep 30 '13 at 11:50
  • possible duplicate of [How to construct a relative path in Java from two absolute paths (or URLs)?](http://stackoverflow.com/questions/204784/how-to-construct-a-relative-path-in-java-from-two-absolute-paths-or-urls) – Raedwald Sep 30 '13 at 12:07

2 Answers


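A different approach that does not need a given baseUrl: split both URLs on "/", skip the leading elements they share, then emit one "../" for every remaining source directory, followed by the rest of the target.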

    String sourceUrl = "http://www.example.com/mysite/whatever/somefolder/bar/unsecure!+?#whätyöühäv€it/site.html"; // your current site
    String targetUrl = "http://www.example.com/mysite/whatever/otherfolder/other.html"; // the link target
    String expectedTarget = "../../../otherfolder/other.html";
    String[] sourceElements = sourceUrl.split("/");
    String[] targetElements = targetUrl.split("/"); // keep in mind that the arrays are of different length!
    StringBuilder uniquePart = new StringBuilder();
    StringBuilder relativePart = new StringBuilder();
    boolean stillSame = true;
    for (int ii = 0; ii < sourceElements.length || ii < targetElements.length; ii++) {
        // skip the leading elements that both URLs share
        if (stillSame && ii < targetElements.length && ii < sourceElements.length
                && sourceElements[ii].equals(targetElements[ii])) continue;
        stillSame = false;
        if (targetElements.length > ii)
            uniquePart.append("/").append(targetElements[ii]);
        // one "../" per remaining source directory; the last source element is the file name
        if (sourceElements.length > ii + 1)
            relativePart.append("../");
    }

    // uniquePart starts with "/", so drop that slash instead of trimming relativePart;
    // this also works when both pages are in the same directory and relativePart is empty
    String result = relativePart.toString() + (uniquePart.length() > 0 ? uniquePart.substring(1) : "");
    System.out.println("result: " + result);
    System.out.println("expectation: " + expectedTarget.equals(result));
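
To apply this to every link a crawler finds, the same idea can be wrapped in a method. The following is only a sketch under some assumptions: the class name `RelativeUrls` and the method `relativize` are made up for illustration, both URLs are expected to be absolute, to end in a file name, and to carry no query string or fragment. Links to a different host are returned unchanged, which the snippet above does not check. It works on plain strings, so characters that `java.net.URI` rejects do not get in the way.

```java
import java.util.Arrays;

class RelativeUrls {

    /**
     * Hypothetical helper: expresses targetUrl relative to the page at sourceUrl.
     * Assumes absolute URLs that end in a file name and have no query or fragment.
     */
    static String relativize(String sourceUrl, String targetUrl) {
        String[] source = sourceUrl.split("/");
        String[] target = targetUrl.split("/");

        // Elements 0..2 are the scheme ("http:"), an empty string and the host.
        // If they differ, the link points to another site and stays absolute.
        if (source.length < 3 || target.length < 3
                || !Arrays.equals(Arrays.copyOf(source, 3), Arrays.copyOf(target, 3))) {
            return targetUrl;
        }

        // Count the leading elements both URLs share, never counting a file name.
        int common = 3;
        int limit = Math.min(source.length, target.length) - 1;
        while (common < limit && source[common].equals(target[common])) {
            common++;
        }

        StringBuilder result = new StringBuilder();
        // One "../" per source directory below the shared part.
        for (int i = common; i < source.length - 1; i++) {
            result.append("../");
        }
        // Then the target elements that are not shared.
        for (int i = common; i < target.length; i++) {
            result.append(target[i]);
            if (i < target.length - 1) {
                result.append('/');
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The example from the question; prints ../myanothersite/new.html
        System.out.println(relativize(
                "http://www.example.com/mysite/base.html",
                "http://www.example.com/myanothersite/new.html"));
    }
}
```

With jsoup, the absolute form of each link could be passed in as `targetUrl` and the returned value written back to the `href` attribute.
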
Angelo Fuchs

This should do it. Keep in mind that you can also calculate the baseUrl yourself by measuring how far the source and target URLs are the same.

    String baseUrl = "http://www.example.com/mysite/whatever/"; // the base of your site
    String sourceUrl = "http://www.example.com/mysite/whatever/somefolder/bar/unsecure!+?#whätyöühäv€it/site.html"; // your current site
    String targetUrl = "http://www.example.com/mysite/whatever/otherfolder/other.html"; // the link target
    String expectedTarget = "../../../otherfolder/other.html";
    // cut away the base.
    if(sourceUrl.startsWith(baseUrl))
        sourceUrl = sourceUrl.substring(baseUrl.length());
    if(!sourceUrl.startsWith("/"))
        sourceUrl = "/" + sourceUrl;

    // construct the relative levels up
    StringBuilder bar = new StringBuilder();
    while(sourceUrl.startsWith("/"))
    {
        if(sourceUrl.indexOf("/", 1) > 0) {
            bar.append("../");
            sourceUrl = sourceUrl.substring(sourceUrl.indexOf("/", 1));
        } else {
            break;
        }
        System.out.println("foo: " + sourceUrl);
    }

    // add the unique part of the target
    // (assumes targetUrl also starts with baseUrl; otherwise keep the link absolute)
    targetUrl = targetUrl.substring(baseUrl.length());
    bar.append(targetUrl);

    System.out.println("expectation: " + expectedTarget.equals(bar.toString()));
    System.out.println("bar: " + bar);
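
Calculating the baseUrl from the two URLs, as mentioned at the top of this answer, could look roughly like the sketch below. The class name `CommonBase` and the method `commonBase` are hypothetical; the method walks both strings while their characters match and cuts the result back to the last "/" so that a directory name is never split in the middle.

```java
class CommonBase {

    /**
     * Hypothetical helper: the longest prefix shared by both URLs that ends on a "/".
     * For URLs on different hosts this is only the scheme part (e.g. "http://"),
     * so a caller should check that the result reaches past the host.
     */
    static String commonBase(String sourceUrl, String targetUrl) {
        int max = Math.min(sourceUrl.length(), targetUrl.length());
        int lastSlash = -1;
        // Walk both strings while they agree, remembering the last "/" seen.
        for (int i = 0; i < max && sourceUrl.charAt(i) == targetUrl.charAt(i); i++) {
            if (sourceUrl.charAt(i) == '/') {
                lastSlash = i;
            }
        }
        return sourceUrl.substring(0, lastSlash + 1);
    }

    public static void main(String[] args) {
        // Prints http://www.example.com/mysite/whatever/
        System.out.println(commonBase(
                "http://www.example.com/mysite/whatever/somefolder/bar/unsecure!+?#whätyöühäv€it/site.html",
                "http://www.example.com/mysite/whatever/otherfolder/other.html"));
    }
}
```

The returned value can be used as `baseUrl` in the code above.
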
Angelo Fuchs