
I am doing some web crawling and ran into the question of when to add a trailing slash to a URL. Some sites use one and some don't, and entering the wrong form in a browser just redirects you to the right one. Normalization would add the slash at the end, but that causes a problem when converting relative URLs to absolute ones.

For example, suppose a user supplies the absolute URL http://stack.com/more, the server redirects it to the actual URL http://stack.com/more/, and the relative URL is index.html.

Then doing URL newurl = new URL(url, relativeURL);

yields http://stack.com/index.html (a nonexistent page)

when it should actually be http://stack.com/more/index.html (the real page).

Does anyone know a good way to correctly append the slash at the end?

Dan

1 Answer


If a relative URL starts with a /, it's only relative to the root (the domain). So both

http://stack.com/more/ + /index.html

and

http://stack.com/more + /index.html

are correctly resolved to

http://stack.com/index.html

not

http://stack.com/more/index.html

In your example, it makes no difference whatsoever whether there's a / at the end of more.

The trick comes in when there's no leading slash on the relative URL, e.g. index.html. When resolving those, you're supposed to remove the last segment and replace it with the relative path. It would make a difference in that case, because

http://stack.com/more/ + index.html

resolves to

http://stack.com/more/index.html

but

http://stack.com/more + index.html

resolves to

http://stack.com/index.html

(index.html replaces more, because more is the final segment).
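The cases above can be checked with a short Java sketch. It uses `java.net.URI.resolve` rather than the `URL` constructor from the question; the class name `ResolveDemo` and the helper `resolve` are made up for illustration.

```java
import java.net.URI;

public class ResolveDemo {
    // Resolve a relative reference against a base URL using the standard
    // rules: a leading "/" replaces the whole path, otherwise the
    // reference replaces the last segment of the base path.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Root-relative reference: the trailing slash on the base is irrelevant.
        System.out.println(resolve("http://stack.com/more/", "/index.html")); // http://stack.com/index.html
        System.out.println(resolve("http://stack.com/more", "/index.html"));  // http://stack.com/index.html

        // Path-relative reference: the trailing slash decides the result.
        System.out.println(resolve("http://stack.com/more/", "index.html"));  // http://stack.com/more/index.html
        System.out.println(resolve("http://stack.com/more", "index.html"));   // http://stack.com/index.html
    }
}
```

One caveat with this sketch: the base URLs here all have a non-empty path. A base with no path at all (for example `http://stack.com`) is a separate edge case and is best given an explicit `/` before resolving.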

T.J. Crowder
  • For my example the relative URL was not supposed to have a / at the beginning; that was a typo, sorry. Say you don't know whether the URL should have the / at the end. Is there a strategy to decide the correct URL? In your case, `http://stack.com/more/index.html` – Dan Nov 06 '12 at 11:23
  • @Dan: `http://stack.com/more` + `index.html` is correctly `http://stack.com/index.html`, not `http://stack.com/more/index.html`. I suspect the only way to know how to combine them is to make sure you have the canonical form of the first one. If the site responds to `http://stack.com/more` by redirecting you to `http://stack.com/more/`, then you know adding `index.html` should be `http://stack.com/more/index.html`. If it directly responds to `http://stack.com/more`, then you know that adding `index.html` to that should be `http://stack.com/index.html`. – T.J. Crowder Nov 06 '12 at 11:29
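
The strategy in the last comment (let the server tell you the canonical form, then resolve against that) could look roughly like the following. This is only a sketch under stated assumptions: it needs Java 11+ for `java.net.http.HttpClient`, it assumes the server answers `HEAD` requests (some do not, in which case a `GET` would be needed), and the class and method names (`CanonicalResolver`, `canonicalize`, `resolveLink`) are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CanonicalResolver {
    // Client that follows redirects, so the response reports the final URL.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Ask the server which URL it actually serves. If /more redirects to
    // /more/, response.uri() returns the redirected form.
    static URI canonicalize(URI url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(url)
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.discarding());
        return response.uri();
    }

    // Resolve a link found on the page against the canonical base.
    static URI resolveLink(URI canonicalBase, String relative) {
        return canonicalBase.resolve(relative);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: CanonicalResolver <base-url> <relative-url>");
            return;
        }
        URI base = canonicalize(URI.create(args[0]));
        System.out.println(resolveLink(base, args[1]));
    }
}
```

In a crawler the fetch of the page itself already yields the final URL, so a separate `HEAD` request is usually unnecessary: resolving each link against the URL the page was actually served from gives the same result.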