I just wrote a script that extracts the article from a webpage using server-side JS. (If you're interested: it's used for http://pipes.yahoo.com/fb55/expandr .)

I have a small problem with internal links. Some pages include root-relative links like:

/subfolder/subpage.html

What I need to do is fix them by prepending the site root, like this:

protocol://secondlevel.firstlevel/subfolder/subpage.html

I'm using E4X to process the page. I'd rather not show my current attempt; it's buggy and slow. Does anybody have a solution for me?

fb55

1 Answer

You may be able to rewrite them with a regular expression:

var baseUrl = "http://somesite.com/somepage";
var root = baseUrl.match(/^[^:]+:\/\/[^\/]+\//)[0];
// "http://somesite.com/"

var HTML = "<a href='/testing'>test</a> and <a class='test' href=\"/foo/bar\"> </a>";

// replace() returns a new string and leaves HTML unchanged, so keep the result
var fixed = HTML.replace(/<a [^>]*href=["']\/([^'"]+)["']/ig, function (whole, url) {
  return whole.replace("/"+url, root+url);
});

// "<a href='http://somesite.com/testing'>test</a> and <a class='test' href=\"http://somesite.com/foo/bar\"> </a>"
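
The regular expression above assumes the base URL has a slash after the host, and it would also rewrite protocol-relative links such as //cdn.example.com/x. A slightly more defensive sketch is below; the function name absolutizeLinks and its signature are hypothetical, chosen only for illustration, and it has not been tried inside Yahoo Pipes or against E4X output. Like the original, it only looks at href attributes on a tags.

```javascript
// Hypothetical helper: prefixes root-relative hrefs in <a> tags with the
// scheme and host taken from baseUrl. Name and signature are illustrative.
function absolutizeLinks(html, baseUrl) {
  // Scheme and host without a trailing slash, e.g. "http://somesite.com".
  var m = baseUrl.match(/^[^:\/]+:\/\/[^\/]+/);
  if (!m) {
    return html; // baseUrl is not absolute, so leave the markup untouched
  }
  var origin = m[0];
  // Groups: 1 = tag text up to "href=", 2 = the opening quote,
  // 3 = a path that starts with exactly one "/" (the lookahead skips
  // protocol-relative "//host/..." links). \2 requires the same closing quote.
  return html.replace(/(<a\s[^>]*?href=)(["'])(\/(?!\/)[^"']*)\2/ig,
    function (whole, prefix, quote, path) {
      return prefix + quote + origin + path + quote;
    });
}

var sample = "<a href='/testing'>test</a> and <a class='test' href=\"/foo/bar\"> </a>";
var result = absolutizeLinks(sample, "http://somesite.com/somepage");
// result: "<a href='http://somesite.com/testing'>test</a> and <a class='test' href=\"http://somesite.com/foo/bar\"> </a>"
```

Rebuilding the attribute from the captured groups avoids the second replace() call inside the callback, which could otherwise hit an earlier occurrence of the same path elsewhere in the tag.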
gnarf
  • That's a nice solution, even if there is some performance to gain. (But it's way better than my code.) Thank you! – fb55 May 30 '10 at 13:58