2

I am making a search engine from scratch (lol), and I am stuck on this problem:

When a user submits a URL, my "spider" "crawls" it for other links. Some people of course use `<a href="/page">` instead of `<a href="http://long-domain.com/page">`, so I detect that with `if (substr($link->getAttribute('href'), 0, 1) == '/')`

and add the domain in front of it. BUT, whenever I do add a domain, some links become `http://php.net//abcd`. As you can see, there is a `//` in the path.

Now, my idea was to make my script edit the submitted URL so if it has a slash at the end, it'll be removed, but I have no idea how to remove it.

Marcel Korpel
user2153768
  • 2
    Relative URLs are _a lot_ more complicated than just starting with a slash. In fact, **many** relative URLs do not begin with a slash at all - and still don't include a domain name. – Colin M Mar 10 '13 at 12:47
  • 1
    ... and then you might have domain names in valid relative URLs (`/www.domain.com/pages`); and protocol-relative URLs `//domain.com/page` – Pekka Mar 10 '13 at 12:48
  • I haven't experienced those problems before, I'll do a small detection. Thanks. – user2153768 Mar 10 '13 at 12:49
  • @user2153768 A link on `http://google.com/directory` may appear just as `index.html`, which means the result should be: `http://google.com/directory/index.html`. It may be `/index.html`, meaning it should be: `http://google.com/index.html`, or it could be `./index.html` or `../index.html` or any number of other combinations. This approach isn't likely to work well at all. – Colin M Mar 10 '13 at 12:50
  • 2
    That could be interesting for you: http://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php – Nabab Mar 10 '13 at 12:51
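The cases raised in the comments above (absolute, protocol-relative, root-relative, and document-relative links with `./` or `../`) can be sketched in one function. This is a minimal sketch, not a full RFC 3986 resolver: `resolve_href` is a hypothetical name, the base is assumed to be an absolute `http`/`https` URL, and query strings, fragments, and schemes such as `mailto:` are not handled. Note that under the standard resolution rules the last path segment of the base is replaced, so the base needs a trailing slash (`http://google.com/directory/`) for `index.html` to land inside `directory/`.

```php
<?php
// Hypothetical helper: resolve $href found on the page at $base (an absolute URL).
// Assumption: no query string or fragment handling, http/https bases only.
function resolve_href(string $base, string $href): string
{
    // Already absolute: the link carries its own scheme.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'];
    $origin = $scheme . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //domain.com/page reuses the scheme of the base.
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;
    }

    if (substr($href, 0, 1) === '/') {
        // Root-relative: the link replaces the whole path.
        $path = $href;
    } else {
        // Document-relative: keep the base path up to its last slash.
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Collapse "." and ".." segments; the leading empty segment marks the root.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }

    return $origin . implode('/', $out);
}

echo resolve_href('http://google.com/directory/', '../index.html'), "\n";
```

For production crawling, a maintained URL library is likely a safer choice than hand-rolled resolution.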

2 Answers

11

You can use rtrim

$url = rtrim($url, '/');

It removes every trailing / from the string and leaves the string unchanged if there is none.
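To avoid both the doubled slash and a missing one, the trim can be applied to both sides of the join. A minimal sketch, assuming a hypothetical helper named `join_url` that receives the domain and a path found in a link; it only joins the two pieces and does not resolve `./` or `../`:

```php
<?php
// Hypothetical helper, not part of the original answer.
// Strips trailing slashes from the domain and leading slashes from the path,
// then joins them with exactly one slash.
function join_url(string $domain, string $path): string
{
    return rtrim($domain, '/') . '/' . ltrim($path, '/');
}

echo join_url('http://php.net/', '/abcd'), "\n"; // http://php.net/abcd
echo join_url('http://php.net', 'abcd'), "\n";   // http://php.net/abcd
```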

Marko D
  • Except now this creates an additional problem with relative URLs that don't begin with a leading slash. You end up with `http://domain.comdirectory/file` – Colin M Mar 10 '13 at 12:49
  • That was the answer to the question of how to remove the last `/`. About relative urls, he should just check if url is starting with `http`, and if it is, then it's absolute url – Marko D Mar 10 '13 at 12:51
  • @MarkoD Checking if a URL has a scheme simply tells you if it's relative or absolute. It doesn't tell you how to resolve it at all. Also, yes - you did answer the question. But in doing so you created the exact _opposite_ problem (now, instead of double slashes on URLs he will end up with _no_ slashes on many URLs) – Colin M Mar 10 '13 at 12:53
  • @ColinMorelli Nabab gave him a good link how to solve that problem, I just answered his initial question, though I understand your point :-) – Marko D Mar 10 '13 at 12:56
  • @MarkoD I know - and I didn't downvote you. Just explaining why I commented in the first place. – Colin M Mar 10 '13 at 12:57
  • @ColinMorelli Thanks, we have an understanding :) – Marko D Mar 10 '13 at 12:58
  • I just reproduced the problem with this HTMLBIN - http://html-bin.appspot.com/aghodG1sLWJpbnIMCxIEUGFnZRiJ-GIM – user2153768 Mar 10 '13 at 13:00
  • can you explain in more details? – Marko D Mar 10 '13 at 13:02
  • Like others said I solved the problem with `` but created a new one with ``. – user2153768 Mar 10 '13 at 13:06
  • have a look at the link Nabab left you in a comment – Marko D Mar 10 '13 at 13:07
1

Just do a string replace on the final URL:

<?php $final_url=str_replace("//","/",$your_link_to_be_crawled); ?>

That is simple enough, but it also turns http:// into http:/.

To put the // back after the scheme, let's use preg_replace:

<?php
// Example input: a URL whose "//" was already collapsed by the str_replace above.
$your_url_to_crawl = 'http:/php.net/abcd';
// Anchor each pattern at the start and match a scheme followed by a single slash.
// Add one pattern/replacement pair for any other protocol you need.
$patterns = array('#^http:/(?!/)#', '#^https:/(?!/)#');
$replacements = array('http://', 'https://');
echo preg_replace($patterns, $replacements, $your_url_to_crawl); // http://php.net/abcd
?>
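The two-step replace above can also be done in one pass by skipping any slash that directly follows a colon. This is a sketch under assumptions: `collapse_slashes` is a hypothetical name, and the pattern would also collapse a `//` that appears inside a query string, so it suits simple URLs only.

```php
<?php
// Hypothetical helper: collapse runs of slashes except the "//" after the scheme.
// The lookbehind (?<!:) refuses to start a match right after a colon,
// so "http://" and "https://" are left untouched.
function collapse_slashes(string $url): string
{
    return preg_replace('#(?<!:)/{2,}#', '/', $url);
}

echo collapse_slashes('http://php.net//abcd'), "\n"; // http://php.net/abcd
```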

jcobhams
  • OMG..dnt think abt that...hmmmmmmm – jcobhams Mar 10 '13 at 13:02
  • how about you remove the slashes the first time, of course the http:// will be affected then you put it back..look at my new answer – jcobhams Mar 10 '13 at 13:08
  • I thought about it, but I'll try something more efficient. – user2153768 Mar 10 '13 at 13:08
  • I did a little editing to my code just now and was able to somehow fix it. – user2153768 Mar 10 '13 at 13:16
  • post your answer please as this could help someone else – jcobhams Mar 10 '13 at 13:19
  • I used `ltrim();`. I really don't know how I did it but I used that function and I used `str_replace();`. – user2153768 Mar 10 '13 at 13:24
  • alright then...when ur done with the search engine be sure to let me know...lets take it for a test run...lol.. – jcobhams Mar 10 '13 at 13:31
  • lol ok I'll post a comment on here. It's not done yet at all, I am just making an admin center where you add URLs, but I think it looks nice so far. Thanks for the help guys. I added a valid URL detection, so if a URL is invalid (eg "index.php" without a URL), it won't be added. – user2153768 Mar 10 '13 at 13:38
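The asker did not post the validity check mentioned in the last comment. One possible form, as a sketch only, uses PHP's built-in `filter_var` with `FILTER_VALIDATE_URL`; the function name `is_crawlable_url` and the restriction to `http`/`https` are assumptions, not the asker's code.

```php
<?php
// Hypothetical helper: accept only absolute http(s) URLs.
function is_crawlable_url(string $url): bool
{
    // FILTER_VALIDATE_URL rejects strings without a scheme, such as "index.php".
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

var_dump(is_crawlable_url('index.php'));           // bool(false)
var_dump(is_crawlable_url('http://php.net/abcd')); // bool(true)
```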