I would like to replace relative URLs with absolute URLs in a textarea. So something like this:

/somefolder/somefile 

should be replaced with:

http://www.mysite123.com/somefolder/somefile

I have this replace function to do the job:

$replaceStrs = array('href=/', "href='/", 'href="/');
$datdescription = str_ireplace($replaceStrs, 'href="http://www.' . $domain . "/", $datdescription);
  1. The problem is that it needs a / at the start of the value, so a URL like href=somefolder/somefile is not replaced.
  2. I would also like it to work when there are spaces before and/or after the = in the href part.

Point 1 is the most important. Can you help me improve this?

I have seen PHP examples that replace relative URLs with absolute URLs, like this one.

But they require that the relative URL is already known / found, and I have not managed that part (I am replacing all URLs in a textarea).

Vadim Kotov
Jens Kirk
  • Do your URLs come from bbcode/markdown or something like that? Or is it plain HTML? If it's plain HTML my answer should work. If it's not the quotes might be missing depending on bbcode/markdown syntax for urls. – Mihai Stancu May 20 '12 at 09:58
  • Thank you very much :-) It is plain HTML, no markdown :-) – Jens Kirk May 20 '12 at 12:45

3 Answers

PHP:

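// Callback: receives the match array and returns the rewritten attribute.
function expand_links($m) {
    // Strip the surrounding quotes, then any leading slashes (a trailing slash is kept).
    return 'href="http://example.com/' . ltrim(trim($m['href'], '\'"'), '/\\') . '"';
}
// preg_replace_callback() is used because the /e modifier was removed in PHP 7.
$textarea = preg_replace_callback('/href\s*=\s*(?<href>"[^"]*"|\'[^\']*\')/', 'expand_links', $textarea);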

I also changed the regex to work with either double quotes or apostrophes.
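
To see how a callback-based replacement behaves on the cases from the question, here is a minimal sketch. It is not the answer's code verbatim: the function name `absolutize_hrefs`, the `$base` parameter and the sample markup are assumptions made for illustration. It also accepts unquoted values and skips values that already look absolute (a scheme, a leading `//`, or a `#` fragment). Relative paths are resolved against the site root, which is an assumption as well.

```php
<?php
// Hypothetical helper: prefix relative href values with $base (the site root).
function absolutize_hrefs(string $html, string $base): string
{
    // Branch reset (?|...) puts the value in group 1 whether it was
    // double-quoted, single-quoted or unquoted. Spaces around "=" are allowed.
    $pattern = '/\bhref\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>"\']+))/i';

    return preg_replace_callback($pattern, function (array $m) use ($base) {
        $value = $m[1];
        // Leave URLs with a scheme, protocol-relative URLs and fragments untouched.
        if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $value)) {
            return $m[0];
        }
        // Join base and path with exactly one slash and re-quote the value.
        return 'href="' . rtrim($base, '/') . '/' . ltrim($value, '/') . '"';
    }, $html);
}

// Sample input (made up): leading slash, spaces around "=", unquoted, already absolute.
$sample = '<a href="/somefolder/somefile">1</a> '
        . '<a href = \'somefolder/somefile\'>2</a> '
        . '<a href=somefolder/somefile>3</a> '
        . '<a href="http://other.example/x">4</a>';

echo absolutize_hrefs($sample, 'http://www.mysite123.com'), "\n";
```

With this sketch, the first three links all end up as href="http://www.mysite123.com/somefolder/somefile", and the fourth is left as it was.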

Mihai Stancu
  • PCRE means Perl-compatible... which means JavaScript is compatible and so is PHP. I tested my answer in Firebug, which uses JavaScript. Now it's reformatted to PHP. – Mihai Stancu May 20 '12 at 09:54
  • Thank you :-) I also need it to handle relative urls that are not starting with a "/" like: **preg_replace('/href\s*=\s*"([^"]*?)"/', $domain . '$1', 'href="folder/file.html"');** – Jens Kirk May 20 '12 at 12:53
  • The regular expression did not look for slashes. It looks for **href space equals space anyquote** – Mihai Stancu May 20 '12 at 13:06
  • Thank you :-) How do I echo the new / replaced value? My code is: **bold** page1 ... bla bla page2 bla bla...."; preg_match('/href\s*=\s*(?<href>"[^"]*"|\'[^\']*\')/', $datdescription, $matches); $href = $domain.trim($matches['href'], '\'"/'); // warning if end slash stripped from href echo $datdescription; ?>**bold** – Jens Kirk May 20 '12 at 13:53
  • In other words: How do I echo the new / replaced textarea value? – Jens Kirk May 20 '12 at 14:04
  • echo $href; /* I had written this answer earlier without the comment, but SO decided it's too bloody short (and I didn't notice soon enough) */ – Mihai Stancu May 20 '12 at 14:09
  • When I echo $href I get **http://example.com/folder1/page1.html**. I need to echo the new and replaced value of the textarea ... please look for $datdescription :-) This is my textarea value – Jens Kirk May 20 '12 at 14:14
  • Not working fully yet :-) If I use **$textarea = "tester afadf adf page1 ... bla bla page2 bla bla....";** then I get urls like **http://example.com//'folder1/page1.html/**. Something is wrong. – Jens Kirk May 20 '12 at 14:56
  • @JensKirk Weird that only happens for single quotes around the link (double quotes around the entire ); preg_match must be escaping double quoted strings somehow. Done! – Mihai Stancu May 20 '12 at 15:15
  • Working fully :-D thank you very much :-) Have a nice day :-) – Jens Kirk May 20 '12 at 16:54
  • @Mihai Stancu - please see :-) http://stackoverflow.com/questions/10675771/php-replacing-absolute-urls-in-a-textarea – Jens Kirk May 20 '12 at 17:56
  • Is it really working? Really? How sure are you, considering we're talking about regular expressions? What if I replace the first slash of a path with `&frasl;`, will it keep working? – Christian May 20 '12 at 18:25
  • @Christian it doesn't use slashes to identify the urls. It looks for **href="anything not containing a quote or a backslash"**; if you put &frasl; in there it will match it as part of the url string. – Mihai Stancu May 20 '12 at 18:57
  • @Christian I don't understand what **considering we're talking about regular expressions** means. Are they imprecise? Incorrect? Inadvisable? What's wrong with regular expressions? (The way you said it makes it sound like they're bad altogether, not necessarily for this activity in particular; if I misinterpreted, I apologise.) – Mihai Stancu May 20 '12 at 19:07
  • @MihaiStancu Exactly my point! `&frasl;` is not a URL, it's a `/`, when an xml/dom parser reads it correctly. Regular expressions are a simple set of rules which you think are necessary at the moment, when in fact standards and formats are much more complex. Regex in this case is both inadvisable and over-generalized. I can see why Jens would(should?) want to parse this as what it really is, HTML. Not plain innocent text. – Christian May 20 '12 at 20:16
  • &frasl; is harmless in this case; worst case scenario it'll make the link not work and it can be post-processed. The regex, for example, doesn't allow any real backslash inside it (which, generally speaking, a url would not contain), so it would not allow the quoted string to be tampered with. – Mihai Stancu May 20 '12 at 20:25
Why all this fuss when a PHP function can already do this for you?

http://php.net/manual/en/function.http-build-url.php

PS: It seems it's only available on PECL. I just tested my Hostgator VPS (standard CentOS 5 repos) as well as my test WAMP environment, and it seems to be available on both.
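
Where the PECL extension is not installed, the same kind of joining can be approximated with the built-in parse_url(). The sketch below is an assumption-laden illustration, not a replacement for http_build_url(): the function name `resolve_url` is made up, the base URL is assumed to contain a scheme and host, and dot segments (`./`, `../`) and query strings on the base are not handled.

```php
<?php
// Hypothetical helper: resolve $rel against $base using only parse_url().
// Assumes $base has a scheme and a host; "../" segments are NOT normalized.
function resolve_url(string $base, string $rel): string
{
    // Already has a scheme (http:, mailto:, ...): return unchanged.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }

    $p = parse_url($base);
    $root = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');

    // Protocol-relative: reuse the scheme of the base.
    if (strpos($rel, '//') === 0) {
        return $p['scheme'] . ':' . $rel;
    }
    // Root-relative: append to scheme and host.
    if (strpos($rel, '/') === 0) {
        return $root . $rel;
    }
    // Document-relative: append to the directory part of the base path.
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $rel;
}

echo resolve_url('http://www.mysite123.com/events/list.php', '/somefolder/somefile'), "\n";
echo resolve_url('http://www.mysite123.com/events/list.php', 'somefolder/somefile'), "\n";
```

Note the difference between the two calls: the first is resolved against the site root, the second against the directory of the page the link came from, which is how browsers treat a path without a leading slash.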

NB: Also, you REALLY shouldn't blindly replace HTML fragments. First, it may eventually stop working (for example, because of encoding issues); second, it may introduce security issues into your code.
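
As an illustration of the parser-based alternative, here is a minimal sketch that uses PHP's DOMDocument instead of a regex. The function name `absolutize_with_dom`, the wrapper element id and the sample markup are assumptions. It presumes the DOM extension is loaded, resolves everything against the site root, and does no sanitizing of the markup. A parser hands back attribute values after entity decoding, so an entity-encoded slash is seen as a plain character.

```php
<?php
// Hypothetical helper: make relative href/src values absolute with DOMDocument.
// Assumes the DOM extension is available and $base is the site root.
function absolutize_with_dom(string $html, string $base): string
{
    $previous = libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    // The XML declaration is a common trick to make loadHTML() read the input as UTF-8;
    // the wrapper <div> lets us pull the fragment back out afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="dom-wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@href or @src]') as $el) {
        foreach (['href', 'src'] as $name) {
            if (!$el->hasAttribute($name)) {
                continue;
            }
            $value = trim($el->getAttribute($name));   // value is already entity-decoded
            // Skip empty values and anything with a scheme, a leading "//" or a fragment.
            if ($value === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $value)) {
                continue;
            }
            $el->setAttribute($name, rtrim($base, '/') . '/' . ltrim($value, '/'));
        }
    }

    // Serialize only the children of the wrapper, not the html/body shell the parser adds.
    $out = '';
    foreach ($xpath->query('//div[@id="dom-wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo absolutize_with_dom('<a href="/a/b">x</a> <img src="i/p.png">', 'http://example.com'), "\n";
```

The trade-off discussed in the comments still applies: the parser may normalize sloppy markup on the way out, so the returned fragment is not guaranteed to be byte-for-byte identical to the input outside the rewritten attributes.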

Christian
  • In order to use http_build_url() you first need to **have** the url. – Mihai Stancu May 20 '12 at 14:43
  • We're talking about a text area containing well... what ever people put in text areas... including HTML links which we must extract for processing. – Mihai Stancu May 20 '12 at 14:44
  • There is no security risk here, because I am using my own domains. The thing is that I have one domain with a calendar. And then I have another calendar on another domain where I transfer some events to, but the relative urls are not working “over there” (without having the first domain inserted). I am VERY open to a simpler solution :-D Can you help with a fully working example? – Jens Kirk May 20 '12 at 15:02
  • Mihai Stancu - Huh? He does have the URL, he just wants to **build** one. Regarding the HTML parsing part, it would be better to have something like DOM or XML parser going instead of regex. – Christian May 20 '12 at 18:19
  • Jens Kirk - Using your own domains is irrelevant here, if people can inject javascript/custom html etc, you're screwed. You could mitigate the risk by using something like DOM or XML parsing to only take the info you need (like the src/href attribs) and discard the rest (onclick etc...everything). – Christian May 20 '12 at 18:22
  • @Christian you are right that XSS attacks can happen. This script is only supposed to extract addresses, nothing more. It doesn't do any validation. That can be done separately but the topic here is not validation. – Mihai Stancu May 20 '12 at 18:59
  • And no, he does not have the URL; the url is contained in a long text string. Parsing that string with DOM manipulators is not always an option. Most WYSIWYG editors generate bad code with syntax errors, some use their own namespacing techniques without specifying DTDs or XSDs for them, some use illegal html entities. If you parsed them with an XML parser, for example, you'd get tons of parse errors. An HTML parser may be more lenient, but both incur greater performance penalties than regex does. – Mihai Stancu May 20 '12 at 19:04
  • @MihaiStancu So you're going to "fix" parser errors with regex? Doesn't xml/dom sound more appropriate to parse it than some highly specialized regex? – Christian May 20 '12 at 20:12
  • Dude, the OP started out with a regex, and I gave him the regex he needed. He's obviously inexperienced and if you ask me there's a million bugs and security flaws his app will suffer from, but he's learning from this experience. – Mihai Stancu May 20 '12 at 20:19
  • Now about you... you may be well-intentioned but try to be less invasive. And try to understand that the fact that you are a professional does not mean StackOverflow can make a professional out of somebody else in a matter of hours. Your input has been heard, it has been agreed that you are right, and right now the OP is discussing XML/DOM on another thread in order to complete/extend the solution discussed here. – Mihai Stancu May 20 '12 at 20:21
  • I simply pointed out why regex is inappropriate. Just because it works on the surface doesn't mean it's the best solution (nor that it will work in the future). Oh, on the regex matters, there's probably thousands of SO questions out there with answers advising one not to mess regex with html (https://www.google.com/search?btnG=1&pws=0&q=site%3Astackoverflow.com+html+regex+php) :) – Christian May 20 '12 at 20:24
  • Glad to see you smiling maybe we can get on the same boat now. :) – Mihai Stancu May 20 '12 at 20:26
I expanded Mihai Stancu's answer for you!


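<?php
// Each helper receives the quoted attribute value and the bare host name in $url.
function expand_hrefs($link, $url) {
    return 'href="http://' . $url . '/' . ltrim(trim($link, '\'"'), '/\\') . '"';
}

function expand_srcs($link, $url) {
    return 'src="http://' . $url . '/' . ltrim(trim($link, '\'"'), '/\\') . '"';
}

// preg_replace_callback() is used because the /e modifier was removed in PHP 7.
// $url and $html must be set before this point.
$html = preg_replace_callback('/href\s*=\s*(?<href>"[^"]*"|\'[^\']*\')/',
    function ($m) use ($url) { return expand_hrefs($m['href'], $url); }, $html);
$html = preg_replace_callback('/src\s*=\s*(?<src>"[^"]*"|\'[^\']*\')/',
    function ($m) use ($url) { return expand_srcs($m['src'], $url); }, $html);
?>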

This is MY first answer..

Stackoverflow.com is Brilliant!

Folding Circles