
I've been searching for a regex to replace plain-text URLs in a string (the string can contain more than one URL) with:

 <a href="url">url</a>

and I found this: http://mathiasbynens.be/demo/url-regex

I would like to use diegoperini's regex (which, according to the tests, is the best):

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

But I want to make it global so that it replaces all the URLs in a string. When I use this:

/_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS/g

It does not work. How do I make this regex global, and what do the underscore at the beginning and the "_iuS" at the end mean?

I would like to use it with PHP, so I am using:

preg_replace($regex, '<a href="$0">$0</a>', $examplestring);
João Castro
  • Sorry, I did not understand that last comment. What's define? – João Castro Sep 10 '12 at 13:40
  • `preg_replace` replaces all occurrences by default, you should be able to just remove the `^` and `$` anchors. – verdesmarald Sep 10 '12 at 13:46
  • Yes, you are right. I was using a URL like this: www.google.pt, as the second URL in the text string, and I thought it was not being replaced because only the first match was replaced, but it turns out the regex doesn't match URLs like that. – João Castro Sep 10 '12 at 14:06

2 Answers

The underscores are the regex delimiters; i, u and S are pattern modifiers:

i (PCRE_CASELESS)

If this modifier is set, letters in the pattern match both upper and lower 
case letters.

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible 
with Perl. Pattern and subject strings are treated as UTF-8. (Not to be 
confused with the upper-case U, PCRE_UNGREEDY, which inverts the greediness 
of the quantifiers.)

S

When a pattern is going to be used several times, it is worth spending more 
time analyzing it in order to speed up the time taken for matching. If this 
modifier is set, then this extra analysis is performed. At present, studying 
a pattern is useful only for non-anchored patterns that do not have a single 
fixed starting character.
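As a rough illustration of the delimiter and of the i and u modifiers, here is a minimal sketch. The mini-pattern and the sample string are made up for this example; they only borrow the `\x{00a1}-\x{ffff}` range from the URL regex.

```php
<?php
// Hypothetical mini-pattern: "_" delimits it, the modifiers follow the closing "_".
$pattern = '_[a-z\x{00a1}-\x{ffff}]+_iu';

// "i": upper-case ASCII letters match although the class only lists a-z.
// "u": pattern and subject are read as UTF-8, so U+00C9 falls inside
//      the \x{00a1}-\x{ffff} range.
preg_match($pattern, "CAF\u{00C9} 123", $m);
echo $m[0], PHP_EOL;

// Without "u" the code point \x{ffff} is too large for the pattern to compile,
// so preg_match() fails (the warning is suppressed with @ here).
$withoutU = @preg_match('_[a-z\x{00a1}-\x{ffff}]+_i', "CAF\u{00C9} 123");
var_dump($withoutU);
```

The `"\u{00C9}"` escape needs PHP 7 or later; it is used only so the example does not depend on the encoding of the source file.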

For more information, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

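When you wrapped the pattern in / ... /g, the slash became the regex delimiter, so the first / inside the pattern (the one in ://) ends it and everything after that is read as modifiers. On top of that, the g modifier does not exist in PHP's PCRE functions. That's why it did not work.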
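A minimal sketch of that behaviour follows. It assumes a much simplified stand-in for the full URL pattern (to keep the example readable) and an invented sample string; neither comes from the question.

```php
<?php
// Simplified stand-in for the full URL pattern (an assumption, for readability).
// "_" is the delimiter, "i" and "u" are modifiers; there is no "g" and no ^/$ anchor.
$regex = '_(?:https?|ftp)://[^\s/]+(?:/[^\s]*)?_iu';
$text  = 'See http://example.com/a and https://example.org/b for details.';

// Every match is replaced unless the optional 4th argument ($limit) caps the count.
$all       = preg_replace($regex, '<a href="$0">$0</a>', $text);
$firstOnly = preg_replace($regex, '<a href="$0">$0</a>', $text, 1);

// Wrapped in / ... /g the pattern ends at the first "/" of "://", the rest is
// parsed as modifiers, and preg_replace() fails (warning suppressed with @ here).
$broken = @preg_replace('/' . $regex . '/g', '<a href="$0">$0</a>', $text);

echo $all, PHP_EOL;
echo $firstOnly, PHP_EOL;
var_dump($broken); // NULL signals the failure
```

So "global" is already the default in `preg_replace`, and limiting the number of replacements is what needs an extra argument.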

Oussama Jilal

I agree with @verdesmarald and used this pattern in the following function:

$string = preg_replace_callback(
        "_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS",
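        // A closure replaces create_function(), which was removed in PHP 8.0.
        function ($match) {
            // Build a short, lower-cased link label without the scheme or "www.".
            $m = trim(strtolower($match[0]));
            $m = str_replace("http://", "", $m);
            $m = str_replace("https://", "", $m);
            $m = str_replace("ftp://", "", $m);
            $m = str_replace("www.", "", $m);

            // Truncate long labels; the href keeps the full matched URL.
            if (strlen($m) > 25)
            {
                $m = substr($m, 0, 25) . "...";
            }

            return "<a href=\"$match[0]\">$m</a>";
        }, $string);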

    return $string;

It seems to do the trick and resolves an issue I was having. As @verdesmarald said, removing the ^ and $ characters allowed the pattern to work even in my preg_replace_callback().

The only thing that concerns me is how efficient the pattern is. If used in a busy, high-traffic web app, could it cause a bottleneck?

UPDATE

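The above regex pattern breaks if there is a trailing dot at the end of the path section of a URL, as in "http://www.mydomain.com/page." at the end of a sentence. To solve this I modified the final part of the regex pattern by adding ^. to the character class, making it [^\s^.]. I read this as "do not match a trailing space or dot", but note that inside a character class only the first ^ negates; the second one is a literal caret, so the class actually rejects whitespace, carets and every dot in the path, not just a trailing one.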

In my tests so far it seems to be working fine.
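To make the effect of that character class concrete, here is a small sketch. Both patterns are simplified stand-ins for the full regex (an assumption to keep the example short), and the lookbehind variant is a hypothetical alternative, not something taken from this answer.

```php
<?php
// Simplified stand-ins for the full pattern; only the path part differs.
// [^\s^.]*   : the class from the update; rejects whitespace, "^" and every ".".
$noDots = '_(?:https?|ftp)://[^\s/]+(?:/[^\s^.]*)?_iu';
// \S*(?<!\.) : hypothetical alternative; any path, but the match may not end in ".".
$noTrailingDot = '_(?:https?|ftp)://[^\s/]+(?:/\S*(?<!\.))?_iu';

$text = 'Read http://www.mydomain.com/page.html.';

preg_match($noDots, $text, $a);
preg_match($noTrailingDot, $text, $b);

echo $a[0], PHP_EOL; // path is cut at the first dot
echo $b[0], PHP_EOL; // only the sentence-ending dot is left out
```

The lookbehind lets the greedy `\S*` back off one character at a time until the match no longer ends in a dot, so file extensions inside the path survive.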

mattauckland
    Please post this as self answer to your question http://stackoverflow.com/questions/14410134/preg-replace-callback-pattern-issue rather than this question. – nhahtdh Jan 19 '13 at 02:15
  • @nhahtdh Not now I won't, as I've found with further testing this pattern also breaks when faced with a url and bold closing tag like so: `http://mydomain/contact` and if you take the backticks out you'll notice that stackoverflow also has the same fault, like so: http://mydomain/contact – mattauckland Jan 20 '13 at 01:08