I'm using the following pattern in preg_replace_callback() to capture links and turn them into HTML links. For the most part it works.

"#(https?|ftp)://(\S+[^\s.,>)\];'\"!?])#"

But this pattern fails when the text reads like so:

http://mylink.com/page[/b]

At that point it captures the [/b, assuming it is part of the link, resulting in this:

<a href="http://woodmill.co.uk[/b">woodmill.co.uk[/b</a>]

I've looked over the pattern and used some cheat sheets to try to follow what is happening, but it has foxed me. Can any of you code ninjas help?

JDB
mattauckland
  • Can you explain in plain language what your matching criteria are? Is your intent simply to capture the portion of the URL up to the first illegal character (i.e. one not allowed in a URL), because your URLs don't necessarily have whitespace after them? – Mike Brant Jan 19 '13 at 01:24
  • @MikeBrant In simple terms I wanted to capture a URL as long as it didn't end with a full stop or a comma. So http://mydomain.com/page would be fine, but http://mydomain.com/page. would fail. It is intended to be part of a CMS, and I did find a solution shortly after posting this question (doh!) in the form of a new, lengthier pattern I found at this question: [link](http://stackoverflow.com/questions/12352635/making-a-url-regex-global/14410248#14410248) – mattauckland Jan 19 '13 at 01:41
  • You should post an answer **to your question**, not to someone else's question. – JDB Jan 19 '13 at 02:36
  • @Cyborgx37 Maybe, but after further testing of the pattern from the other question, I found it still breaks. So it isn't a solution after all. – mattauckland Jan 20 '13 at 01:05

2 Answers

Try adding the open square bracket to your character class:

(\S+[^\s.,>)[\];'\"!?])
            ^
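One caveat worth checking: the trailing character class only constrains the last character of the match, and for `http://mylink.com/page[/b]` the last captured character is `b`, not `[`. A minimal sketch, assuming your links never legitimately contain square brackets, is to exclude brackets from the body of the match as well. The variable names and the `$bracketSafe` pattern are illustrative, not something from the question.

```php
<?php
// Sample input from the question: a URL immediately followed by a BBCode closing tag.
$text = 'http://mylink.com/page[/b]';

// Original pattern: \S+ is greedy, so it swallows "[/b]" and then backtracks
// only far enough for the last character to satisfy the trailing class.
$original = "#(https?|ftp)://(\S+[^\s.,>)\];'\"!?])#";

// Assumed alternative: forbid square brackets anywhere in the URL body.
$bracketSafe = "#(https?|ftp)://([^\s\[\]]+[^\s.,>)\[\];'\"!?])#";

preg_match($original, $text, $before);
preg_match($bracketSafe, $text, $after);

echo $before[0], "\n"; // http://mylink.com/page[/b
echo $after[0], "\n";  // http://mylink.com/page
```

The trade-off is that URLs which really do contain `[` or `]` would be cut short, so this only suits text where brackets always belong to BBCode.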

UPDATE

Try this more effective URL regex:

^(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w \.-]*)*/?$

(From: http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/)

I have no direct experience with PHP regular expressions, but the above is simple and generic enough that I wouldn't expect any problems. You may want to modify it somewhat to extract just the domain, as you seem to be doing with your current regex.
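Note that this pattern is anchored with `^` and `$`, so as written it validates a whole string rather than finding URLs inside longer text. A quick check, assuming `#` as the delimiter and the `i` flag (both are my choices, not part of the quoted pattern):

```php
<?php
// The quoted pattern wrapped in # delimiters; the "i" flag is an assumption.
$urlPattern = '#^(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w \.-]*)*/?$#i';

// A string that is nothing but a URL matches.
$wholeString = preg_match($urlPattern, 'http://mylink.com/page');

// A URL embedded in a sentence does not, because ^ and $ anchor the whole string.
$embedded = preg_match($urlPattern, 'Visit http://mylink.com/page today');

var_dump($wholeString); // int(1)
var_dump($embedded);    // int(0)
```

To use it with preg_replace_callback() on free text, the anchors would have to be removed or replaced with something like word boundaries.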

JDB

OK, I solved the problem. Thanks to @Cyborgx37 and @MikeBrant for your help. Here's the solution.

Firstly, I replaced my regexp pattern with the one that João Castro used in this question: Making a url regex global

The problem with that pattern is that it captured any trailing dots, so in the final section of the pattern I added ^. making the final part look like this: [^\s^.]. As I read it, that means: do not match a trailing space or dot.
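A small aside on that class: inside square brackets only a leading `^` negates, and any later `^` is a literal caret. So `[^\s^.]` also excludes a caret, and `[^\s.]` would be enough for "not a space or dot". A quick check (the test strings are just illustrations):

```php
<?php
// With the second ^ in the class, a literal caret is rejected.
$caretWithExtra = preg_match('/^[^\s^.]$/', '^');

// Without it, a caret is accepted.
$caretWithout = preg_match('/^[^\s.]$/', '^');

// A dot is rejected either way.
$dotWithExtra = preg_match('/^[^\s^.]$/', '.');

var_dump($caretWithExtra); // int(0)
var_dump($caretWithout);   // int(1)
var_dump($dotWithExtra);   // int(0)
```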

This still caused an issue with matching BBCode, as I mentioned above, so I used preg_replace_callback() and create_function() to filter it out. The final create_function() looks like this:

create_function('$match','
                $match[0] = preg_replace("/\[\/?(.*?)\]/", "", $match[0]);
                $match[0] = preg_replace("/\<\/?(.*?)\>/", "", $match[0]);
                $m = trim(strtolower($match[0]));
                $m = str_replace("http://", "", $m);
                $m = str_replace("https://", "", $m);
                $m = str_replace("ftp://", "", $m);
                $m = str_replace("www.", "", $m);

                if (strlen($m) > 25)
                {
                    $m = substr($m, 0, 25) . "...";
                }

                return "<a href=\"$match[0]\" target=\"_blank\">$m</a>";
'), $string);
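For anyone on a newer PHP: create_function() was deprecated in PHP 7.2 and removed in PHP 8.0, so the same callback can be written as an anonymous function. This is only a sketch of the logic above. The `linkify()` wrapper, the variable names, and the htmlspecialchars() escaping are additions, and the demo uses the pattern from the question rather than the final one, which isn't shown here.

```php
<?php
// Hypothetical helper: wraps each URL matched by $pattern in an anchor tag.
function linkify(string $text, string $pattern): string
{
    return preg_replace_callback($pattern, function (array $match): string {
        // Strip any BBCode or HTML tags that were captured along with the URL.
        $url = preg_replace('/\[\/?(.*?)\]/', '', $match[0]);
        $url = preg_replace('/<\/?(.*?)>/', '', $url);

        // Build a short display label without the scheme or "www.".
        $label = trim(strtolower($url));
        $label = str_replace(['http://', 'https://', 'ftp://', 'www.'], '', $label);
        if (strlen($label) > 25) {
            $label = substr($label, 0, 25) . '...';
        }

        // Escaping is an addition to the original, to keep stray quotes out of the markup.
        $href = htmlspecialchars($url, ENT_QUOTES);
        $text = htmlspecialchars($label, ENT_QUOTES);
        return '<a href="' . $href . '" target="_blank">' . $text . '</a>';
    }, $text);
}

// Assumed pattern for the demo: the one from the question.
$pattern = "#(https?|ftp)://(\S+[^\s.,>)\];'\"!?])#";
echo linkify('See http://www.mylink.com/page.', $pattern), "\n";
// See <a href="http://www.mylink.com/page" target="_blank">mylink.com/page</a>.
```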

Tests so far are looking good, so I'm happy it is now solved.

Thanks again, and I hope this helps someone else :)

Community
mattauckland