Regular expression for tag replacement

Question

I'm new to regular expressions, but I'm trying to learn about them. I want to remove the tags from an HTML text and keep only the inner text. Something like this:

Original: Lorem ipsum <a href="http://www.google.es">Google</a> Lorem ipsum <a href="http://www.bing.com">Bing</a>
Result:  Lorem ipsum Google Lorem ipsum Bing

I'm using this code:

$patterns = array( "/(<a href=\"[a-z0-9.:_\-\/]{1,}\">)/i", "/<\/a>/i");
$replacements = array("", "");

$text = 'Lorem ipsum <a href="http://www.google.es">Google</a> Lorem ipsum <a href="http://www.bing.com">Bing</a>';
$text = preg_replace($patterns,$replacements,$text);

It works, but I don't know whether this code is the most efficient or the most readable.

Can I improve the code in some way?

For a start, it won't do anything on `foo`, except replacing the closing tag. So as a way to sanitize input so that no links remain, this is a poor method. – Joey, Aug 03 '10 at 11:01

Pekka · Accepted Answer · 2010-08-03T11:09:03.267

7

In your case, PHP's strip_tags() should do exactly what you need without regular expressions. If you want to strip only a specific tag (something strip_tags() can't do by default), there is a function in the User Contributed Notes.
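As a minimal sketch of the strip_tags() suggestion, using the sample text from the question (the `$mixed` string and the variable names are made up for illustration):

```php
<?php
// Sample input taken from the question.
$text = 'Lorem ipsum <a href="http://www.google.es">Google</a> Lorem ipsum <a href="http://www.bing.com">Bing</a>';

// strip_tags() removes every HTML tag and keeps the inner text.
$plain = strip_tags($text);
echo $plain, PHP_EOL;
// Lorem ipsum Google Lorem ipsum Bing

// The optional second argument lists the tags to KEEP (an allow-list),
// so it cannot directly express "remove only <a>".
$mixed = '<p>See <a href="http://www.bing.com">Bing</a> for <b>more</b>.</p>';
$keepBold = strip_tags($mixed, '<b>');
echo $keepBold, PHP_EOL;
// See Bing for <b>more</b>.
```

Because the second argument is an allow-list, removing a single tag type means either listing every other tag you want to keep or using a different approach.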

In general, regexes are not suitable for parsing HTML. It's better to use a DOM parser like Simple HTML DOM or one of PHP's built-in parsers.
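For the built-in parser route, here is a rough sketch using PHP's DOM extension (DOMDocument). The helper name `unwrapAnchors` and the wrapper `<div>` are assumptions made for this example, not part of the original answer, and encoding handling for non-ASCII input is left out.

```php
<?php
// Hypothetical helper: replaces every <a> element with its text content.
function unwrapAnchors(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since real-world fragments are often not valid HTML.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so its contents can be found again.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array before changing the tree.
    $anchors = iterator_to_array($doc->getElementsByTagName('a'));
    foreach ($anchors as $a) {
        $a->parentNode->replaceChild($doc->createTextNode($a->textContent), $a);
    }

    // Serialize only the children of the wrapper <div>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Extra whitespace and an extra attribute do not matter to the parser.
$text = 'Lorem ipsum <a href="http://www.google.es">Google</a> Lorem ipsum <a  class="ext" href="http://www.bing.com" >Bing</a>';
echo unwrapAnchors($text), PHP_EOL;
// Lorem ipsum Google Lorem ipsum Bing
```

The parser takes care of attribute order, quoting, and whitespace, which is exactly the part a hand-written pattern tends to get wrong.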

edited Aug 03 '10 at 11:09

answered Aug 03 '10 at 11:01

Pekka


score 5 · Answer 2 · edited May 23 '17 at 10:32

Don't use regular expressions; use a DOM parser instead.

edited May 23 '17 at 10:32

Community

answered Aug 03 '10 at 11:02

You

3

That should read *Don't use regular expressions* **for parsing (x)HTML**. It's not like they are totally useless ;) – Gordon Aug 03 '10 at 11:33

score 2 · Answer 3 · answered Aug 03 '10 at 11:03

If your content only contains anchor tags, then strip_tags is probably easier to use.

Your preg_replace won't match the opening tag if there is extra whitespace between a and href, or if the tag has any other attributes.
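The two failure cases described above can be reproduced with the patterns from the question. The input strings below are made up for illustration:

```php
<?php
// The two patterns from the question, unchanged.
$patterns = array("/(<a href=\"[a-z0-9.:_\-\/]{1,}\">)/i", "/<\/a>/i");
$replacements = array("", "");

// Hypothetical inputs chosen to exercise the cases described above.
$inputs = array(
    'plain link'      => '<a href="http://www.bing.com">Bing</a>',
    'extra space'     => '<a  href="http://www.bing.com">Bing</a>',
    'extra attribute' => '<a class="ext" href="http://www.bing.com">Bing</a>',
);

$results = array();
foreach ($inputs as $label => $html) {
    $results[$label] = preg_replace($patterns, $replacements, $html);
    echo $label, ': ', $results[$label], PHP_EOL;
}
// plain link: Bing
// extra space: <a  href="http://www.bing.com">Bing
// extra attribute: <a class="ext" href="http://www.bing.com">Bing
```

In the two failing cases only the closing tag is removed, so the output is left with an unclosed opening tag.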

answered Aug 03 '10 at 11:03

Mark Baker

score 2 · Answer 4 · answered Aug 03 '10 at 11:47

In this case, using regex is not a good idea. Having said that:

<?php
    $text = 'Lorem ipsum <a href="http://www.google.es">Google</a> Lorem ipsum <a href="http://www.bing.com">Bing</a>';
    // Match an opening <a ...> tag, lazily capture the inner text,
    // then match the closing </a> tag and keep only the captured text.
    $text = preg_replace(
        '@<a\b[^>]*>(.*?)</a\b[^>]*>@',
        '$1',
        $text
    );
    echo $text;
    // Lorem ipsum Google Lorem ipsum Bing
?>

This is a very simple regex; it's not bulletproof.
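Two ways the pattern can miss, sketched with made-up inputs: it is case-sensitive, and `.` does not match a newline by default. Adding the `i` and `s` modifiers covers these two cases, though that still does not make it a robust HTML parser.

```php
<?php
// Equivalent to the pattern in this answer (redundant backslashes omitted).
$strict = '@<a\b[^>]*>(.*?)</a\b[^>]*>@';
// Same pattern with i (case-insensitive) and s (dot also matches newline).
$relaxed = '@<a\b[^>]*>(.*?)</a\b[^>]*>@is';

// Hypothetical inputs: uppercase tag names, and link text spanning two lines.
$upper = '<A HREF="http://www.bing.com">Bing</A>';
$multiline = "<a href=\"http://www.bing.com\">Bing\nSearch</a>";

$upperStrict = preg_replace($strict, '$1', $upper);          // unchanged
$multilineStrict = preg_replace($strict, '$1', $multiline);  // unchanged
$upperRelaxed = preg_replace($relaxed, '$1', $upper);        // "Bing"
$multilineRelaxed = preg_replace($relaxed, '$1', $multiline); // "Bing\nSearch"

var_dump($upperStrict === $upper, $multilineStrict === $multiline);
echo $upperRelaxed, PHP_EOL, $multilineRelaxed, PHP_EOL;
```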

score 0 · Answer 5 · edited May 23 '17 at 10:33

You can't parse [X]HTML with regex.

edited May 23 '17 at 10:33

Community

answered Aug 03 '10 at 11:04

Mizipzor

Actually, you can't. (X)HTML is not a regular language, and as such can't be parsed by regular expressions. – You Aug 03 '10 at 12:31
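One way to see the problem in practice is nesting, which a single pattern of this kind cannot track. The input below is a made-up example, and the pattern has the same shape as the ones in this thread but targets `<div>`, since anchors do not normally nest:

```php
<?php
// Hypothetical input with one <div> nested inside another.
$nested = '<div>a<div>b</div>c</div>';

// The lazy group stops at the FIRST closing tag, which belongs to the
// inner <div>, so the outer opening tag is paired with the wrong close.
$result = preg_replace('@<div\b[^>]*>(.*?)</div>@s', '$1', $nested);

echo $result, PHP_EOL;
// a<div>bc</div>
```

The result is unbalanced markup: the outer opening tag and the inner closing tag were removed, while the inner opening tag and the outer closing tag remain.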
