54

I have this html code:

<p style="padding:0px;">
  <strong style="padding:0;margin:0;">hello</strong>
</p>

How can I remove attributes from all tags? I'd like it to look like this:

<p>
  <strong>hello</strong>
</p>
Andres SK

10 Answers

179

Adapted from my answer on a similar question

$text = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello</strong></p>';

echo preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si",'<$1$2>', $text);

// <p><strong>hello</strong></p>

The RegExp broken down:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]        # Match 'a' through 'z'
  [a-z0-9]*    # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 [^>]*?        # Match anything other than '>', zero or more times, non-greedy (won't eat the /)
 (\/?)         # Capture Group $2 - '/' if it is there
 >             # Match '>'
/is            # End Pattern - Case-insensitive (i) & dot-matches-newline (s)

With some quoting added and the replacement text <$1$2>, it should strip any text after the tag name up to the end of the tag (/> or just >).

Please note: this isn't necessarily going to work on ALL input, as the Anti-HTML + RegExp will tell you. There are a few pitfalls, most notably <p style=">"> would end up as <p>">, along with a few other breakages. I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.
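As a minimal sketch of the behaviour described above, the pattern can be wrapped in a small function and run against both a well-behaved input and the `>`-inside-an-attribute case. The function name `stripAttributesRegex` is made up for illustration; only the pattern and replacement come from the answer.

```php
<?php
// Hypothetical wrapper around the answer's pattern; the name is illustrative only.
function stripAttributesRegex(string $html): string
{
    // $1 = tag name, $2 = optional trailing "/" of a self-closing tag.
    return preg_replace('/<([a-z][a-z0-9]*)[^>]*?(\/?)>/si', '<$1$2>', $html);
}

// Regular input: attributes are dropped, the self-closing slash is kept.
echo stripAttributesRegex('<p style="padding:0px;"><br class="x"/>hello</p>'), "\n";
// <p><br/>hello</p>

// Known failure: a ">" inside an attribute value ends the match too early.
echo stripAttributesRegex('<p style=">">hello</p>'), "\n";
// <p>">hello</p>
```

The second call shows why the regex is only reasonable for input whose shape is known in advance.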

T.Todua
gnarf
  • 12
    Shouldn't use Regular Expressions on HTML – 472084 Nov 13 '11 at 14:12
  • 8
    @Jleagle are you serious? There is already a comment IN THE ANSWER mentioning ways to break this regular expression while parsing HTML. There are times when parsing HTML with a regexp is plenty fine (for example, when the HTML is generated by some known system and is therefore quite regular). If you are going to comment something about not parsing HTML with Regular Expressions, at least add something that isn't already stated in the answer. – gnarf Nov 13 '11 at 22:22
  • 1
    I have something like this I want the src to be retained because what the code is doing, it is deleting all the attributes. You have any idea with this? :) – PinoyStackOverflower May 21 '12 at 11:02
  • Could any of the fallbacks cause a security problem? Would something like this `$some_tags_filtered = strip_tags($_POST['message'], '

    ');` combined with your method to remove the attributes be safe from XSS attacks?

    – Dan Bray Mar 15 '16 at 21:20
  • 2
    if you know the tags you could do something like `$plain_value = preg_replace("/<(p|br)[^>]*?(\/?)>/i",'<$1>', $plain_value);` – mikewasmike Apr 01 '16 at 09:46
  • Better way: Find [(?si)(?:<(\[\w:\]+)\s+(?:".*?"|'.*?'|(?:(?!/>)\[^>\])?)+(/?)>|<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>".*?"|'.*?'|(?:(?!/>)\[^>\])?)+)?\s*>).*?\3\s*(?=>))|(?:/?\[\w:\]+\s*/?)|(?:\[\w:\]+\s+(?:".*?"|'.*?'|\[^>\]?)+\s*/?)|\?.*?\?|(?:!(?:(?:DOCTYPE.*?)|(?:\\[CDATA\\[.*?\\]\\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?))))>(*SKIP)(?!))](https://regex101.com/r/IoQrAW/1) Replace `<$1$2>` –  Nov 12 '19 at 21:55
  • Watch this answer's pattern fail: https://3v4l.org/6sTsa just because the answer admits that it has flaws doesn't mean the flaws should be tolerated. There are far more reliable answers on this page. Furthermore, this pattern is needlessly using a non-greedy quantifier and the `s` pattern modifier serves absolutely no purpose. – mickmackusa Jan 15 '21 at 21:20
  • Using [an online tool that is DOM-sensitive](https://html-cleaner.com/) reminded me how challenging these tasks can be. What I really wanted was to remove all _size_ tags such as `width=` while keeping the `colspan` tags in my table.... – Josiah Yoder Sep 09 '22 at 16:52
  • This worked for me but it doesn't handle custom tags in HTML5 very well. These typically start with text then a hyphen, to distinguish them from native HTML5 tags. I changed the pattern to `"/<([a-z][a-z0-9-]*)[^>]*?(\/?)>/si"` and it handled custom tags as well. – Dave Child Mar 02 '23 at 09:06
83

Here is how to do it with native DOM:

$dom = new DOMDocument;                 // init new DOMDocument
$dom->loadHTML($html);                  // load HTML into it
$xpath = new DOMXPath($dom);            // create a new XPath
$nodes = $xpath->query('//*[@style]');  // Find elements with a style attribute
foreach ($nodes as $node) {              // Iterate over found elements
    $node->removeAttribute('style');    // Remove style attribute
}
echo $dom->saveHTML();                  // output cleaned HTML

If you want to remove all possible attributes from all possible tags, do

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
    $node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML();
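
The second snippet can be extended to keep a whitelist of attributes. The following is a minimal sketch, not part of the original answer: the function name `removeAttributesExcept` and its `$keep` parameter are made up for illustration, and the `LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD` flags are added on the assumption that the input is a fragment with a single root element and should come back without an added doctype or `<html><body>` wrapper.

```php
<?php
// Hypothetical helper: removes every attribute except those named in $keep.
function removeAttributesExcept(string $html, array $keep = []): string
{
    $dom = new DOMDocument;
    // Assumption: $html is a fragment with one root element; these flags
    // prevent libxml from adding a doctype and <html><body> wrappers.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    // '//@*' selects every attribute node in the document.
    foreach ($xpath->query('//@*') as $attr) {
        if (!in_array($attr->nodeName, $keep, true)) {
            $attr->parentNode->removeAttribute($attr->nodeName);
        }
    }
    // saveHTML() appends a trailing newline, so trim it.
    return trim($dom->saveHTML());
}

echo removeAttributesExcept('<p class="x"><a href="/a" title="t">link</a></p>', ['href']);
// <p><a href="/a">link</a></p>
```

Calling it with an empty `$keep` array behaves like the second snippet above and strips everything.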
eozzy
Gordon
  • This does not seem to work when attempting to remove "width" and "height" attributes from HTML tags. Removing any other attribute works fine, except these. –  Oct 06 '15 at 10:42
  • 1
    How would I remove all tag attributes except `href` using this method? – HelpingHand Mar 08 '16 at 16:30
  • 1
    @HelpingHand `$node->nodeName` contains the attribute name. So you can either `if` on that or change the XPath to `//@*[not(name() = "href")]` – Gordon Mar 09 '16 at 13:38
  • Why is `removeAttribute` called directly on the node in the first case, but on `parentNode` in the second example? Maybe a mechanical mistake in the example? – T.Todua Apr 11 '20 at 12:58
  • @T.Todua in the first example the XPath returns all element nodes (DOMElement) that have a style attribute. In the second example, the XPath returns all attribute nodes (DOMAttr). So you need to traverse to the parent element in order to remove the attribute. – Gordon Apr 14 '20 at 09:24
  • @HelpingHand https://stackoverflow.com/a/65744326/2943403 – mickmackusa Jan 15 '21 at 22:37
  • For me this is the clever way to do it – AugustoM Aug 22 '23 at 13:49
10

I would avoid using regex, as HTML is not a regular language, and instead use an HTML parser like Simple HTML DOM.

You can get a list of attributes that the object has by using attr. For example:

$html = str_get_html('<div id="hello">World</div>');
var_dump($html->find("div", 0)->attr);
/*
array(1) {
  ["id"]=>
  string(5) "hello"
}
*/

foreach ( $html->find("div", 0)->attr as &$value ){
    $value = null;
}

print $html;
//<div>World</div>
Yacoby
3
$html_text = '<p>Hello <b onclick="alert(123)" style="color: red">world</b>. <i>Its beautiful day.</i></p>';
$strip_text = strip_tags($html_text, '<b>');
$result = preg_replace('/<(\w+)[^>]*>/', '<$1>', $strip_text);
echo $result;

// Result: Hello <b>world</b>. Its beautiful day.
  • The OP doesn't want to strip any tags -- that part is inappropriate. Your regex pattern mutilates the DOM when the innerhtml of a tag contains `<` then an alphanumeric character. Proof: https://3v4l.org/au5fi – mickmackusa Jan 15 '21 at 21:11
3

Another way to do it with PHP's DOMDocument class (without XPath) is to iterate over the attributes on a given node. Please note that, due to the way PHP handles the DOMNamedNodeMap class, you must iterate backward over the collection if you plan on altering it. This behaviour has been discussed elsewhere and is also noted in the documentation comments. The same applies to the DOMNodeList class when it comes to removing or adding elements. To be on the safe side, I always iterate backwards with these objects.

Here is a simple example:

function scrubAttributes($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    for ($els = $dom->getElementsByTagName('*'), $i = $els->length - 1; $i >= 0; $i--) {
        for ($attrs = $els->item($i)->attributes, $ii = $attrs->length - 1; $ii >= 0; $ii--) {
            $els->item($i)->removeAttribute($attrs->item($ii)->name);
        }
    }
    return $dom->saveHTML();
}

Here's a demo: https://3v4l.org/M2ing

  • 1
    These "LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD" are important parameters to recover the same HTML you're giving the function. Thanks! – Elber CM Sep 19 '22 at 15:45
1

Optimized regular expression from the top rated answer on this issue:

$text = '<div width="5px">a is less than b: a<b, ya know?</div>';

echo preg_replace("/<([a-z][a-z0-9]*)[^<>]*?(\/?)>/si",'<$1$2>', $text);

// <div>a is less than b: a<b, ya know?</div>

UPDATE:

It works better when combined with PHP's strip_tags() function so that only some tags are allowed. Let's say we want to allow only <br>, <b> and <i> tags:

$text = '<i style=">">Italic</i>';

$text = strip_tags($text, '<br><b><i>');
echo preg_replace("/<([a-z][a-z0-9]*)[^<|>]*?(\/?)>/si",'<$1$2>', $text);

//<i>Italic</i>

As we can see it fixes flaws connected with tag symbols in attribute values.

fractal512
0

Hope this helps. It may not be the fastest way to do it, especially for large blocks of html. If anyone has any suggestions as to make this faster, let me know.

function StringEx($str, $start, $end)
{ 
    $str_low = strtolower($str);
    $pos_start = strpos($str_low, $start);
    $pos_end = strpos($str_low, $end, ($pos_start + strlen($start)));
    if($pos_end==0) return false;
    if ( ($pos_start !== false) && ($pos_end !== false) )
    {  
        $pos1 = $pos_start + strlen($start);
        $pos2 = $pos_end - $pos1;
        $RData = substr($str, $pos1, $pos2);
        if($RData=='') { return true; }
        return $RData;
    } 
    return false;
}

$S = '<'; $E = '>'; while($RData=StringEx($DATA, $S, $E)) { if($RData==true) {$RData='';} $DATA = str_ireplace($S.$RData.$E, '||||||', $DATA); } $DATA = str_ireplace('||||||', $S.$E, $DATA);
Brandon Orth
  • Please show an online demo of this technique in action. When I ran it with my own sample html, the script suffered an infinite loop. Also, why would you use `str_ireplace()` when replacing a sequence of pipes? – mickmackusa Jan 15 '21 at 21:06
0

Regexes are too fragile for HTML parsing. For your example, the following would strip out the attributes:

echo preg_replace(
    "|<(\w+)([^>/]+)?|",
    "<$1",
    "<p style=\"padding:0px;\">\n<strong style=\"padding:0;margin:0;\">hello</strong>\n</p>\n"
);

Update

Make the second capture optional and do not strip '/' from closing tags:

|<(\w+)([^>]+)| to |<(\w+)([^>/]+)?|

Demonstrating that this regular expression works:

$ phpsh
Starting php
type 'h' or 'help' to see instructions & features
php> $html = '<p style="padding:0px;"><strong style="padding:0;margin:0;">hello<br/></strong></p>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<p><strong>hello<br/></strong></p>
php> $html = '<strong>hello</strong>';
php> echo preg_replace("|<(\w+)([^>/]+)?|", "<$1", $html);
<strong>hello</strong>
Greg K
  • this one has a bug, if there is only hello it returns hello – Andres SK Jun 11 '10 at 22:05
  • Updated regex to address issues identified in comments – Greg K Jan 14 '12 at 14:44
  • This regex completely hoses hyperlinks. `link` becomes `link`. – Joseph Leedy May 22 '14 at 14:48
  • Regex is inappropriate for parsing html. Watch this answer's pattern fail: https://3v4l.org/Spisv – mickmackusa Jan 15 '21 at 21:00
  • @mickmackusa: First thing stated is "Regex's are too fragile for HTML parsing", re your example - wouldn't the "<" be HTML entity encoded as content within tags? https://3v4l.org/7QIZU – Greg K Jan 21 '21 at 12:57
  • 1. So why would you knowingly give researchers fragile advice? 2. We don't know if this is a scraped html document or a manually crafted string that does not have encoded attribute values. – mickmackusa Jan 22 '21 at 10:22
  • 1
    This answer is a decade old, given with best intentions. Why have you picked up on an answer half way down a graveyard question? – Greg K Jan 24 '21 at 21:49
-1

To do SPECIFICALLY what andufo wants, it's simply:

$html = preg_replace( "#(<[a-zA-Z0-9]+)[^\>]+>#", "\\1>", $html );

That is, he wants to strip anything but the tag name out of the opening tag. It won't work for self-closing tags of course.

Sp4cecat
  • This breaks when any tag's innerhtml contains `>` then an alphanumeric character. Proof: https://3v4l.org/nnpc2 – mickmackusa Jan 15 '21 at 20:56
-1

Here's an easy way to get rid of attributes. It handles malformed html pretty well.

<?php
  $string = '<p style="padding:0px;">
    <strong style="padding:0;margin:0;">hello</strong>
    </p>';

  //get all html elements on a line by themselves
  $string_html_on_lines = str_replace (array("<",">"),array("\n<",">\n"),$string); 

  //find lines starting with a '<' and any letters or numbers up to the first space. throw everything after the space away.
  $string_attribute_free = preg_replace("/\n(<[\w123456]+)\s.+/i","\n$1>",$string_html_on_lines);

  echo $string_attribute_free;
?>
  • `\w` already includes numbers so the character class and `123456` is useless. The greedy dot matching is also exposing this pattern to overmatching. Regex is a bad idea for this task, this pattern is an extra bad idea. – mickmackusa Jan 15 '21 at 20:46