4

How to allow only a specific set of HTML tags and attributes using a general regex?

Allowed HTML Tags:

p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption

Allowed HTML Attributes:

alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel

To test these regular expressions, I am using the RegExr site.

The regex below targets attributes:

((alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\s*=\s*["|']?[/.?=&#\w\s:;-]+["|']?)

The regex below targets HTML tags:

<(?>/?)(?:[^p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption|P]|[p|cufontext|cufoncanvas|P][^\s>/])[^>]*>

I tried to merge the two like this, but it does not filter properly:

<(?>/?)(?:[^p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption|P]|[p|cufontext|cufoncanvas|P]|((alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\s*=\s*["|']?[/.?=&#\w\s:;-]+["|']?)[^\s>/])[^>]*>
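One likely reason the tag pattern misbehaves: `[^p|body|b|...]` is a bracket expression, so it excludes single characters (`p`, `|`, `b`, `o`, `d`, `y`, ...) rather than whole tag names. A minimal Python sketch of the difference, using a shortened tag list and a negative lookahead as one possible alternative (the names `char_class`, `not_allowed`, `samples` and `removed` are only illustrative):

```python
import re

# A bracket expression is a set of single characters, not a list of words:
# this excludes any one of the characters p, |, b, o, d, y.
char_class = re.compile(r"[^p|body|b]")

# A negative lookahead over an alternation rejects whole names instead.
# It matches any tag whose name is NOT p, body or b (shortened list).
not_allowed = re.compile(r"<(?!/?(?:p|body|b)(?:\s|/?>))[^>]*>")

samples = ["<p>", "</b>", "<b class='x'>", "<br>", "<span>"]
# Tags that the lookahead version would select for removal.
removed = [s for s in samples if not_allowed.fullmatch(s)]
```

With the shortened list, `removed` contains only `<br>` and `<span>`, while `char_class` cannot even start a match on `body` or `div` because their first letters are in the excluded set.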

My intention is to allow only this set of attributes and HTML tags.

All other tags and attributes should be removed, and their content should be kept.

EXAMPLE:

INPUT HTML:

<h2 class="callout" cufid="2"><cufon style="width: 88px; height: 18px" class="cufon cufon-vml" alt="Lorem  "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 107px; height: 29px" path=" m39,-257 l75,-257,75,0,39,0,39,-257 x e m-41,-394 l2097,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m61,-157 c67,-174,93,-193,115,-192,142,-192,167,-170,166,-142 l166,0,131,0,131,-137 c134,-180,68,-170,61,-142 l61,0,27,0,26,-189,61,-189,61,-157 x e m-144,-394 l1994,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-144,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-326,-394 l1812,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-326,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-427,-394 l1711,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-427,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m11,-94 c11,-144,47,-192,94,-192,146,-192,177,-149,177,-94,177,-44,141,2,94,2,41,1,11,-39,11,-94 x m93,-178 c29,-172,34,-21,93,-14,155,-20,157,-172,93,-178 x e m-548,-394 l1590,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-548,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m15,-86 c15,-142,46,-192,95,-192,114,-192,127,-185,133,-174 l133,-257,168,-257,168,-34 c154,-10,130,2,95,2,52,2,15,-42,15,-86 x m134,-153 c128,-167,117,-177,98,-178,68,-178,54,-147,54,-86,54,-24,94,2,133,-24 x e m-728,-394 l1410,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-728,-394" coordsize="2138,577"></cvml:shape><cvml:shape 
style="width: 107px; height: 29px" path=" m62,-51 c61,-9,125,-16,132,-47 l132,-189,166,-189,166,0,132,0,132,-28 c125,-10,106,1,81,2,-2,5,36,-116,28,-189 l62,-189,62,-51 x e m-909,-394 l1229,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-909,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m150,-6 c86,20,16,-20,16,-89,16,-163,78,-215,150,-182 l150,-158 c112,-211,48,-154,55,-94,49,-36,110,12,149,-31 x e m-1093,-394 l1045,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1093,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-1248,-394 l890,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1248,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m61,-157 c67,-174,93,-193,115,-192,142,-192,167,-170,166,-142 l166,0,131,0,131,-137 c134,-180,68,-170,61,-142 l61,0,27,0,26,-189,61,-189,61,-157 x e m-1336,-394 l802,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1336,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m15,-88 c6,-164,80,-221,134,-175 l134,-189,168,-189 c159,-82,207,86,87,80,64,79,45,75,31,66 l31,39 c68,87,150,59,134,-18,94,37,6,-25,15,-88 x m96,-178 c35,-178,36,-9,106,-14,119,-15,128,-21,133,-31 l133,-156 c121,-171,108,-178,96,-178 x e m-1518,-394 l620,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1518,-394" coordsize="2138,577"></cvml:shape><cvml:shape style="width: 107px; height: 29px" path=" m-1693,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1693,-394" coordsize="2138,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 45px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape 
style="width: 65px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-41,-394 l1256,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-201,-394 l1096,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-201,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m13,-36 c28,-4,92,-11,88,-53,84,-91,9,-100,12,-147,15,-193,77,-204,110,-176 l110,-152 c100,-176,50,-189,45,-156,50,-111,121,-110,121,-59,121,-6,50,20,13,-12 l13,-36 x e m-360,-394 l937,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-360,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-483,-394 l814,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-483,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-571,-394 l726,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-571,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-731,-394 
l566,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-731,-394" coordsize="1297,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m-852,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-852,-394" coordsize="1297,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 45px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 65px; height: 29px" path=" m85,-32 c85,-74,123,-134,113,-189 l148,-189,187,-33 c191,-84,214,-142,226,-189 l254,-189 c238,-128,211,-64,202,0 l160,0,131,-118 c121,-77,104,-37,100,0 l59,0,7,-189,42,-189 x e m-41,-394 l1251,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m11,-94 c11,-144,47,-192,94,-192,146,-192,177,-149,177,-94,177,-44,141,2,94,2,41,1,11,-39,11,-94 x m93,-178 c29,-172,34,-21,93,-14,155,-20,157,-172,93,-178 x e m-292,-394 l1000,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-292,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-472,-394 l820,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-472,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m60,0 l27,0,27,-257,60,-257,60,0 x e m-593,-394 l699,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-593,-394" coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m15,-86 c15,-142,46,-192,95,-192,114,-192,127,-185,133,-174 l133,-257,168,-257,168,-34 c154,-10,130,2,95,2,52,2,15,-42,15,-86 x m134,-153 c128,-167,117,-177,98,-178,68,-178,54,-147,54,-86,54,-24,94,2,133,-24 x e m-666,-394 l626,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-666,-394" 
coordsize="1292,577"></cvml:shape><cvml:shape style="width: 65px; height: 29px" path=" m-847,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-847,-394" coordsize="1292,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 44px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 64px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-41,-394 l1226,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-142,-394 l1125,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-142,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-263,-394 l1004,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-263,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m95,-27 c108,-97,122,-119,145,-189 l174,-189 c155,-128,125,-66,113,0 l70,0,7,-189,41,-189 x e m-422,-394 l845,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-422,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-589,-394 l678,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-589,-394" 
coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m60,0 l27,0,27,-257,60,-257,60,0 x e m-749,-394 l518,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-749,-394" coordsize="1267,577"></cvml:shape><cvml:shape style="width: 64px; height: 29px" path=" m-822,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-822,-394" coordsize="1267,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 36px; height: 18px" class="cufon cufon-vml" alt="Lorem "><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 55px; height: 29px" path=" m85,-32 c85,-74,123,-134,113,-189 l148,-189,187,-33 c191,-84,214,-142,226,-189 l254,-189 c238,-128,211,-64,202,0 l160,0,131,-118 c121,-77,104,-37,100,0 l59,0,7,-189,42,-189 x e m-41,-394 l1063,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-292,-394 l812,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-292,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-376,-394 l728,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-376,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m61,-160 c90,-207,171,-202,171,-141 l171,2,136,2,136,-140 c133,-179,79,-176,61,-145 l61,2,26,2,26,-257,61,-257,61,-160 x e m-477,-394 l627,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-477,-394" coordsize="1104,577"></cvml:shape><cvml:shape style="width: 55px; height: 29px" path=" m-659,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-659,-394" 
coordsize="1104,577"></cvml:shape></cufoncanvas><cufontext>Lorem </cufontext></cufon><cufon style="width: 60px; height: 18px" class="cufon cufon-vml" alt="Lorem ipsum"><cufoncanvas style="height: 29px; top: -5px; left: -2px"><cvml:shape style="width: 78px; height: 29px" path=" m185,-35 c185,-12,172,0,146,0 l27,0,27,-257 c84,-255,173,-268,179,-221,169,-240,103,-239,63,-238 l63,-163 c102,-164,138,-164,148,-136,141,-144,90,-143,63,-143 l63,-20 c103,-22,170,-12,185,-35 x e m-41,-394 l1514,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-41,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m160,-158 c176,-210,259,-200,259,-141 l259,0,225,0,225,-138 c224,-175,174,-179,161,-142 l161,0,126,0,126,-139 c127,-180,62,-176,62,-141 l62,0,27,0,27,-189,62,-189,62,-158 c73,-198,152,-205,160,-158 x e m-217,-394 l1338,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-217,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m67,0 l32,0,32,-190,67,-190,67,0 x m69,-241 c69,-229,60,-221,49,-221,38,-221,29,-230,29,-241,29,-252,38,-261,49,-261,60,-261,69,-253,69,-241 x e m-492,-394 l1063,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-492,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m63,-154 c69,-182,103,-204,127,-183 l127,-152 c107,-185,64,-157,63,-128 l63,0,29,0,29,-189,63,-189,63,-154 x e m-569,-394 l986,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-569,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m44,-175 c74,-204,147,-194,147,-143 l147,0,112,0,112,-23 c96,12,7,13,14,-45,18,-86,44,-97,87,-114,126,-130,124,-173,84,-175,66,-175,53,-166,44,-149 l44,-175 x m112,-116 c94,-97,41,-84,47,-43,52,-8,96,-9,112,-31 l112,-116 x e m-690,-394 l865,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-690,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" 
path=" m68,-172 l68,0,33,0,33,-172,3,-172 c32,-188,54,-208,68,-232 l68,-189,108,-189,108,-168 c100,-173,82,-172,68,-172 x e m-849,-394 l706,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-849,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m60,-58 c71,-14,125,3,152,-36 l151,-13 c94,27,15,-19,15,-91,15,-143,44,-192,94,-192,124,-192,152,-174,152,-147,152,-99,82,-86,60,-58 x m120,-149 c121,-167,109,-179,94,-179,62,-179,47,-115,55,-75,78,-95,120,-108,120,-149 x e m-950,-394 l605,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-950,-394" coordsize="1555,577"></cvml:shape><cvml:shape style="width: 78px; height: 29px" path=" m13,-36 c28,-4,92,-11,88,-53,84,-91,9,-100,12,-147,15,-193,77,-204,110,-176 l110,-152 c100,-176,50,-189,45,-156,50,-111,121,-110,121,-59,121,-6,50,20,13,-12 l13,-36 x e m-1110,-394 l445,183 ns e" stroked="f" fillcolor="#c0bbaf" coordorigin="-1110,-394" coordsize="1555,577"></cvml:shape></cufoncanvas><cufontext>Lorem ipsum</cufontext><cvml:shape coordsize="1000,1000"></cvml:shape></cufon></h2>

<div class="contentContainer">
<p>Lorem ipsum dolor sit amet, Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
<p>Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
</div>

EXPECTED FILTERED OUTPUT

<h2>Lorem Lorem Lorem Lorem Lorem Lorem ipsum</h2>

<div>
<p>Lorem ipsum dolor sit amet, Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
<p>Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet Lorem ipsum dolor sit amet</p>
</div>

One more example, to make it clearer:

Input

  1. <abc id="test">new tag and known attribute</abc>
  2. <a id="test" href="http://www.google.com/" xyz="testattr">known tag, attribute and one unknown attr</a>

Output

  1. new tag and known attribute
  2. <a id="test" href="http://www.google.com/">known tag, attribute and one unknown attr</a>

I appreciate the help.

Siva Charan

4 Answers

3

Here is a Perl solution using a PCRE-compatible regex. It is not aware of comments, doctype, CDATA, etc. Those should be added for a more complete solution.

# allowed tag and attribute names

my $allowed_tags_open = 'p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|a|tr|td|table|tbody|label|div|sup|sub|caption';

my $allowed_tags_self_closing = 'img|br|hr';

my $allowed_attributes = 'alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel';

$allowed_attributes .= '|style'; # for testing


# definitions for matching allowed tag and attribute names

my $re_tags = qr~(?(DEFINE)
    (?<tags_open>
        /?+
        (?>
            (?: $allowed_tags_open )
            (?! [^\s>/] )       # from (?&tagname)
        )
    )
    (?<tags_self_closing>
        (?>
            (?: $allowed_tags_self_closing )
            (?! [^\s>/] )       # from (?&tagname)
        )
    )
    (?<tags>    (?> (?&tags_open) | (?&tags_self_closing) )    )
    (?<attribs>
        (?>
            (?: $allowed_attributes )
            (?! [^\s=/>] )      # from (?&attname)
        )
    )
)~xi;


# definitions for matching the tags
# trying to follow compatible tokenization characteristics of modern browsers

my $re_defs = qr~(?(DEFINE)
    (?<tagname> [a-z/][^\s>/]*+    )    # will match the leading / in closing tags
    (?<attname> [^\s>/][^\s=/>]*+    )  # first char can be pretty much anything, including =
    (?<attval>  (?>
                    "[^"]*+" |
                    \'[^\']*+\' |
                    [^\s>]*+            # unquoted values can contain quotes, = and /
                )
    ) 
    (?<attrib>  (?&attname)
                (?: \s*+
                    = \s*+
                    (?&attval)
                )?+
    )
    (?<crap>    (?!/>)[^\s>]    )       # most crap inside tag is ignored, but don't eat the last / in self closing tags
    (?<tag>     <(?&tagname)
                (?: \s*+                # spaces between attributes not required: <b/foo=">"style=color:red>bold red text</b>
                    (?>
                        (?&attrib) |    # order matters
                        (?&crap)        # if not an attribute, eat the crap
                    )
                )*+
                \s*+ /?+
                >
    )
)~xi;



sub sanitize_html{
    my $str = shift;
    $str =~ s/(?&tag) $re_defs/ sanitize_tag($&) /gexo;
    return $str;
}


sub sanitize_tag{
    my $tag = shift;

    my ($name, $attr, $end) =
        $tag =~ /^ < ((?&tags)) (.*?) ( \/?+ > ) $   $re_tags/xo
        or return '';  # return empty string if not allowed tag

    # return a new clean closing tag if it's a closing tag
    return "<$name>" if substr($name, 0, 1) eq '/';

    # clean attributes
    return "<$name" . sanitize_attributes($attr) . $end;
}


sub sanitize_attributes{
    my $attr = shift;
    my $new = '';

    $attr =~ s{
        \G
        \s*+                 # spaces between attributes not required
        (?>
            ( (?&attrib) ) | # order matters
            (?&crap)         # if not an attribute, eat the crap
        )

        $re_defs
    }{
        my $att = $1;
        $new .= " $att" if $att && $att =~ /^(?&attribs) $re_tags/xo;
        '';
    }gexo;

    return $new;
}

Test (ideone):

my $test = <<'_TEST_';
<b>simple</b>
self <img>closing</img>

<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">

<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold
_TEST_

print $test, "\n";
print '-' x 70, "\n";
print sanitize_html $test;

Output:

<b>simple</b>
self <img>closing</img>

<abc id="test">new tag and known attribute</abc>
<a id="test" xyz="testattr" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a id="foo">

<b/#ñ%&/()!¢º`=">="">crap be gone</b> not bold<br/x"/>
<b/style=color:red;background:url("x.gif");/*="still.CSS*/ id="x"zz"<script class="x">tricky</b/ x=">"//> not bold

----------------------------------------------------------------------
<b>simple</b>
self <img>closing

new tag and known attribute
<a id="test" href="/foo">one unknown attr</a>
<a id="foo">attr in closing tag</a>

<b>crap be gone</b> not bold<br/>
<b style=color:red;background:url("x.gif");/*="still.CSS*/ id="x" class="x">tricky</b> not bold

See how your browser parses the tricky tags: jsFiddle

Qtax
  • Can you please add one more regex in a general format? I am not able to convert the Perl regex to a general regex format. – Siva Charan Mar 13 '12 at 12:16
  • @SivaCharan, there is no "general format". Exactly what flavor do you need (i.e., what language are you using)? If you are using PHP (or anything else with PCRE), these expressions will work as they are written; you just need to change the quoting and surrounding code. – Qtax Mar 13 '12 at 12:28
  • Actually, I'm using it in Classic ASP. – Siva Charan Mar 13 '12 at 12:37
  • By "general format" I mean that the regex should be usable in other languages with minimal change, so other users can reuse it easily. If you can give a better solution in a general format, I can mark your answer as accepted. But please make sure you give a single-line regex. Thanks for spending time on this. – Siva Charan Mar 13 '12 at 12:41
1

This seems very similar to a question I posted a while back:

How do I filter all HTML tags except a certain whitelist?

richardtallent
  • Thanks for posting. That link covers only HTML tags, but I need to target attributes too. If you have any ideas or suggestions, please post them. – Siva Charan Feb 29 '12 at 06:57
0

You can't parse HTML with regex (there's a reason why that is one of the top-voted posts on Stack Overflow).
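For comparison, here is a minimal sketch of the parser-based route in Python, using the standard library's `html.parser`. The class name `WhitelistFilter` and the function `filter_html` are made up for illustration, and this is not a security-reviewed sanitizer: it only shows that a tokenizing parser makes the tag and attribute whitelist straightforward to express.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = set(
    "p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|"
    "table|tbody|label|div|sup|sub|caption".split("|")
)
ALLOWED_ATTRS = set(
    "alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|"
    "summary|class|id|name|target|nowrap|scope|axis|cellpadding|"
    "cellspacing|dir|lang|rel".split("|")
)
VOID_TAGS = {"br", "hr", "img"}  # elements that never get a closing tag


class WhitelistFilter(HTMLParser):
    def __init__(self):
        # Keep entity and character references as written in the input.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs, closer):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still emitted
        parts = [tag]
        for name, value in attrs:
            if name in ALLOWED_ATTRS:
                # The parser unescapes attribute values, so escape them again.
                parts.append(name if value is None
                             else f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, "/>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def filter_html(text):
    parser = WhitelistFilter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Called on the second example from the question, `filter_html` keeps `id` and `href` on the `a` tag and drops `xyz`; called on the first, it returns only the text. Comments are dropped, and the text inside removed elements is kept, which matches the stated requirement but may not be what you want for `script` content.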

Lachlan McDonald
0

I finally achieved this in two steps:

//Allowed list of HTML Tags

<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>

//Allowed list of HTML Attributes

\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel))\w+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?)

Using the two regexes above, I filtered my whole HTML.

EDIT:

I have now reduced it to one regex, which matches every tag and attribute that is not on the allowed lists so that they can be removed:

(<(?!/?(p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|table|tbody|label|div|sup|sub|caption)(>|\s))[^<]+?>)|(\s(?!(alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|cellspacing|dir|lang|rel)\b)[\w:]+(\s*=\s*["|']?[/.,#?\w\s:;-]+["|']?))
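A short Python sketch of how the combined regex can be applied, assuming every match is replaced with an empty string. The names `DISALLOWED` and `strip_disallowed` are illustrative. It handles both examples from the question, but the last two calls show inputs where it removes or keeps the wrong text, because the attribute alternative is not limited to the inside of a tag and `[^<]+?>` stops at the first `>` even inside a quoted value.

```python
import re

TAGS = ("p|body|b|u|em|strong|ul|ol|li|h1|h2|h3|h4|h5|h6|hr|a|br|img|tr|td|"
        "table|tbody|label|div|sup|sub|caption")
ATTRS = ("alt|href|tcmuri|title|height|width|align|valign|rowspan|colspan|src|"
         "summary|class|id|name|title|target|nowrap|scope|axis|cellpadding|"
         "cellspacing|dir|lang|rel")

# The combined regex from the answer, assembled from the two lists.
DISALLOWED = re.compile(
    r"(<(?!/?(" + TAGS + r")(>|\s))[^<]+?>)"
    r"|(\s(?!(" + ATTRS + r")\b)[\w:]+(\s*=\s*[\"|']?[/.,#?\w\s:;-]+[\"|']?))"
)


def strip_disallowed(html):
    # Remove every disallowed tag or attribute the regex finds.
    return DISALLOWED.sub("", html)


ok_1 = strip_disallowed('<abc id="test">new tag and known attribute</abc>')
ok_2 = strip_disallowed(
    '<a id="test" href="http://www.google.com/" xyz="testattr">'
    'known tag, attribute and one unknown attr</a>'
)

# Text that merely looks like an attribute is removed from the content.
bad_1 = strip_disallowed('<p>Set x = 5</p>')          # -> '<p>Set</p>'
# A '>' inside a quoted value ends the tag match too early.
bad_2 = strip_disallowed('<abc title="a>b">x</abc>')  # -> 'b">x'
```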
Siva Charan
  • Do not use these. Fails on for example ``, and `fail`. Please tell me what's wrong with the expressions in my answer. – Qtax Mar 13 '12 at 11:50
  • @Qtax: Unfortunately, I didn't come across this type of example format. Thanks for pointing out a scenario where it fails. But this is a rare case which I may encounter. – Siva Charan Mar 13 '12 at 12:21
  • Yes, this is one 'rare' case you have not encountered. There are more. No matter how much you fix up your regex, there will still be more cases it doesn't cover. This is because you cannot parse HTML with regular expressions. Not possible. See this: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – AHM Apr 22 '12 at 12:45
  • @AHM: Why has this been downvoted? My answer meets the requirement as far as possible when the input is passed as a string. – Siva Charan Apr 22 '12 at 12:50
  • In general, we can't parse HTML with regex. That I accept. But in my question, the input is passed as a string, which I am handling through my regex. So it solves the problem. – Siva Charan Apr 22 '12 at 12:53
  • @AHM: I have put a lot of effort into building this regex to solve my problem. There are many scenarios where regex is not the best solution, and there are also some scenarios where regex solves the problem for the time being. I would be grateful if you could remove your downvote. – Siva Charan Apr 22 '12 at 13:10
  • `But this is a rare case which I may encounter.` This is a cross-site scripting vulnerability (see https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet) which could allow a malicious user to (for example) add JavaScript that listens for changes on an input field (such as a password field) and sends that data to a remote server. Hence this could pose a serious risk to users of your site. – Aaron Newton Nov 22 '13 at 08:58