I am new to Python and want to strip HTML tags using regex. I use the function below:

    import re

    def filter_tags(htmlstr):
        re_cdata = re.compile(r'//<!\[CDATA\[.*//\]\]>', re.DOTALL)
        re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.DOTALL)  # script
        re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style
        re_br = re.compile(r'<br\s*?/?>')
        re_h = re.compile(r'</?\w+[^>]*>')  # any remaining tag
        re_function = re.compile('')  # placeholder, not used yet
        re_comment = re.compile(r'<!--[^>]*-->')
        s = re_cdata.sub('', htmlstr)
        s = re_script.sub('', s)
        s = re_style.sub('', s)
        s = re_br.sub('', s)
        s = re_h.sub('', s)
        s = re_comment.sub('', s)
        s = re.sub(r'\t', '', s)  # remove tabs
        s = re.sub(' ', '', s)  # remove spaces
        return s

Most tags and code are removed, but some JavaScript functions survive. For example, this is left in the output:

(function(){
NTES.ajax.importJs('http://news.163.com/special/hot_tags_recommend_data/',function(){
varname1,name2,len1,len2,width1,width2,left2;
varloveData=['拎婚房待嫁北京爷们','请网友鉴定是否美女'];
if(hotTagsData.count&&hotTagsData.count>0){
varcode='#from=article',
html=[],
item={name:'',url:''};
for(vari=0;i<hotTagsData.data.length&&i<4;i++){
item=hotTagsData.data[i];
html.push(''+item.name+'');
if(i==1){name1=item.name;}
if(i==2){name2=item.name;}
}
html.push(loveData[0]);
html.push(loveData[1]);
NTES('#js-extraTagList').innerHTML=html.join('');
len1=name1.replace(/[^\x00-\xff]/gi,"aa").length;
len2=name2.replace(/[^\x00-\xff]/gi,"aa").length;
width1=Math.floor((len1/(len1+len2))*271);
width2=271-width1;
left2=96+width1+19;
NTES('.extra-tag-1').addCss('width:'+width1+'px');
NTES('.extra-tag-2').addCss('width:'+width2+'px;left:'+left2+'px;');
}
},'gbk');
})();

As you can see, there are many functions like this. How can I remove them using regex? Thanks a lot.

Chris
  • http://stackoverflow.com/a/1732454/876937 – Xophmeister Oct 20 '15 at 16:07
  • Rolling out your own HTML sanitizer is not a good idea. There are too many edge cases to worry about. Anywhere security is a concern, you should always prefer a tried-and-tested library. Take a look here: http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter – DaoWen Oct 20 '15 at 16:28
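
As a rough illustration of the parser-based route the comments point toward, here is a minimal sketch using Python's standard-library html.parser. The class name TextExtractor and the helper strip_html are hypothetical names made up for this example. It only extracts text and skips script/style bodies; it is not a sanitizer and should not be relied on where security matters.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, ignoring <script> and <style> bodies."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <SCRIPT> arrives as "script".
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments are delivered to handle_comment, so they never reach here.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(htmlstr):
    parser = TextExtractor()
    parser.feed(htmlstr)
    parser.close()
    return parser.text()


# Made-up sample input for illustration.
sample = "<p>Hello <b>world</b></p><SCRIPT>var a = 1 < 2;</SCRIPT><style>p{}</style>"
print(strip_html(sample))
```

For anything security-related, a maintained sanitizer library, as suggested in the comment above, is still the safer choice.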

2 Answers

Your regular expression <\s*script[^>]*>[^<]*<\s*/\s*script\s*> should not contain [^<]*. That construct cannot cross a < character, so the match fails whenever the script body itself contains one, as in i<hotTagsData.data.length in your leftover code. Reserve negated character classes for matching inside a tag. For the body, use the non-greedy quantifier *?, which gives <\s*script[^>]*>.*?<\s*/\s*script\s*>. Make the same change everywhere you used this pattern, including the style and comment regexes, and pass re.DOTALL so that . also matches newlines.

This should take care of the majority of cases. However, it still does not protect you from a script whose body contains the literal string '</script>' (for example inside a JavaScript string), because the match stops at the first closing tag. Such cases should be few and far between, and if one arises you can strip it out manually.
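
To make the difference concrete, here is a small sketch that runs both patterns against a made-up input (the html string is an assumption for illustration, not taken from the page in the question). It also includes a greedy .* variant to show why the ? matters when a page has more than one script block.

```python
import re

# Made-up sample: the first script body contains a '<' comparison.
html = (
    "<p>before</p>"
    "<script>for (var i = 0; i < 4; i++) {}</script>"
    "<p>middle</p>"
    "<script>var x = 1;</script>"
    "<p>after</p>"
)

# Pattern from the question: [^<]* cannot cross the '<' in "i < 4".
strict = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.DOTALL)

# Suggested pattern: .*? stops at the first closing script tag.
lazy = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.DOTALL)

# Greedy variant for contrast: .* runs on to the LAST closing script tag.
greedy = re.compile(r'<\s*script[^>]*>.*<\s*/\s*script\s*>', re.DOTALL)

strict_result = strict.sub('', html)   # first script block survives
lazy_result = lazy.sub('', html)       # both script blocks removed
greedy_result = greedy.sub('', html)   # also swallows <p>middle</p>

print(strict_result)
print(lazy_result)
print(greedy_result)
```

The strict pattern only removes the second script, the lazy one removes both and keeps the paragraph between them, and the greedy one deletes everything from the first opening tag to the last closing tag.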

Paul Carlton
  • Thank you very much. It works and all of the functions are removed, but a little code is still left, like this: `varcpm_rdm=Math.random(); adInfoTempSc= { src:"http://img2.126.net/ntesrich/2015/0922/1442887187409_89q7.swf", url:"http://g.163.com/a?CID=37873&Values=1760993544&Redirect=http://e.cn.miaozhen.com/r/k=2012070&p=6we7m&ro=sm&dx=0&rt=2&ns=__IP__&ni=__IESID__&v=__LOC__&nd=__DRA__&np=__POS__&nn=__APP__&o=http://cars.fxauto.com.cn/s500/003/", key:"8531446021442887975191892" } if(cpm_rdm>0.6&&cpm_rdm'); } ` – Chris Oct 21 '15 at 01:29

I solved this problem with DataHerder's answer. After I changed my regular expressions the way he suggests, most of the code was removed, but a little JavaScript remained. I looked at the raw HTML and found that the code that was not removed looks like this:

<SCRIPT LANGUAGE="JavaScript">
var cpm_rdm=Math.random();
</SCRIPT>
<!--五分之一视窗 020903-->
<SCRIPT type="text/javascript">
adInfoTempSc = 
{
    src:"http://img2.126.net/ntesrich/2015/0922/1442887187409_89q7.swf",
    url:"http://g.163.com/a?CID=37873&Values=1760993544&Redirect=http://e.cn.miaozhen.com/r/k=2012070&p=6we7m&ro=sm&dx=0&rt=2&ns=__IP__&ni=__IESID__&v=__LOC__&nd=__DRA__&np=__POS__&nn=__APP__&o=http://cars.fxauto.com.cn/s500/003/",
    key:"8531446021442887975191892"
}
if(cpm_rdm>0.6&&cpm_rdm<0.8){
document.write('<scr'+'ipt type="text/javascript" src="http://img2.126.net/ntesrich/2015/0901/scbox-2015.09.01.js"></scr'+'ipt>');
}
</SCRIPT>

I thought the reason this code could not be removed is that the tags are written in upper case, like <SCRIPT LANGUAGE="JavaScript">. So I added a flag to my regular expressions, and now I can filter all the tags and code. Thanks again. The regexes now:

re_cdata = re.compile(r'//<!\[CDATA\[.*//\]\]>', re.DOTALL)
re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.DOTALL | re.I)
re_style = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.DOTALL | re.I)
re_br = re.compile(r'<br\s*?/?>')
re_h = re.compile(r'</?\w+.*?>', re.DOTALL)
re_comment = re.compile(r'<!--.*?-->', re.DOTALL)

re.I (short for re.IGNORECASE) makes the match case-insensitive, so uppercase tags such as <SCRIPT> are matched as well.
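
A minimal sketch of how the updated patterns behave on uppercase tags. The flags parameter and the shortened sample string are assumptions added for this demonstration. The CDATA and br patterns are left out for brevity.

```python
import re


def filter_tags(htmlstr, flags=re.DOTALL | re.I):
    # 'flags' is a hypothetical parameter so the effect of re.I can be compared.
    re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', flags)
    re_style = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', flags)
    re_h = re.compile(r'</?\w+.*?>', re.DOTALL)
    re_comment = re.compile(r'<!--.*?-->', re.DOTALL)
    s = re_script.sub('', htmlstr)
    s = re_style.sub('', s)
    s = re_h.sub('', s)
    s = re_comment.sub('', s)
    return s.strip()


# Shortened, made-up sample modelled on the raw HTML shown above.
sample = (
    '<SCRIPT LANGUAGE="JavaScript">\n'
    'var cpm_rdm=Math.random();\n'
    '</SCRIPT>\n'
    '<!-- ad slot -->\n'
    '<p>Hello</p>'
)

with_ignorecase = filter_tags(sample)                      # script body removed
without_ignorecase = filter_tags(sample, flags=re.DOTALL)  # script body left behind

print(repr(with_ignorecase))
print(repr(without_ignorecase))
```

Without re.I the script pattern does not match the uppercase tags, so only the tags themselves are stripped by re_h and the JavaScript body stays in the output.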

Chris