I have written a script to retrieve emails from my IMAP server. All works fine.

I'd like to keep some of the HTML tags and so have written additional code to strip out tags not included within my allowed list - again, all working fine.

My issue is that some emails received have additional content which I also want removed. For example, a recent email received contained...

v:* {behavior:url(#default#VML);} o:* {behavior:url(#default#VML);} w:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);}

at the top of the email content.

How can I remove such content to ensure I only capture the actual email content?

I'd rather not use the plain text content (unless that is the only content within the email) as emails may contain links or emphasise certain phrases which I need to maintain.

Thanks Mike

Mike

1 Answer

You could use preg_replace() to strip the unwanted content.

Try something like:

$str = "
content
v:* {behavior:url(#default#VML);}
some other content
o:* {behavior:url(#default#VML);}
some other content
w:* {behavior:url(#default#VML);}
some other content
.shape {behavior:url(#default#VML);}
some other content
" ;

// Match a one-letter selector (v:*, o:*, w:*) or .shape, followed by its {...} rule.
$str = preg_replace('~([a-z]:\*|\.shape) \{(.*?)\}~', '', $str);
var_dump($str);

This outputs:

string(89) "
content

some other content

some other content

some other content

some other content
"
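
The pattern above only covers these four VML selectors. Rules like these usually sit inside a <style> element in Outlook-generated HTML, so if the tags are stripped but their contents are kept, the CSS text leaks into the output. A minimal sketch, assuming you still have the raw HTML body at that point: remove whole <style> elements and HTML comments first, then apply your tag filter. The helper name remove_style_blocks() is hypothetical, and strip_tags() merely stands in for your own allow-list code.

```php
<?php
// Hypothetical helper (not part of the original answer): drop <style>
// elements and HTML comments before the allowed-tags filter runs.
function remove_style_blocks(string $html): string
{
    // Remove <style ...>...</style>, case-insensitively and across newlines.
    // preg_replace() returns null on failure, so fall back to the input.
    $html = preg_replace('~<style\b[^>]*>.*?</style>~is', '', $html) ?? $html;
    // Remove HTML comments, which also covers Outlook conditional comments.
    $html = preg_replace('~<!--.*?-->~s', '', $html) ?? $html;
    return $html;
}

$html  = '<style>v\:* {behavior:url(#default#VML);}</style><p>Hello <b>world</b></p>';
// strip_tags() stands in for the asker's own tag filter (an assumption).
$clean = strip_tags(remove_style_blocks($html), '<p><b><a>');
var_dump($clean);
```

With this sample input, only the <p> and <b> markup and its text should remain. Regular expressions are a pragmatic shortcut here; for badly malformed mail a real HTML parser such as DOMDocument may be more reliable.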
Syscall