Background

I have a console app that retrieves email from an Office 365 Outlook account. I am using the Outlook API 2.0.

Problem

I am reading the email's body through the API, but the body arrives as an HTML string. I normally strip the HTML tags with my go-to regex, but Outlook adds CSS class definitions (a <style> block) to its HTML, and my regular expression leaves them behind.

Code

using System.Text.RegularExpressions;

// Verbatim string literal (@"..."): it may span lines, and embedded quotes are doubled ("").
string body = @"<html>
<head>
<meta http-equiv=""Content-Type"" content=""text/html; charset=utf-8"">
<meta content=""text/html; charset=us-ascii"">
<meta name=""Generator"" content=""Microsoft Word 15 (filtered medium)"">
<style>
<!--
@font-face
    {font-family:""Cambria Math""}
@font-face
    {font-family:Calibri}
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:""Calibri"",sans-serif}
a:link, span.MsoHyperlink
    {color:#0563C1;
    text-decoration:underline}
a:visited, span.MsoHyperlinkFollowed
    {color:#954F72;
    text-decoration:underline}
span.EmailStyle17
    {font-family:""Calibri"",sans-serif;
    color:windowtext}
.MsoChpDefault
    {font-family:""Calibri"",sans-serif}
@page WordSection1
    {margin:1.0in 1.0in 1.0in 1.0in}
div.WordSection1
    {}
-->
</style>
</head>
<body lang=""EN-US"" link=""#0563C1"" vlink=""#954F72"">
<div class=""WordSection1"">
<p class=""MsoNormal"">&nbsp;</p>
</div>
<hr>
<p><b>Confidentiality Notice:</b> This e-mail is intended only for the addressee named above. It contains information that is privileged, confidential or otherwise protected from use and disclosure. If you are not the intended recipient, you are hereby notified
 that any review, disclosure, copying, or dissemination of this transmission, or taking of any action in reliance on its contents, or other use is strictly prohibited. If you have received this transmission in error, please reply to the sender listed above
 immediately and permanently delete this message from your inbox. Thank you for your cooperation.</p>
</body>
</html>
";
// Lazy match of everything between < and >; by default . does not match newlines.
string viewString1 = Regex.Replace(body, "<.*?>", string.Empty);
string viewString12 = viewString1.Replace("&nbsp;", string.Empty);

Results from my Regular expression

<!--
@font-face
    {font-family:"Cambria Math"}
@font-face
    {font-family:Calibri}
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:11.0pt;
    font-family:"Calibri",sans-serif}
a:link, span.MsoHyperlink
    {color:#0563C1;
    text-decoration:underline}
a:visited, span.MsoHyperlinkFollowed
    {color:#954F72;
    text-decoration:underline}
span.EmailStyle17
    {font-family:"Calibri",sans-serif;
    color:windowtext}
.MsoChpDefault
    {font-family:"Calibri",sans-serif}
@page WordSection1
    {margin:1.0in 1.0in 1.0in 1.0in}
div.WordSection1
    {}
-->







Confidentiality Notice: This e-mail is intended only for the addressee named above. It contains information that is privileged, confidential or otherwise protected from use and disclosure. If you are not the intended recipient, you are hereby notified
 that any review, disclosure, copying, or dissemination of this transmission, or taking of any action in reliance on its contents, or other use is strictly prohibited. If you have received this transmission in error, please reply to the sender listed above
 immediately and permanently delete this message from your inbox. Thank you for your cooperation.

Objective

I need to be able to strip the HTML tags from the string and also remove the CSS class definitions that Outlook places in the body.

Thomas Ayoub
EasyE
    By the way, you might want to consider replacing &nbsp; with a blank (white) space, which is what it represents (not empty). – JuanR Apr 18 '17 at 15:29

1 Answer

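You can replace <!--.*?--> with String.Empty using the regex option Singleline (which makes . match newlines). Because the comment block in your sample contains no > before its closing -->, your existing <.*?> pattern matches the whole block once Singleline is set: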

string viewString1 = Regex.Replace(body, "<.*?>", string.Empty, RegexOptions.Singleline);
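
The one-line fix relies on the CSS containing no > character. As a minimal sketch of a slightly more defensive variant, the helper below removes <style> and <script> elements first, then comments, then the remaining tags, and finally decodes entities such as &nbsp;. The class and method names (HtmlToText, Strip) are hypothetical and not part of the Outlook API; WebUtility.HtmlDecode comes from System.Net. It also assumes that collapsing all whitespace into single spaces is acceptable for your output.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToText
{
    // Hypothetical helper name; not taken from the original post or any library.
    public static string Strip(string html)
    {
        // Singleline lets . cross line breaks; IgnoreCase handles <STYLE> etc.
        const RegexOptions opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // 1. Drop whole <style>/<script> elements, including their contents.
        string text = Regex.Replace(html, @"<(style|script)\b[^>]*>.*?</\1\s*>", string.Empty, opts);

        // 2. Drop any remaining HTML comments, even if they contain '>'.
        text = Regex.Replace(text, @"<!--.*?-->", string.Empty, opts);

        // 3. Drop the remaining tags.
        text = Regex.Replace(text, @"<[^>]+>", string.Empty, opts);

        // 4. Decode entities (&nbsp;, &amp;, ...), then map non-breaking spaces to plain spaces.
        text = WebUtility.HtmlDecode(text).Replace('\u00A0', ' ');

        // 5. Collapse runs of whitespace (including newlines) and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string sample =
            "<html><head><style><!--\np.MsoNormal\n    {margin:0in}\n--></style></head>" +
            "<body><p class=\"MsoNormal\">&nbsp;</p><hr>\n" +
            "<p><b>Confidentiality Notice:</b> Hello</p></body></html>";

        Console.WriteLine(HtmlToText.Strip(sample)); // prints: Confidentiality Notice: Hello
    }
}
```

Regex-based stripping is still only an approximation of HTML parsing. If the emails can contain arbitrary markup, an HTML parser library is likely to be more robust than any pattern.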
Thomas Ayoub