I'm using Mailgun's awesome Inbound Routing to parse my incoming emails and strip out the HTML and email signatures, which leaves me with the plain body text.

Below is a small example of what is returned:

{
  "stripped-html": "<html><body><div style=\"font-family: Helvetica; font-size: 13px;\">Testing with <b>bold<\/b>&#160;and <u>stuff<\/u><br><\/div><div style=\"font-family: Helvetica; font-size: 13px;\"><u><br><\/u><\/div><div style=\"font-family: Helvetica; font-size: 13px;\">:)<\/div>&#13;\n                <div><div><br><\/div><div>--&#160;<\/div><div>Tim Smith<\/div><div><br><\/div><\/div>&#13;\n                 &#13;\n                <p style=\"color: #A0A0A8;\"><\/p>&#13;\n                <div>&#13;\n                    <br><\/div><\/body><\/html>",
  "stripped-text": "Testing with bold and stuff\n\n:)",
  "stripped-signature": "-- \nTim Smith"
}

What I want to do is take the plain stripped-text but also replicate basic formatting like bold, italic, and underlined. In this example the word "bold" is bold and the word "stuff" is underlined.

What would be the best way to tackle this?

brandonscript
floatleft
    You'd be better off parsing the HTML than the stripped text, because the stripped text has no information about what should or could be formatted. – brandonscript Dec 16 '14 at 05:06

1 Answer

I would take the "stripped-html" string and sanitize it first; that way you get rid of the escape sequences (such as \/ and &#160;).
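A minimal sketch of that sanitizing step, written in Python with only the standard library. It assumes the payload arrives as a JSON string like the one in the question; the shortened `payload` literal and the `clean_html` helper name are illustrative and not part of Mailgun's API.

```python
import html
import json

# Illustrative payload, shortened from the example in the question.
payload = r'{"stripped-html": "<div>Testing with <b>bold<\/b>&#160;and <u>stuff<\/u><br><\/div>"}'


def clean_html(raw_json: str) -> str:
    """Decode the JSON escapes first, then the HTML character references."""
    # json.loads turns \/ into / and \" into ", and \n into a real newline.
    markup = json.loads(raw_json)["stripped-html"]
    # html.unescape turns references such as &#160; and &#13; into characters.
    markup = html.unescape(markup)
    # Swap non-breaking spaces for plain ones and remove carriage returns.
    return markup.replace("\u00a0", " ").replace("\r", "")


print(clean_html(payload))
```

After this step the string is ordinary HTML, which is easier to feed to either a regular expression or a parser.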

Then you could do one of two things:

  1. Run a regular expression that matches the styling tags (for example <b>, <i>, and <u>) and disregards all other content. Be aware that the same formatting can be expressed in other ways: <u> was deprecated in HTML 4 in favor of CSS, and bold or italic text may also appear as <strong> and <em>, or as a CSS style property (font-style: italic). For example, first match the outer HTML with (<\w* \w*=".*?">)(.*)(<\/\w*>), then recursively look for bold and other elements with patterns such as <b>(.*?)</b>, which matches b tags with no further attributes.

Once you have replaced all b tags with your bold markup, you can simply move on to the next tag.

  2. Use a parser: for example, http://simplehtmldom.sourceforge.net/ could be a good start if you want to use PHP.
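Both options can be sketched briefly. The following uses Python's standard library instead of PHP and simplehtmldom, purely as an illustration; the Markdown-like `MARKERS` table, the function names, and the `sample` string are assumptions and not anything Mailgun provides. The regex variant is a simplified take on the patterns above and only handles styling tags without attributes.

```python
import re
from html.parser import HTMLParser

# Hypothetical output markers (Markdown-like); use whatever your application renders.
MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*", "u": "__"}


def format_with_regex(markup: str) -> str:
    """Option 1: rewrite attribute-free styling tags, then drop all other tags."""
    for tag, marker in MARKERS.items():
        pattern = rf"<{tag}>(.*?)</{tag}>"
        # A function replacement avoids any escaping issues in the marker text.
        markup = re.sub(
            pattern,
            lambda m, k=marker: f"{k}{m.group(1)}{k}",
            markup,
            flags=re.S,
        )
    # Anything still in angle brackets is not a styling tag, so remove it.
    return re.sub(r"<[^>]+>", "", markup).strip()


class FormattingExtractor(HTMLParser):
    """Option 2: let a parser walk the tags and keep only text plus markers."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in MARKERS:
            self.parts.append(MARKERS[tag])
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in MARKERS:
            self.parts.append(MARKERS[tag])
        elif tag in ("div", "p"):
            # Treat the end of a block element as a line break.
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def format_with_parser(markup: str) -> str:
    extractor = FormattingExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts).strip()


sample = '<div style="font-size: 13px;">Testing with <b>bold</b> and <u>stuff</u></div>'
print(format_with_regex(sample))
print(format_with_parser(sample))
```

On this sample, both functions print `Testing with **bold** and __stuff__`. The parser version is likely to hold up better on real email markup (nested tags, attributes on styling tags), which matches the comment on the question. Note that "stripped-html" still contains the signature, so that would need to be removed separately.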
API_sheriff_orlie