Is there a good npm package that can remove unnecessary nested tags from an HTML string on a Node.js server (no browser DOM)? I've tried sanitize-html, but it doesn't seem to support this.

I receive email HTML from users, so I can't control the input format, and it sometimes arrives with unnecessary nested tags, like so:

<div>
  <div>
    <div>
      <div>
        <div>Hey Bob:<br /></div>
        <div>
          I wanted to see if you had a chance to review this. Three things come to mind:<br />
        </div>
        <ol>
          <li>blah<br /></li>
          <li>blah<br /></li>
          <li>blah<br /></li>
        </ol>
      </div>
    </div>
  </div>
</div>

I want to unwrap the outer divs (and any other unnecessarily wrapped tags within the string) until I just have a result that looks like:

<div>
  <div>Hey Bob:<br /></div>
  <div>
    I wanted to see if you had a chance to review this. Three things come to mind:<br />
  </div>
  <ol>
    <li>blah<br /></li>
    <li>blah<br /></li>
    <li>blah<br /></li>
  </ol>
</div>

I tried using cheerio and jsdom, but neither seems to have an unwrap function like BeautifulSoup does in Python.

  • The easiest way I can think of would be to remove tabs and newlines, and then replace any tag that is repeated back-to-back with a single copy. It would take a bit of regex. – Jhecht Nov 28 '19 at 03:46

1 Answer

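I'm not sure which package can do this, but for the markup in your example it can be done with plain JavaScript, provided a DOM is available (in a browser, or on Node.js through a DOM library such as jsdom). Note that it relies on the first inner block starting with text followed by a <br>, as in your sample: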

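// `document` must come from a DOM (a browser, or a DOM library on Node.js).
const bodyNode = document.querySelector("body");

// Follow the chain of first element children until reaching an element
// whose first element child is a <br>, then return its parent's markup.
function ParseHtml(node)
{
    const child = node.firstElementChild;
    if (child === null)
    {
        // No <br>-led element on this path: return the markup unchanged.
        return node.outerHTML;
    }
    if (child.nodeName === 'BR')
    {
        return node.parentNode.outerHTML;
    }
    return ParseHtml(child);
}

console.log(ParseHtml(bodyNode));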
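
For a more general take that needs no DOM at all, here is a minimal dependency-free sketch (not the code above, and not any library's API). It assumes the input is a well-formed fragment with no comments, doctype, or `>` inside attribute values. The names `parseFragment`, `collapse`, `serialize`, and `unwrapNested` are hypothetical helpers made up for illustration. The rule it applies: an element with no attributes whose only non-whitespace child is an element of the same tag gets replaced by that child.

```javascript
// Tags that never take children or a closing tag (partial list, an assumption).
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link"]);

// Tokenize a well-formed fragment into a tiny tree of
// { tag, attrs, children } elements and { text } nodes.
function parseFragment(html) {
  const root = { tag: null, attrs: "", children: [] };
  const stack = [root];
  const tokenRe = /<\/([a-zA-Z][\w-]*)\s*>|<([a-zA-Z][\w-]*)([^>]*?)(\/?)>|([^<]+)/g;
  let m;
  while ((m = tokenRe.exec(html)) !== null) {
    const parent = stack[stack.length - 1];
    if (m[1]) {
      // Closing tag: assume it matches the currently open element.
      if (stack.length > 1) stack.pop();
    } else if (m[2]) {
      const tag = m[2].toLowerCase();
      const node = { tag, attrs: m[3].trim(), children: [] };
      parent.children.push(node);
      // Void and self-closed tags do not open a new nesting level.
      if (!VOID_TAGS.has(tag) && m[4] !== "/") stack.push(node);
    } else {
      parent.children.push({ text: m[5] });
    }
  }
  return root;
}

function isBlank(node) {
  return node.text !== undefined && node.text.trim() === "";
}

// Bottom-up: drop an attribute-less wrapper whose only significant
// child is an element with the same tag name.
function collapse(node) {
  if (node.text !== undefined) return node;
  node.children = node.children.map(collapse);
  const significant = node.children.filter((c) => !isBlank(c));
  const only = significant.length === 1 ? significant[0] : null;
  if (only && only.tag === node.tag && node.attrs === "") return only;
  return node;
}

function serialize(node) {
  if (node.text !== undefined) return node.text;
  const inner = node.children.map(serialize).join("");
  if (node.tag === null) return inner; // synthetic root
  const open = node.attrs ? `<${node.tag} ${node.attrs}>` : `<${node.tag}>`;
  return VOID_TAGS.has(node.tag) ? open : `${open}${inner}</${node.tag}>`;
}

function unwrapNested(html) {
  return serialize(collapse(parseFragment(html)));
}

const sample =
  "<div><div><div><div>Hey Bob:<br /></div><ol><li>blah<br /></li></ol></div></div></div>";
console.log(unwrapNested(sample));
```

Wrappers that carry attributes are left alone, since they may matter for styling, and void tags are written back as `<br>` rather than `<br />`. Real email HTML is rarely this tidy, so in practice the same collapse rule is probably better applied to the tree produced by a proper HTML parser than to this hand-rolled tokenizer.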
fYre