I need to perform some HTML cleanup.

My HTML has lots of redundant <br> tags. So far I have tried HtmlCleaner and jTidy, neither of which collapsed them.

Example:

<br>
<br>
<br>
<br>
...

What I would like is to get just a single <br> back.

Are there other ways to accomplish this without manually parsing line by line?

AlexVPerl

1 Answer

If you're only trying to remove superfluous <br/> tags, then I recommend a simple parsing state machine that uses Jericho to do the parsing, since Jericho is very good at preserving data.

The state machine would simply keep track of the last tag seen: if the last tag seen is a <br/> tag and the next tag is also a <br/> tag, you omit the new one. It's a pretty simple exercise that I recommend you try. I don't recommend manual text parsing (i.e. not using an HTML parser), though, as it's very error-prone.
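A minimal sketch of that state machine, assuming the HTML has already been split into tokens (tags and text segments) by a parser. The class `BrCollapser`, the helper `isBr`, and the plain `List<String>` token stream are hypothetical stand-ins for illustration; no Jericho API is shown here. Treating whitespace-only text between two <br> tags as part of the same run is also an assumption, not something stated above.

```java
import java.util.ArrayList;
import java.util.List;

public class BrCollapser {

    // Hypothetical helper: true for <br>, <br/>, <br /> in any letter case.
    public static boolean isBr(String token) {
        return token.matches("(?i)<br\\s*/?>");
    }

    // Keeps the first <br> of each consecutive run and omits the rest.
    public static List<String> collapse(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean lastTagWasBr = false; // the only state the machine keeps
        for (String token : tokens) {
            if (isBr(token)) {
                if (lastTagWasBr) {
                    continue; // repeated <br>: omit it
                }
                lastTagWasBr = true;
            } else if (!token.trim().isEmpty()) {
                // Any other tag or non-blank text ends the run.
                lastTagWasBr = false;
            }
            // Whitespace-only text is kept but does not end the run (assumption).
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the token stream an HTML parser would produce.
        List<String> tokens = List.of(
                "<p>", "one", "<br>", "\n", "<br/>", "<BR />", "two", "<br>", "</p>");
        System.out.println(String.join("", collapse(tokens)));
    }
}
```

With a real parser, the loop body would stay the same; only the source of the tokens and the way the output is written back would change.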

I would also like to remind you that, regardless of how people use <br/> tags, it is an explicit content tag, so removing it changes the content. Perhaps instead of scraping HTML you could get the content from a more structured source, such as an XML feed, a REST API, or a database.

Adam Gent
  • You're right that `<br>` is actually content. But I would argue that removing newlines is sometimes a good thing, just as one would trim leading or trailing whitespace from a string. – MC Emperor Nov 11 '14 at 17:10
  • Certain legal documents and regulated specifications require specific whitespace. As a user inputting documents, I would find it annoying if you stripped my explicit newlines. This is different than trimming input on a single field such as a title field. – Adam Gent Nov 11 '14 at 18:35
  • It is, @AdamGent. But even StackOverflow seems to do it. In this very comment, for example, newlines *are* removed from the comment. (I have newlines before and after *'for example'*. They are still there if I edit the comment, but they're not visible.) – MC Emperor Nov 11 '14 at 19:26