Remove redundant space in HTML in Java

Question

I need to perform some HTML cleansing.

I have HTML that contains lots of redundant `<br>` tags. So far I have tried HtmlCleaner and jTidy without any results.

Example:

<br>
<br>
<br>
<br>
...

What I would like is just to get a single `<br>` back.

Are there any other ways to accomplish this without manually parsing line by line?

It's basically just a bunch of `<br>` tags repeated; I want to replace them with a single `<br>`. Just added more detail to the question. – AlexVPerl, Nov 11 '14 at 01:18
You could send your HTML through an online minifier and then do a mass replace, e.g. http://www.willpeavy.com/minifier/ – austin wernli, Nov 11 '14 at 17:08

score 0 · Answer 1 · answered Nov 11 '14 at 17:05


If you're only trying to remove superfluous `<br>` tags, then I recommend a simple parsing state machine that uses Jericho to do the parsing, since Jericho is very good about preserving data.

The state machine would simply keep the last tag seen; if the last tag seen is a `<br>` tag and the next tag is also a `<br>` tag, you simply omit the next one. It's a pretty simple exercise that I recommend you try. I don't recommend manual text parsing (i.e. not using an HTML parser), though, as it's very error-prone.
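The state machine described above could be sketched roughly as follows. This is only an illustration in plain Java and does not use Jericho: the class name `BrCollapser`, the method `collapse`, and the regex tokenizer `TAG` are hypothetical stand-ins for what a real HTML parser would provide. It also assumes that `<br>` tags separated only by whitespace count as consecutive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: remembers whether the last tag seen was a <br>
// and omits any <br> that directly follows another <br>.
final class BrCollapser {
    // Crude tag tokenizer (assumption): ignores comments, scripts, CDATA,
    // and '>' inside attribute values. A real parser handles those cases.
    private static final Pattern TAG = Pattern.compile("<[^<>]*>");
    // Matches <br>, <br/>, <br /> and <BR class="x">, but not <break>.
    private static final Pattern BR = Pattern.compile("(?i)<br\\b[^<>]*>");

    static String collapse(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        boolean lastTagWasBr = false; // the only state the machine keeps
        int pos = 0;                  // end of the previous tag

        while (m.find()) {
            String text = html.substring(pos, m.start()); // text before this tag
            String tag = m.group();
            boolean isBr = BR.matcher(tag).matches();
            boolean onlyWhitespace = text.trim().isEmpty();

            if (isBr && lastTagWasBr && onlyWhitespace) {
                // Redundant <br>: skip it and the whitespace in front of it.
            } else {
                out.append(text).append(tag);
            }
            lastTagWasBr = isBr;
            pos = m.end();
        }
        out.append(html.substring(pos)); // trailing text after the last tag
        return out.toString();
    }
}
```

For example, `BrCollapser.collapse("a<br>\n<br>\n<BR/>b")` keeps only the first tag of the run, while `<br>` tags separated by real text or by other tags are left alone. With a parser such as Jericho, the same state would be driven by the parser's tag list instead of the regex.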

I would also like to remind you that, despite how people may use `<br>` tags, `<br>` is an explicit content tag, so removing it changes the content. Perhaps instead of scraping some HTML you could get the content from a more structured source such as an XML feed, a REST API, or a database.

answered Nov 11 '14 at 17:05

Adam Gent


You're right that `<br>` is actually content. But I would argue that removing newlines is sometimes a good thing, just as one would trim leading or trailing whitespace from a string. – MC Emperor Nov 11 '14 at 17:10
Certain legal documents and regulated specifications require specific whitespace. As a user inputting documents, I would find it annoying if you stripped my explicit newlines. This is different than trimming input on a single field like a title field. – Adam Gent Nov 11 '14 at 18:35
It is, @AdamGent. But even StackOverflow seems to do it. In this very comment, for example, newlines *are* removed from the comment. (I have newlines before and after *'for example'*. They are still there if I edit the comment, but they're not visible.) – MC Emperor Nov 11 '14 at 19:26
