Substring a String at 90% without breaking HTML using Java

Question

What would be the best way to write a method that takes a string containing HTML and chops off the last, let's say, 10% of the string without breaking any HTML tags?

The body and header tags are not part of the HTML string.

Also, the rounding should go upwards. Let's say that leaving the HTML untouched would shrink the cut from the last 10% to 5%; in that case the method should rather cut from the beginning of the tag and perform, say, a 15% cut.

I'm thinking of using Jsoup for this. The problem is that the string might not be enclosed by HTML elements. It might just be text with a couple of links in it.

Konrad Reiche · Accepted Answer · 2012-01-10T18:02:46.710

3

I think Jsoup is just the right tool: remove the elements from the bottom of the document and check the string length at every step until you reach a satisfying number.

To remove the elements one by one you could use the remove method, then compare the original string length with the current string length of the HTML document. I do not see any efficiency problem there.
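The loop described above can be sketched in plain Java without any library. This is only an illustration of the idea, not the Jsoup code itself: the class `HtmlTrimmer`, its `trim` method and the `keepRatio` parameter are hypothetical names, the regex tokenizer stands in for a real parser, and the input is assumed to be well-formed HTML with no `>` inside attribute values. With Jsoup, the same steps would operate on the parsed document and its remove method instead of on a token list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; a library-free stand-in for the Jsoup-based loop. */
final class HtmlTrimmer {

    // A token is either a whole tag ("<...>") or a run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");
    private static final Pattern TAG_NAME = Pattern.compile("</?\\s*([A-Za-z0-9]+)");

    /** Removes trailing leaf nodes until at most keepRatio of the input length remains. */
    public static String trim(String html, double keepRatio) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(m.group());
        }
        int target = (int) Math.floor(html.length() * keepRatio);

        // Re-check the serialized length after every removal, as the answer suggests.
        while (String.join("", tokens).length() > target) {
            // Skip trailing closing tags: they close elements that are still in use.
            int i = tokens.size() - 1;
            while (i >= 0 && tokens.get(i).startsWith("</")) {
                i--;
            }
            if (i < 0) {
                break; // only closing tags left (malformed input); stop
            }
            String last = tokens.get(i);
            // An opening tag directly followed by its own closing tag is an
            // empty element, so both tags are dropped together.
            boolean emptyElement = last.startsWith("<")
                    && i + 1 < tokens.size()
                    && !tagName(last).isEmpty()
                    && tagName(last).equals(tagName(tokens.get(i + 1)));
            if (emptyElement) {
                tokens.remove(i + 1);
            }
            tokens.remove(i); // a text run, a void tag, or the opening tag
        }
        return String.join("", tokens);
    }

    private static String tagName(String tag) {
        Matcher m = TAG_NAME.matcher(tag);
        return m.find() ? m.group(1).toLowerCase() : "";
    }
}
```

For example, `HtmlTrimmer.trim("<p>Hello <a href=\"x\">link</a> world</p><p>tail</p>", 0.9)` first drops the text `tail`, then the now empty `<p></p>`, and returns the first paragraph unchanged. Because whole nodes are removed, the cut always rounds upwards, here from 10% to 22% of the input.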

edited Jan 10 '12 at 18:02

answered Jan 10 '12 at 17:51


@MatBanik Simply compare the length of the resulting String using the `toString` method with the length of the original String representing the HTML document. – Konrad Reiche Jan 10 '12 at 18:01
Try running the W3C Validator (http://validator.w3.org/); maybe Jsoup cannot spot elements when their parents are broken? – Konrad Reiche Jan 10 '12 at 18:32
