Java Library to truncate html strings?

Question

I need to truncate an HTML string that my app already sanitized before storing it in the DB; it contains only links, images & formatting tags. When it is presented to users, it needs to be truncated to give an overview of the content.

So I need to abbreviate html strings in java such that

<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />   
<br/><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />

when truncated does not return something like this

<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />   
<br/><a href="htt

but instead returns

<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />   
<br/>

I'm unclear on your specification. Do you simply want to remove all anchor ("<a>") tags? — markspace, Feb 17 '15 at 17:11
What do you mean with *truncate*? You should be specific in what you wish to remove. — Willem Van Onsem, Feb 17 '15 at 17:12
yes all invalid & broken html fragments should be cleaned up — Rajat Gupta, Feb 17 '15 at 17:12
@CommuSoft: By truncate I mean I need to get a substring of some length from the html string — Rajat Gupta, Feb 17 '15 at 17:13
So, your second example is what is actually in your data, and you need a way to remove XHTML that doesn't parse correctly? — markspace, Feb 17 '15 at 17:13
Honestly I have no idea how to do that. I think you've got a difficult job to create something that can take broken XHTML and somehow parse it. There's a reason that XML has strict syntax rules: so it can be parsed. Break those rules and the result is not parseable. — markspace, Feb 17 '15 at 17:17
You could parse it with a tag soup-style parser like JSoup, but then it isn't going to remove things like a self-closing anchor tag, but rather "fix" it for you (treat it as a normally-closed anchor tag with no text). — David Conrad, Feb 17 '15 at 17:21
Or use a SAX parser that lazily reads the HTML/XML and thus does not cache everything first... — Willem Van Onsem, Feb 17 '15 at 17:21
@CommuSoft: Could you please be a bit more clear ? I need a high performance solution for processing large no of html strings. It does not have to be a lazy implementation. — Rajat Gupta, Feb 17 '15 at 17:54
The point is that a DOM parser first reads the entire XML code, then tries to parse it into a tree. A SAX parser reads top-to-bottom (if you need to go to the bottom anyway) and throws away everything it can't understand. — Willem Van Onsem, Feb 17 '15 at 17:56
Truncating HTML code by string length doesn't seem to make sense. Truncating by number of specific elements (e.g. a maximum of one image) seems to be more appropriate. — xehpuk, Feb 22 '15 at 20:25

score 2 · Answer 1 · answered Feb 22 '15 at 10:56

Your requirements are a bit vague, even after reading all the comments. Given your example and explanations, I assume your requirements are the following:

The input is a string consisting of (x)html tags. Your example doesn't contain this, but I assume the input can contain text between the tags.
In the context of your problem, we do not care about nesting. So the input is really only text intermingled with tags, where opening, closing and self-closing tags are all considered equivalent.
Tags can contain quoted values.
You want to truncate your string such that the string is not truncated in the middle of a tag. So in the truncated string every '<' character must have a corresponding '>' character.

I'll give you two solutions, a simple one which may not be correct, depending on what the input looks like exactly, and a more complex one which is correct.

First solution

For the first solution, we first find the last '>' character before the truncate size (this corresponds to the last tag which was completely closed). After this character may come text which does not belong to any tag, so we then search for the first '<' character after the last closed tag. In code:

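public static String truncate1(String input, int size)
{
    if (input.length() < size) return input;

    // Last '>' within the first `size` characters: end of the last complete tag.
    // Start at size - 1 so a '>' sitting exactly at index `size` is not counted.
    int pos = input.lastIndexOf('>', size - 1);
    // First '<' after that tag: start of the next tag, if any.
    int pos2 = input.indexOf('<', pos + 1);

    if (pos2 < 0 || pos2 >= size) {
        // The cut falls in plain text, so no tag is split.
        return input.substring(0, size);
    }
    else {
        // The cut falls inside a tag: drop that tag entirely.
        return input.substring(0, pos2);
    }
}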

Of course, this solution does not account for quoted attribute values: the '<' and '>' characters might occur inside a quoted string, in which case they should be ignored. I mention the solution anyway because you say your input is sanitized, so possibly you can ensure that the quoted strings never contain '<' and '>' characters.

Second solution

To consider the quoted strings, we cannot rely on standard Java classes anymore, but we have to scan the input ourselves and remember if we are currently inside a tag and inside a string or not. If we encounter a '<' character outside of a string, we remember its position, so that when we reach the truncate point we know the position of the last opened tag. If that tag wasn't closed, we truncate before the beginning of that tag. In code:

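public static String truncate2(String input, int size)
{
    if (input.length() < size) return input;

    int lastTagStart = 0;      // position of the most recent '<' that opened a tag
    boolean inString = false;  // inside a double-quoted attribute value
    boolean inTag = false;     // between '<' and its closing '>'

    for (int pos = 0; pos < size; pos++) {
        switch (input.charAt(pos)) {
            case '<':
                if (!inString && !inTag) {
                    lastTagStart = pos;
                    inTag = true;
                }
                break;
            case '>':
                if (!inString) inTag = false;
                break;
            case '\"':
                // Only double quotes are tracked; single-quoted values are not.
                if (inTag) inString = !inString;
                break;
        }
    }
    // If the cut falls outside a tag, keep all `size` characters;
    // otherwise cut just before the unfinished tag.
    if (!inTag) lastTagStart = size;
    return input.substring(0, lastTagStart);
}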
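
The scanner above toggles its string state only on double quotes. As a minimal sketch, assuming attribute values may also be single-quoted, the same scan can remember which quote character opened the value. The class name QuoteAwareTruncator and the sample input are made up for illustration and are not part of the original answer.

```java
// Hypothetical helper, not from the original answer: same idea as truncate2,
// extended to single-quoted attribute values.
final class QuoteAwareTruncator {

    /** Returns a prefix of input, at most size characters long, that does not end inside a tag. */
    static String truncate(String input, int size) {
        if (size <= 0) return "";
        if (input.length() <= size) return input;

        int lastTagStart = 0;   // index of the '<' that opened the current tag
        boolean inTag = false;  // between '<' and its closing '>'
        char quote = 0;         // 0 = not in a quoted value, else the opening quote character

        for (int pos = 0; pos < size; pos++) {
            char c = input.charAt(pos);
            if (!inTag) {
                if (c == '<') {
                    inTag = true;
                    lastTagStart = pos;
                }
            } else if (quote != 0) {
                // Inside a quoted value: only the matching quote ends it.
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                inTag = false;
            }
        }
        // Cut before an unfinished tag, otherwise keep all `size` characters.
        return inTag ? input.substring(0, lastTagStart) : input.substring(0, size);
    }

    public static void main(String[] args) {
        // Sample input (an assumption): a '>' inside a single-quoted value
        // and an apostrophe inside a double-quoted value.
        String html = "<img src='a>b.jpg' alt=\"it's\"><br/><a href='http://example.com/x'>link</a>";
        System.out.println(truncate(html, 40));
    }
}
```

With a limit of 40 the cut falls inside the anchor tag, so the sketch drops that tag and keeps the image and line break. Like truncate2, it ignores nesting, so an element opened before the cut is not closed in the result.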

simbo1905 · Answer 2 · 2015-02-22T19:47:23.483

A robust way of doing it is to use the hotsax code, which parses HTML and lets you interface with the parser through the traditional low-level SAX XML API. (Note that it is not an XML parser: it parses poorly formed HTML and only chooses to let you interface with it using a standard XML API.)

Here on github I have created a working quick-and-dirty example project which has a main class that parses your truncated example string:

    XMLReader parser = XMLReaderFactory.createXMLReader("hotsax.html.sax.SaxParser");

    final StringBuilder builder = new StringBuilder();

    ContentHandler handler = new DoNothingContentHandler(){

        StringBuilder wholeTag = new StringBuilder();
        boolean hasText = false;
        boolean hasElements = false;
        String lastStart = "";

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            String text = (new String(ch, start, length)).trim();
            wholeTag.append(text);
            hasText = true;
        }

        @Override
        public void endElement(String namespaceURI, String localName,
                String qName) throws SAXException {
            if( !hasText && !hasElements && lastStart.equals(localName)) {
                builder.append("<"+localName+"/>");
            } else {
                wholeTag.append("</"+ localName +">");
                builder.append(wholeTag.toString());
            }

            wholeTag = new StringBuilder();
            hasText = false;
            hasElements = false;
        }

        @Override
        public void startElement(String namespaceURI, String localName,
                String qName, Attributes atts) throws SAXException {
            wholeTag.append("<"+ localName);
            for( int i = 0; i < atts.getLength(); i++) {
                wholeTag.append(" "+atts.getQName(i)+"='"+atts.getValue(i)+"'");
                hasElements = true;
            }
            wholeTag.append(">");
            lastStart = localName;
            hasText = false;
        }

    };
    parser.setContentHandler(handler);

    //parser.parse(new InputSource( new StringReader( "<div>this is the <em>end</em> my <br> friend <a href=\"whatever\">some link</a>" ) ));
    parser.parse(new InputSource( new StringReader( "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n<br/><a href=\"htt" ) ));

    System.out.println( builder.toString() );

It outputs:

<img src='http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg'></img><br/>

It adds an </img> tag, but that's harmless for HTML, and it would be possible to tweak the code so the output matches the input exactly if you felt that necessary.

Hotsax is actually generated code, produced by running yacc/flex compiler tools over the HtmlParser.y and StyleLexer.flex files, which define the low-level grammar of HTML. So you benefit from the work of the person who created that grammar; all you need to do is write some fairly trivial code and test cases to reassemble the parsed fragments as shown above. That's much better than trying to write your own regular expressions, or worse, a hand-coded string scanner, to interpret the string, as that is very fragile.

Is it really so that those parsers will throw away the code that they do not understand? As far as I know, the default behavior is to ignore the last / next token until the current rule starts to make sense again. It's quite the opposite of what he asked for... — Martin Kersten, Feb 21 '15 at 21:43
How can code which takes the input he gave as an example and outputs what he said he wanted be "not what he asked for"? The rest of your comment about how a SAX parser works doesn't make sense. — simbo1905, Feb 22 '15 at 19:33
As far as I got it, he wanted to remove incomplete and malformed tags from the tag soup. I doubted that the way error / failure tolerant parsers usually work is quite the opposite by throwing away parts that do not fit. That's why I asked about it, unless it is implemented differently from what we are used to. — Martin Kersten, Feb 22 '15 at 20:29
Please delete your unhelpful comments and I will delete my responses to them. No one but you is interested in your mis-reading of his question, nor your ignorance of sax parsers, nor that you didn't bother to run my code. Move into questions and answers you understand don't treat SO as some chat forum. People contributing code as answers have no time to defend their answers from people who make no effort. Thanks. — simbo1905, Feb 22 '15 at 21:45
Why do you think I made no effort to understand and ask things? That is a lot of aggression and prejudgment put into a few words to come to that conclusion. What you provided is likely not to help. Please add this to your test: what does it print out? That is what he asked for, if I understood the question. If it works, then I am satisfied. Since you said the SAX stuff is generated from yacc/flex, I thought it comes with the default behavior for parsers like this. — Martin Kersten, Feb 23 '15 at 11:08
Why don't you run the code and tell me? My judgement that you made no effort was based on your poor answer (recently edited) and your poorly judged comments, where you offer opinion, not fact. I chose not to ignore you in order to give you feedback. That is done, and so now I will ignore you. — simbo1905, Feb 23 '15 at 19:27
Ok whatever. If someone can try his solution and check it if it works correctly for half complete tags in the middle please drop a comment. — Martin Kersten, Feb 23 '15 at 20:14
I now understand what he is up to. But for the damaged HTML stuff your solution would not work as intended. — Martin Kersten, Feb 23 '15 at 20:39
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/71528/discussion-between-simbo1905-and-martin-kersten). — simbo1905, Feb 23 '15 at 20:57
I believe, @MartinKersten made a valid point: what happens, if the truncation of originally valid html code occurs within nested elements, like "some long Tex" - does it properly close the tags (like "some ") or discard the complete input? — martin, Feb 26 '15 at 03:32
Actually that was not my point. I thought the original question was about transfering malformed HTML in snippets with wellformed HTML. But it was not. He was asking (as I understand it now) about how to cut Snippets out of a certain position and do not present incomplete tags. — Martin Kersten, Feb 26 '15 at 07:50

score 0 · Answer 3 · answered Feb 23 '15 at 20:46

Now that I understand what you want, here is the simplest solution I could come up with.

Just work from the end of your substring towards the start until you find '>'. This is the end mark of the last tag, so you can be sure that you only have complete tags in the majority of cases.

But what if the > is inside texts?

Well, to be sure about this, just search on until you find '<' and ensure it is part of a tag (do you know the tag names, for instance? Since you only have links, images and formatting, you can easily check this). If you find another '>' before finding a '<' that starts a tag, this is the new end of your string.

Easy to do, correct and should work for you.

If you are not certain whether strings / attributes can contain < or >, you need to check for the appearance of " and =" to determine whether you are inside a string or not. (Remember that the cut can fall inside an attribute value.) But I think this is overengineering. I have never found an attribute with < and > in it, and within text they are usually escaped as &lt; or something similar.
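
A minimal sketch of the backward search, under the assumption stated above that '<' and '>' appear only as tag delimiters (never inside attribute values, and escaped in text). The class name BackwardScanTruncator is made up for illustration. Note one deliberate difference: where the description above ends the result at the last '>', this variant also keeps text that follows the last complete tag.

```java
// Hypothetical helper illustrating the backward search; assumes '<' and '>'
// occur only as tag delimiters in the input.
final class BackwardScanTruncator {

    static String truncate(String html, int maxLength) {
        if (maxLength <= 0) return "";
        if (html.length() <= maxLength) return html;

        // Search backwards from the last character that would be kept.
        int lastOpen = html.lastIndexOf('<', maxLength - 1);
        int lastClose = html.lastIndexOf('>', maxLength - 1);

        // A '<' after the last '>' means the cut falls inside an unfinished tag,
        // so the result ends just before that tag.
        if (lastOpen > lastClose) return html.substring(0, lastOpen);

        // Otherwise the cut falls in text (or right after a tag) and is safe.
        return html.substring(0, maxLength);
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.jpg\" /><br/><a href=\"http://example.com\" />";
        System.out.println(truncate(html, 30));
    }
}
```

To end strictly on the last complete tag instead, return html.substring(0, lastClose + 1) in the final branch.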

score 0 · Answer 4 · answered Feb 26 '15 at 03:16

I don't know the context of the problem the OP needs to solve, but I am not sure if it makes a lot of sense to truncate html code by the length of its source code instead of the length of its visual representation (which can become arbitrarily complex, of course).

Maybe a combined solution could be useful, so you don't penalize html code with a lot of markup or long links, but also set a clear total limit which cannot be exceeded. Like others already wrote, the usage of a dedicated HTML parser like JSoup allows the processing of non well-formed or even invalid HTML.

The solution is loosely based on JSoup's Cleaner. It traverses the parsed dom tree of the source code and tries to recreate a destination tree while continuously checking, if a limit has been reached.

import org.jsoup.nodes.*;
import org.jsoup.parser.*;
import org.jsoup.select.*;

    String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />" +
                  "<br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";

    //String html = "<b>foo</b>bar<p class=\"baz\">Some <img />Long Text</p><a href='#'>hello</a>";

    Document srcDoc = Parser.parseBodyFragment(html, "");
    srcDoc.outputSettings().prettyPrint(false);

    Document dstDoc = Document.createShell(srcDoc.baseUri());
    dstDoc.outputSettings().prettyPrint(false);

    Element dst = dstDoc.body();

    NodeVisitor v = new NodeVisitor() {
        private static final int MAX_HTML_LEN = 85;
        private static final int MAX_TEXT_LEN = 40;

        Element cur = dst;
        boolean stop = false;
        int resTextLength = 0;

        @Override
        public void head(Node node, int depth) {
            // ignore "body" element
            if (depth > 0) {
                if (node instanceof Element) {
                    Element curElement = (Element) node;
                    cur = cur.appendElement(curElement.tagName());
                    cur.attributes().addAll(curElement.attributes());
                    String resHtml = dst.html();
                    if (resHtml.length() > MAX_HTML_LEN) {
                        cur.remove();
                        throw new IllegalStateException("html too long");
                    }
                } else if (node instanceof TextNode) {
                    String curText = ((TextNode) node).getWholeText();
                    String resHtml = dst.html();
                    if (curText.length() + resHtml.length() > MAX_HTML_LEN) {
                        cur.appendText(curText.substring(0, MAX_HTML_LEN - resHtml.length()));
                        throw new IllegalStateException("html too long");
                    } else if (curText.length() + resTextLength > MAX_TEXT_LEN) {
                        cur.appendText(curText.substring(0, MAX_TEXT_LEN - resTextLength));
                        throw new IllegalStateException("text too long");
                    } else {
                        resTextLength += curText.length();
                        cur.appendText(curText);
                    }
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
            if (depth > 0 && node instanceof Element) {
                cur = cur.parent();
            }
        }
    };

    try {
        NodeTraversor t = new NodeTraversor(v);
        t.traverse(srcDoc.body());
    } catch (IllegalStateException ex) {
        System.out.println(ex.getMessage());
    }

    System.out.println(" in='" + srcDoc.body().html() + "'");
    System.out.println("out='" + dst.html() + "'");

For the given example with max length of 85, the result is:

html too long
 in='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"></a>'
out='<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg"><br>'

It also correctly truncates within nested elements, for a max html length of 16 the result is:

html too long
 in='<i>f<b>oo</b>b</i>ar'
out='<i>f<b>o</b></i>'

For a maximum text length of 2, the result of a long link would be:

text too long
 in='<a href="someverylonglink"><b>foo</b>bar</a>'
out='<a href="someverylonglink"><b>fo</b></a>'

score 0 · Answer 5 · edited Feb 26 '15 at 16:41

You can achieve this with the jsoup library, an HTML parser.

You can download it from the link below.

Download JSOUP

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class HTMLParser 
{
    public static void main(String[] args)
    {
        String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" /><br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";

        Document doc = Jsoup.parse(html);
        doc.select("a").remove();

        System.out.println(doc.body().children());
    }
}

edited Feb 26 '15 at 16:41

Yevgen

answered Feb 26 '15 at 08:13

I've been working with jsoup, too and was pretty amazed how good it works. – Alexander Feb 26 '15 at 17:51
Sorry, but i don't see how that answers the OPs problem. In my understanding, @user01 wanted to truncate (eg. shorten) a given html snippet to a given length while preserving the correctness of html code. Instead, this code selects and removes certain elements. How does it decide, which elements to remove? Why "a"? What if the was shorter or there was no ? – martin Feb 27 '15 at 16:23

Martin Kersten · Answer 6 · 2015-02-23T13:27:10.770

Well, whatever you want to do, there are two libraries out there, jSoup and HtmlParser, which I tend to use. Please check them out. Also, I barely see XHTML in the wild anymore. It's more about HTML5 (which does not have an XHTML counterpart) nowadays.

[Update]

I mention JSoup and HtmlParser since they are fault tolerant in the way a browser is. Please check whether they suit you, since they are very good at dealing with malformed and damaged HTML text. Create a DOM out of your HTML and write it back to a string, and you should get rid of the damaged tags; you can also filter the DOM yourself and remove even more content if you have to.

PS: I guess the XML decade is finally (and gladly) over. Today JSON is going to be overused.

score -1 · Answer 7 · answered Feb 26 '15 at 07:55

A third answer I would consider as a potential solution is not to work with strings in the first place.

If I remember correctly, there are DOM tree representations that work closely with the underlying string representation and are therefore character exact. I wrote one myself, but I think jSoup has such a mode. Since there are a lot of parsers out there, you should be able to find one that actually does.

With such a parser you can easily see which tag runs from which string position to which. Such parsers keep a single string of the document and alter it, but store only range information such as start and stop positions within the document, which avoids duplicating that information for nested nodes.

Therefore you can find the outermost node for a given position, know exactly where it starts and ends, and easily decide whether this tag (including all its children) can be presented within your snippet. So you get the chance to print complete text nodes and the like, without the risk of presenting only partial tag information, headline text and so on.

If you do not find a parser that suits you for this, you can ask me for advice.
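
To make the range idea concrete, here is a rough sketch that uses no parser library. It records a flat list of source ranges (one per tag and one per run of text) and cuts at the end of the last range that fits entirely within the limit. This is a simplification of the approach described above: it does not build a DOM tree or handle nesting, it assumes '<' and '>' occur only as tag delimiters, and the names RangeTruncator and Range are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a flat list of source ranges instead of a real DOM.
final class RangeTruncator {

    /** Source range of one token (a tag or a run of text); end is exclusive. */
    static final class Range {
        final int start;
        final int end;
        final boolean tag;

        Range(int start, int end, boolean tag) {
            this.start = start;
            this.end = end;
            this.tag = tag;
        }
    }

    /** Splits the input into tag and text ranges without copying any characters. */
    static List<Range> scan(String html) {
        List<Range> ranges = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            if (html.charAt(pos) == '<') {
                int close = html.indexOf('>', pos);
                if (close < 0) break; // unterminated tag: ignore the rest
                ranges.add(new Range(pos, close + 1, true));
                pos = close + 1;
            } else {
                int next = html.indexOf('<', pos);
                if (next < 0) next = html.length();
                ranges.add(new Range(pos, next, false));
                pos = next;
            }
        }
        return ranges;
    }

    /** Keeps only whole tags and whole text runs that end within maxLength. */
    static String truncate(String html, int maxLength) {
        int cut = 0;
        for (Range r : scan(html)) {
            if (r.end > maxLength) break;
            cut = r.end;
        }
        return html.substring(0, cut);
    }

    public static void main(String[] args) {
        System.out.println(truncate("<b>hello</b> world", 10));
    }
}
```

Because a text run is either kept whole or dropped, the result can be noticeably shorter than the limit when the input contains long stretches of text.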
