
When we shorten/truncate textual content, we usually just cut it at a specific character index. That's already complicated in HTML, but I want to truncate my HTML content (generated using a content-editable div) using different measures:

  1. I would define a character index N that serves as the truncation limit
  2. The algorithm checks whether the content is at least N characters long (text only; not counting tags); if it's not, it just returns the whole content
  3. It then checks the range from character position N-X to N+X (text only) and searches for ends of block nodes; X is a predefined offset value, likely about N/5 to N/4
  4. If several block nodes end within this range, the algorithm selects the one that ends closest to the limit index N
  5. If no block node ends within this range, it finds the word boundary within the same range that is closest to N and truncates at that position
  6. It returns the truncated content as valid HTML (all tags closed at the end)
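
Steps 2 to 5 can be sketched independently of any HTML parser. The snippet below is only an illustration under assumptions: `TruncationPoint.Choose` is a hypothetical helper, `text` is the tag-free text of the content, and `blockEnds` is assumed to hold the text-only offsets at which eligible block elements end (collected beforehand by whatever parser is used). The final fallback of cutting exactly at N when the range holds neither a block end nor a word boundary is also an assumption, since the steps above do not cover that case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TruncationPoint
{
    // Returns the text-only index at which to cut, or text.Length when no cut is needed.
    public static int Choose(string text, IEnumerable<int> blockEnds, int n, int x)
    {
        // Step 2: content shorter than N is returned whole.
        if (text.Length < n)
            return text.Length;

        int min = Math.Max(0, n - x);
        int max = Math.Min(text.Length, n + x);

        // Steps 3-4: prefer the block end closest to N within [min, max].
        var blocks = blockEnds.Where(e => e >= min && e <= max).ToList();
        if (blocks.Count > 0)
            return blocks.OrderBy(e => Math.Abs(e - n)).First();

        // Step 5: otherwise take the closest word boundary in the same range.
        // A boundary here is a whitespace character that directly follows a non-whitespace one.
        var words = Enumerable.Range(min, max - min + 1)
            .Where(i => i > 0 && i < text.Length
                        && char.IsWhiteSpace(text[i]) && !char.IsWhiteSpace(text[i - 1]))
            .ToList();
        if (words.Count > 0)
            return words.OrderBy(i => Math.Abs(i - n)).First();

        // Assumption: nothing suitable in range, so cut exactly at N.
        return n;
    }
}
```

The returned value is a text-only offset; mapping it back to a position in the markup and closing the open tags (step 6) is still left to the HTML parser.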

My content-editable generated content may consist of paragraphs (with line breaks), preformatted code blocks, block quotes, ordered and unordered lists, headers, bolds and italics (which are inline nodes and shouldn't count in the truncation process) etc. The final implementation will of course define which elements specifically are possible truncation candidates. Headers, even though they are block HTML elements, will not count as truncation points, as we don't want widowed headers. Paragraphs, individual list items, whole ordered and unordered lists, block quotes, preformatted blocks, void elements etc. are good ones. Headers and all inline elements aren't.

Example

Let's take this very Stack Overflow question as an example of HTML content that I would like to truncate. Let's set the truncation limit to 1000 with an offset of 250 characters (1/4).

This DotNetFiddle shows text of this question while also adding limit markers inside of it (|MIN| which represents character 750, |LIMIT| representing character 1000 and |MAX| that represents character 1250).

As can be seen from the example, the boundary between two block nodes that is closest to character 1000 is between </OL> and P (My content-editable generated...). This means that my HTML should be truncated right between these two tags. The result would be a little less than 1000 characters long text-wise, but the truncated content stays meaningful because it isn't cut somewhere in the middle of a text passage.

I hope this explains how things should be working related to this algorithm.

The problem

The first problem I'm seeing here is that I'm dealing with a nested structure like HTML. I also have to detect different elements (only block elements and no inline ones). And last but not least, I will have to count only certain characters in my string and ignore those that belong to tags.
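
To make these three concerns concrete, here is a minimal sketch that walks a nested tree, counts only text characters, and records the text-only offsets at which selected block elements end. It is an illustration under assumptions, not a finished solution: it uses `System.Xml.Linq` instead of HTML Agility Pack purely to stay dependency-free, so it only accepts well-formed XHTML; `BlockEndScanner` and its `CutAfter` whitelist are hypothetical names; and whitespace-only text between tags is dropped by the default parse options.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class BlockEndScanner
{
    // Hypothetical whitelist: elements whose end is an acceptable truncation point.
    // Headers and inline elements are deliberately left out.
    static readonly HashSet<string> CutAfter =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "p", "li", "ol", "ul", "blockquote", "pre" };

    // Returns the text-only length of the fragment and the text-only
    // offsets at which whitelisted elements end, in document order.
    public static (int TextLength, List<int> BlockEnds) Scan(XElement root)
    {
        var ends = new List<int>();
        int count = 0;
        Visit(root, ref count, ends);
        return (count, ends);
    }

    static void Visit(XElement element, ref int count, List<int> ends)
    {
        foreach (XNode node in element.Nodes())
        {
            if (node is XText text)
            {
                // Only text contributes to the character count; tags never do.
                count += text.Value.Length;
            }
            else if (node is XElement child)
            {
                Visit(child, ref count, ends);
                // Nested blocks may end at the same offset (e.g. the last li and its ul),
                // so the list can contain duplicates.
                if (CutAfter.Contains(child.Name.LocalName))
                    ends.Add(count);
            }
        }
    }
}
```

For example, scanning `<div><h1>Title</h1><p>Hello <b>world</b></p></div>` counts 16 text characters and reports a single block end at offset 16: the `h1` and `b` elements add to the count but are never recorded as truncation candidates.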

Possible solutions

  1. I could parse my content manually by creating some object tree representing content nodes and their hierarchy
  2. I could convert HTML to something easier to manage like markdown and then simply search for closest new line to my provided index N and convert back to HTML
  3. Use something like HTML Agility Pack and replace my #1 parsing with it and then somehow use XPath to extract block nodes and truncate content

Second thoughts

  • I'm sure I could make it work by doing #1, but it feels like I'm reinventing the wheel.
  • I don't think there's any C# library for #2, so I would have to do the HTML to Markdown conversion manually as well, or run e.g. pandoc as an external process.
  • I could use HAP as it's great at manipulating HTML, but I'm not sure whether my truncation would be simple enough using it. I'm afraid the bulk of the processing would still be outside HAP in my custom code.

How should one approach such truncation algorithm? My head just seems to be too tired to come to a consensus (or solution).

Robert Koritnik
  • There's no magic bullet for this of course, but I would use HAP. HAP can get you all texts with just one XPath: `//text()`. And then, each node also has an `XPath` property so you can walk back and forth in the tree from these text elements. The content of these text elements can be changed very easily using the `InnerHtml` property. Lastly, HAP will close unclosed elements automatically when outputting HTML. – Simon Mourier Jun 24 '15 at 06:21
  • @SimonMourier: Fancy showing some code in an answer? – Robert Koritnik Jun 29 '15 at 07:18
  • Do you have some sample input and expected output? – Simon Mourier Jun 29 '15 at 10:09
  • sorry, just wanna say something off topic. there is nothing wrong with reinventing the wheel, if you think you can make it better or simpler, why not? after all, we reach our current wheel from a wooden wheel centuries ago, :p – am05mhz Jun 30 '15 at 06:02

3 Answers


Here is some sample code that can truncate the inner text. It uses the recursive capability of the InnerText property and CloneNode method.

    public static HtmlNode TruncateInnerText(HtmlNode node, int length)
    {
        if (node == null)
            throw new ArgumentNullException("node");

        // nothing to do?
        if (node.InnerText.Length < length)
            return node;

        HtmlNode clone = node.CloneNode(false);
        TruncateInnerText(node, clone, clone, length);
        return clone;
    }

    private static void TruncateInnerText(HtmlNode source, HtmlNode root, HtmlNode current, int length)
    {
        HtmlNode childClone;
        foreach (HtmlNode child in source.ChildNodes)
        {
            // does this child still fit within the requested length?
            int expectedSize = child.InnerText.Length + root.InnerText.Length;
            if (expectedSize <= length)
            {
                // yes, just clone the whole hierarchy
                childClone = child.CloneNode(true);
                current.ChildNodes.Add(childClone);
                continue;
            }

            // is it a text node? then crop it
            HtmlTextNode text = child as HtmlTextNode;
            if (text != null)
            {
                int remove = expectedSize - length;
                childClone = root.OwnerDocument.CreateTextNode(text.InnerText.Substring(0, text.InnerText.Length - remove));
                current.ChildNodes.Add(childClone);
                return;
            }

            // it's not a text node, shallow clone and dive in
            childClone = child.CloneNode(false);
            current.ChildNodes.Add(childClone);
            TruncateInnerText(child, root, childClone, length);
        }
    }

And a sample C# console app that will scrape this question as an example, and truncate it to 500 characters.

    class Program
    {
        static void Main(string[] args)
        {
            var web = new HtmlWeb();
            var doc = web.Load("http://stackoverflow.com/questions/30926684/truncating-html-content-at-the-end-of-text-blocks-block-elements");
            var post = doc.DocumentNode.SelectSingleNode("//td[@class='postcell']//div[@class='post-text']");
            var truncated = TruncateInnerText(post, 500);
            Console.WriteLine(truncated.OuterHtml);
            Console.WriteLine("Size: " + truncated.InnerText.Length);
        }
    }

When run, it should display this:

<div class="post-text" itemprop="text">

<p>Mainly when we shorten/truncate textual content we usually just truncate it at specific character index. That's already complicated in HTML anyway, but I want to truncate my HTML content (generated using content-editable <code>div</code>) using different measures:</p>

<ol>
<li>I would define character index <code>N</code> that will serve as truncating startpoint <em>limit</em></li>
<li>Algorithm will check whether content is at least <code>N</code> characters long (text only; not counting tags); if it's not it will just return the whole content</li>
<li>It would then</li></ol></div>
Size: 500

Note: I have not truncated at word boundary, just at character boundary, and no, it's not at all following the suggestions in my comment :-)

Simon Mourier
  • What I'm after is not character, nor word boundary but rather **block element boundary**. So trimmed text only content may be shorter or longer than the specified limit but within some range `limit-offset < limit < limit + offset` as long as the block element's end is closest to `limit`. – Robert Koritnik Jun 30 '15 at 11:31
  • I don't understand what you mean. Maybe my answer does it, have you tried? Or please give a sample. – Simon Mourier Jun 30 '15 at 13:44
  • Yes I have tried and seen what your code does. As a consequence to apparently not explaining this too well I've edited my question while **also providing runnable fiddle** where you can actually see what and how content should be truncated. I've even used your code that loads this question. – Robert Koritnik Jul 02 '15 at 07:41
  • If you just do a `return` on `text != null`, maybe that will do what you want. I think that's a good start anyway. But your algorithm seems ambiguous to me. I'm not sure there is always a solution with your N & X things. For example, if I just have one big text of size 2000 with N set to 1000 and X set to 250, what should I do? Return a 0 length text? – Simon Mourier Jul 02 '15 at 08:18
  • See #5 in my question that covers this exact scenario that you're asking about. – Robert Koritnik Jul 02 '15 at 08:53
    private void RemoveEmpty(HtmlNode node)
    {
        var parent = node.ParentNode;
        node.Remove();
        if (parent == null)
            return;

        // remove parent if it is empty
        if (!parent.DescendantNodes().Any())
        {
            RemoveEmpty(parent);
        }
    }

    private void Truncate(HtmlNode root, int maxLimit)
    {
        var n = 0;

        // ToArray() takes a snapshot, so nodes can be removed while iterating
        foreach (var node in root.DescendantNodes()
             .OfType<HtmlTextNode>().ToArray())
        {
            var length = node.Text.Length;

            n += length;
            if (n + length >= maxLimit)
            {
                RemoveEmpty(node);
            }
        }
    }

// you are left with only nodes that add up to your max limit characters.
Akash Kava
  • But this is not what I was asking for, as you may truncate content at the end of `` which isn't correct. And it also isn't truncating to the closest of your `maxLimit`. You're always truncating on `>=maxLimit` even though some block element may end just a character before `maxLimit`. – Robert Koritnik Jun 29 '15 at 19:28
  • I have just shown a small sample, you have to modify this logic in order to suit your needs, it is difficult to know what you want without seeing any sample data. If you can show input and expected output, I can tweak it further. – Akash Kava Jun 30 '15 at 14:04
  • That's true. But apart from that your `if` condition (which should actually read `if (n > maxLimit)` as you're already adding `length` to `n` just before it) should also change as you're always truncating content to `<=maxLimit`. You're quite close actually if you'd change the `if` condition to check the current length and new length delta from `maxLimit`. It would then truncate correctly if any of these two would be less than offset `X` as defined in my question. – Robert Koritnik Jul 02 '15 at 09:06
  • But you're also having other problems with your code as you're changing enumeration during iteration which is a runtime error. The best way would be for you to write a [DotNetFiddle](http://dotnetfiddle.net) with HAP and see for yourself whether it works or not and how. – Robert Koritnik Jul 02 '15 at 09:20
  • I did ToArray before enumeration so I can modify it without any problems – Akash Kava Jul 02 '15 at 09:29
  • Right. I've left that out. But how do you get the resulting HTML then? You're removing from array so how do you access truncated content afterwards? – Robert Koritnik Jul 02 '15 at 10:26
  • I am not removing from array instead I am removing from parent, you can simply get root.innerHtml for modified html. – Akash Kava Jul 02 '15 at 10:28
  • Never mind the extra `if` conditions in [this code](https://dotnetfiddle.net/AdkYuj), but root node is still unchanged. It still holds the whole document. No nodes have been removed from it. – Robert Koritnik Jul 02 '15 at 10:31
  • Sure I will look and modify it by tomorrow. – Akash Kava Jul 02 '15 at 10:34
  • Problem is you are calling root.InnerHtml (HtmlAgilityPack caches this value) so even after you remove nodes, you will still get all the HTML, however, look at the last line at https://dotnetfiddle.net/dCi5ov , I added root.CloneNode(true).InnerHtml, this will give you fresh HTML with all nodes removed. – Akash Kava Jul 02 '15 at 17:52

I would walk the whole DOM tree and keep counting the number of text characters that appear. Whenever I hit the limit (N), I would erase the extra characters of that text node, and from there on I would just remove all text nodes.

I believe that is a safe way to keep all HTML+CSS structure while retaining only N characters.
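
A minimal sketch of that idea follows, under stated assumptions: it uses `System.Xml.Linq` rather than HAP to stay dependency-free (so the input must be well-formed XHTML), and `TextBudget.Trim` is a hypothetical helper name. It cuts at a plain character index and leaves emptied elements in place, so it does not honor block element boundaries.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TextBudget
{
    // Keeps at most `limit` text characters in document order;
    // the element structure itself is left untouched.
    public static void Trim(XElement root, int limit)
    {
        int remaining = limit;

        // ToList() takes a snapshot, because the tree is modified during the walk.
        foreach (XText text in root.DescendantNodes().OfType<XText>().ToList())
        {
            if (remaining <= 0)
            {
                // Budget used up: drop every further text node.
                text.Remove();
                continue;
            }

            // Crop the text node that crosses the limit.
            if (text.Value.Length > remaining)
                text.Value = text.Value.Substring(0, remaining);

            remaining -= text.Value.Length;
        }
    }
}
```

Trimming `<div><p>Hello <b>world</b></p><p>again</p></div>` to 8 characters keeps "Hello wo" and leaves the second paragraph in place but empty.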

Eduardo Ramos