14

Does anyone have a C# variation of this?

This is so I can take some HTML and display it, without breaking the markup, as a summary lead-in to an article.

Truncate text containing HTML, ignoring tags

Save me from reinventing the wheel!

Edit

Sorry, new here, and you're right, I should have phrased the question better. Here's a bit more info.

I wish to take an HTML string and truncate it to a set number of words (or even a character length) so I can then show the start of it as a summary (which then leads to the main article). I wish to preserve the HTML so I can show the links etc. in the preview.

The main issue I have to solve is the fact that we may well end up with unclosed HTML tags if we truncate in the middle of one or more tags!

The idea I have for a solution is to:

  1. truncate the HTML to N words (words are better, but chars are OK) first, being sure not to stop in the middle of a tag and cut off a required attribute

  2. work through the opened HTML tags in this truncated string (maybe stick them on a stack as I go?)

  3. then work through the closing tags and ensure they match the ones on the stack as I pop them off?

  4. if any open tags are left on the stack after this, write their closing tags to the end of the truncated string and the HTML should be good to go!
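
A minimal sketch of the four steps above, folded into a single pass. It assumes reasonably well-formed HTML with no '>' inside attribute values or comments. The names `HtmlSummary`, `TruncateWords` and the `VoidTags` list are hypothetical and only for illustration; this is not an existing library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlSummary
{
    // Assumption: a small, incomplete list of elements that never take a closing tag.
    private static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    public static string TruncateWords(string html, int maxWords)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();      // names of elements opened but not yet closed
        int words = 0;
        bool inWord = false;
        int i = 0;

        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break;          // unterminated tag at the end: drop it
                string tag = html.Substring(i, end - i + 1);
                output.Append(tag);          // tags are copied whole, so attributes are never cut
                Track(tag, open);
                i = end + 1;
                continue;
            }

            if (char.IsWhiteSpace(c))
            {
                inWord = false;
            }
            else if (!inWord)
            {
                if (words == maxWords) break; // the next word would exceed the limit
                words++;
                inWord = true;
            }
            output.Append(c);
            i++;
        }

        // Step 4: close whatever is still open, innermost element first.
        while (open.Count > 0) output.Append("</").Append(open.Pop()).Append('>');
        return output.ToString();
    }

    private static void Track(string tag, Stack<string> open)
    {
        if (tag.StartsWith("<!", StringComparison.Ordinal) || tag.StartsWith("<?", StringComparison.Ordinal)) return;
        bool closing = tag.StartsWith("</", StringComparison.Ordinal);
        int start = closing ? 2 : 1;
        string inner = tag.Substring(start, tag.Length - start - 1);  // text between '<' (or '</') and '>'
        int cut = inner.IndexOfAny(new[] { ' ', '\t', '\r', '\n', '/' });
        string name = cut >= 0 ? inner.Substring(0, cut) : inner;
        if (name.Length == 0) return;

        if (closing)
        {
            // Only pop when the closing tag matches the innermost open element.
            if (open.Count > 0 && string.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase)) open.Pop();
        }
        else if (!tag.EndsWith("/>", StringComparison.Ordinal) && !VoidTags.Contains(name))
        {
            open.Push(name);
        }
    }
}
```

Called as `HtmlSummary.TruncateWords(html, 3)`, it stops before the fourth word and appends closing tags for anything still on the stack. Entities such as `&amp;` are treated as ordinary characters here, which is fine for word counting but would need extra care for a character limit.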

Edit 12/11/2009

  • Here is what I have bumbled together so far as a unit test file in VS2008; this may help someone in future
  • My hack attempts based on Jan's code are at the top, a char version plus a word version (DISCLAIMER: this is dirty, rough code on my part!!)
  • I assume 'well-formed' HTML in all cases (but not necessarily a full document with a root node, as the XML version requires)
  • Abel's XML version is at the bottom, but I have not yet got round to getting the tests to run fully on it (plus I need to understand the code) ...
  • I will update when I get the chance to refine it
  • Having trouble posting code? Is there no upload facility on Stack?

Thanks for all comments :)

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace PINET40TestProject
{
    [TestClass]
    public class UtilityUnitTest
    {
        public static string TruncateHTMLSafeishChar(string text, int charCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrContent = 0;

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrContent == charCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag) cntrContent++;
            }

            string substr = text.Substring(0, cntr);

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            // to be honest, this seemed like a good idea then I got lost along the way 
            // so logic is probably hanging by a thread!! 
            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishWord(string text, int wordCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrWords = 0;
            Char lastc = ' ';

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrWords == wordCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag)
                {
                    // do not count double spaces, and a space not in a tag counts as a word
                    if (c == 32 && lastc != 32)
                        cntrWords++;
                    lastc = c;   // remember the previous visible character so repeated spaces count once
                }
            }

            string substr = text.Substring(0, cntr) + " ...";

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishCharXML(string text, int charCount)
        {
            // your data, probably comes from somewhere, or as params to a method
            XmlDocument xml = new XmlDocument();
            xml.LoadXml(text);
            // create a navigator, this is our primary tool
            XPathNavigator navigator = xml.CreateNavigator();
            XPathNavigator breakPoint = null;

            // find the text node we need:
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                string lastText = navigator.Value.Substring(0, Math.Min(charCount, navigator.Value.Length));
                charCount -= navigator.Value.Length;
                if (charCount <= 0)
                {
                    // truncate the last text. Here goes your "search word boundary" code:        
                    navigator.SetValue(lastText);
                    breakPoint = navigator.Clone();
                    break;
                }
            }

            // nothing to cut: the text is shorter than charCount, so return the document unchanged
            if (breakPoint == null)
            {
                navigator.MoveToRoot();
                return navigator.InnerXml;
            }

            // first remove text nodes, because Microsoft unfortunately merges them without asking
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();   // moves to parent
                }
            }

            // then remove the rest
            navigator.MoveTo(breakPoint);
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();   // moves to parent
                }
            }

            // then remove *all* empty nodes to clean up (not necessary):
            // TODO, add empty elements like <br />, <img /> as exclusion
            navigator.MoveToRoot();
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
                {
                    navigator.DeleteSelf();   // moves to parent
                }
            }

            navigator.MoveToRoot();
            return navigator.InnerXml;
        }

        [TestMethod]
        public void TestTruncateHTMLSafeish()
        {
            // Case where we just make it to start of HREF (so effectively an empty link)

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>1234</h1><b><i>56789</i>012</b>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><b><i>56789</i>012345</b>",
                12));

            // In middle of a!
            Assert.AreEqual(@"<h1>1234</h1><a href=""testurl""><b>567</b></a>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><a href=""testurl""><b>5678</b></a><i><strong>some italic nested in string</strong></i>",
                7));

            // more
            Assert.AreEqual(@"<div><b><i><strong>1</strong></i></b></div>",
            TruncateHTMLSafeishChar(
                @"<div><b><i><strong>12</strong></i></b></div>",
                1));

            // br
            Assert.AreEqual(@"<h1>1 3 5</h1><br />6",
            TruncateHTMLSafeishChar(
                @"<h1>1 3 5</h1><br />678<br />",
                6));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWord()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>one two <br /></h1><b><i>three  ...</i></b>",
            TruncateHTMLSafeishWord(
                @"<h1>one two <br /></h1><b><i>three </i>four</b>",
                3), "we have added ' ...' to end of summary");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWordXML()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            string output = TruncateHTMLSafeishCharXML(
                @"<body><h1>one two </h1><b><i>three </i>four</b></body>",
                13);
            // not a verbatim string: \r\n must be real line breaks to match the indented output
            Assert.AreEqual("<body>\r\n  <h1>one two </h1>\r\n  <b>\r\n    <i>three</i>\r\n  </b>\r\n</body>", output,
             "XML version, no ... yet and adds '\\r\\n + spaces' to format the document");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishCharXML(
                @"<body><h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i></body>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }
    }
}
WickedW
  • What if someone edits the post you now link to? IMO, it's better to describe your question as precisely as possible here, in your own post. – Bart Kiers Nov 11 '09 at 12:09
  • Thanks for the update, I'll workout a solution based on the current method ;-) – Abel Nov 11 '09 at 14:08
  • Thanks all, I am trying a dirty way here myself, but not sure what mileage it has. I'll post if I can get a decent workable solution – WickedW Nov 11 '09 at 14:54
  • If you're fed up with trying the dirty ways, I've tried to come up with a "clean way" algorithm, which gives enough room for expansion. It will correctly cut a node, regardless where it is. The good thing of using XML (as with XHTML) is that any mistakes you make will be caught by the system with a nice exception: early degradation principle. – Abel Nov 11 '09 at 16:10
  • Thanks Abel, trying to integrate your code. – WickedW Nov 12 '09 at 11:44
  • On *"having trouble with posting code? is there no upload facility on stack?"* >>> normally, we use the inline code facility, but you can place code online and link to it (same with images). StackOverflow is all about questions and answers. We strive to keep the question clean (i.e.: they are the *question* and should not include the *answer*). If there are problems with parts of the code, you'll have more chance of success asking in comments under the answers themselves, or asking a new question when it's about a specific unrelated (or separable) issue. – Abel Nov 12 '09 at 17:21
  • PS: note that some of your samples do not have a root node. HTML and XHTML will always have a root node (either `html` or `body`). To have them working with my code, make sure your (X)HTML tests are correct. – Abel Nov 18 '09 at 13:28

4 Answers

11

EDIT: See below for a full solution, this first attempt strips the HTML, the second does not

Let's summarize what you want:

  • No HTML in the result
  • It should take any valid data inside <body>
  • It has a fixed maximum length

If your HTML is XHTML this becomes trivial (and, while I haven't seen the PHP solution, I doubt very much they use a similar approach, but I believe this is understandable and rather easy):

using System.Diagnostics;
using System.Xml;

XmlDocument xml = new XmlDocument();

// replace the following line with the content of your full XHTML
xml.LoadXml(@"<body><p>some <i>text</i>here</p><div>that needs stripping</div></body>");

// Get all textnodes under <body> (twice "//" is on purpose)
XmlNodeList nodes = xml.SelectNodes("//body//text()");

// loop through the text nodes, replace this with whatever you like to do with the text
foreach (var node in nodes)
{
    Debug.WriteLine(((XmlCharacterData)node).Value);
}

Note: spaces etc will be preserved. This is usually a good thing.

If you don't have XHTML, you can use the HTML Agility Pack, which lets you do about the same for plain old HTML (it internally converts it to some DOM). I haven't tried it, but it should run rather smoothly.


BIG EDIT:

Actual solution

In a little comment I promised to take the XHTML / XmlDocument approach and use that for a typesafe method for splitting your HTML based on text length, while keeping the HTML code. I took the following HTML; the code breaks it correctly in the middle of "needs", removes the rest, removes empty nodes and automatically closes any open elements.

The sample HTML:

<body>
    <p><tt>some<u><i>text</i>here</u></tt></p>
    <div>that <b><i>needs <span>str</span>ip</i></b><s>ping</s></div>
</body>

The code, tested and working with any kind of input (ok, granted, I just did some tests and code may contain bugs, let me know if you find them!).

using System;
using System.Diagnostics;
using System.Xml;
using System.Xml.XPath;

// your data, probably comes from somewhere, or as params to a method
int lengthAvailable = 20;
XmlDocument xml = new XmlDocument();
xml.LoadXml(@"place-html-code-here-left-out-for-brevity");

// create a navigator, this is our primary tool
XPathNavigator navigator = xml.CreateNavigator();
XPathNavigator breakPoint = null;


string lastText = "";

// find the text node we need:
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
    lastText = navigator.Value.Substring(0, Math.Min(lengthAvailable, navigator.Value.Length));
    lengthAvailable -= navigator.Value.Length;

    if (lengthAvailable <= 0)
    {
        // truncate the last text. Here goes your "search word boundary" code:
        navigator.SetValue(lastText);
        breakPoint = navigator.Clone();
        break;
    }
}

// first remove text nodes, because Microsoft unfortunately merges them without asking
while (navigator.MoveToFollowing(XPathNodeType.Text))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then remove the rest
navigator.MoveTo(breakPoint);
while (navigator.MoveToFollowing(XPathNodeType.Element))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then remove *all* empty nodes to clean up (not necessary): 
// TODO, add empty elements like <br />, <img /> as exclusion
navigator.MoveToRoot();
while (navigator.MoveToFollowing(XPathNodeType.Element))
    while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
        navigator.DeleteSelf();  // moves to parent

navigator.MoveToRoot();
Debug.WriteLine(navigator.InnerXml);

How the code works

The code does the following things, in that order:

  1. It goes through all text nodes, until the text size expands beyond the allowed limit, in which case it truncates that node. This automatically deals correctly with &gt; etc as one character.
  2. It then shortens the text of the "breaking node" and resets it. It clones the XPathNavigator at this point as we need to remember this "breaking point".
  3. To work around an MS bug (an ancient one, actually), we first have to remove any remaining text nodes that follow the breaking point, otherwise we risk auto-merging of text nodes when they end up as siblings of each other. Note: DeleteSelf is handy, but moves the navigator position to its parent, which is why we need to check the current position against the "breaking point" position remembered in the previous step.
  4. Then we do what we wanted to do in the first place: remove any node following the breaking point.
  5. Not a necessary step: cleaning up the code and removing any empty elements. This action is merely to clean up the HTML and/or to filter for specific (dis)allowed elements. It can be left out.
  6. Go back to "root" and get the content as a string with InnerXml.

That's all, rather simple, though it may look a bit daunting at first sight.
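
For comparison, here is a sketch of the same idea (truncate the text node where the budget runs out, then drop everything after it) written with plain `XmlNode` recursion instead of `XPathNavigator`. It is not the code above: `XhtmlTruncator`, `Truncate` and `Prune` are hypothetical names, and the dummy `<root>` wrapper is an assumption added so that fragments without a single root element still load. Undeclared entities such as `&nbsp;` will still make `LoadXml` throw.

```csharp
using System;
using System.Xml;

public static class XhtmlTruncator
{
    // Truncates well-formed XHTML to maxChars characters of text, keeping the markup.
    public static string Truncate(string xhtmlFragment, int maxChars)
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;                        // keep spaces between inline elements
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");    // dummy root (assumption, see lead-in)
        int remaining = maxChars;
        Prune(doc.DocumentElement, ref remaining);
        return doc.DocumentElement.InnerXml;                  // serializer closes all kept elements
    }

    // Visits children in document order. Once the budget is used up,
    // every later node is removed from its parent.
    private static void Prune(XmlNode parent, ref int remaining)
    {
        XmlNode child = parent.FirstChild;
        while (child != null)
        {
            XmlNode next = child.NextSibling;                 // remember before a possible removal
            if (remaining <= 0)
            {
                parent.RemoveChild(child);
            }
            else if (IsText(child))
            {
                string value = child.Value;                   // entities are already decoded here
                if (value.Length > remaining)
                {
                    child.Value = value.Substring(0, remaining);
                }
                remaining -= value.Length;
            }
            else
            {
                Prune(child, ref remaining);
            }
            child = next;
        }
    }

    private static bool IsText(XmlNode node)
    {
        return node.NodeType == XmlNodeType.Text
            || node.NodeType == XmlNodeType.CDATA
            || node.NodeType == XmlNodeType.Whitespace
            || node.NodeType == XmlNodeType.SignificantWhitespace;
    }
}
```

Because the text is cut on decoded node values, an entity counts as one character, as in step 1. Elements that follow the cut are removed entirely, so no separate clean-up pass for empty elements is needed in this variant.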

PS: the same would be way easier to read and understand were you to use XSLT, which is the ideal tool for this type of job.

Update: added extended code sample, based on edited question
Update: added a bit of explanation

Abel
  • HTML Agility Pack and SgmlReader both handle the "HTML to XHTML" need quite nicely. I personally like SgmlReader better, but both are good. – Asbjørn Ulsberg Nov 11 '09 at 12:45
  • This isn't what is asked in the original question. The formatting should be preserved; yet shouldn't count for the number of chars requested. – Jan Jongboom Nov 11 '09 at 12:45
  • @Jan: where in the question does it say so? But I'd be happy to update the same method including the formatting / counting issue – Abel Nov 11 '09 at 12:48
  • The part "What I would want is:" in the referenced question. So 26 char summary, should be 26 chars; PLUS the HTML, etc. See stian.net's answer. – Jan Jongboom Nov 11 '09 at 12:55
  • Thanks Jan, I didn't read so far up. Question is meanwhile edited with full description, my answer is edited as well (see bottom half) – Abel Nov 11 '09 at 16:08
  • As soon as you add an non-breaking space to the HTML your code will break. – Dan Diplo Nov 11 '09 at 16:11
  • Thanks Abel for this, I will hook it up and see how it fairs, much appreciated!!!! – WickedW Nov 11 '09 at 16:18
  • @Dan: you're right, though it shouldn't, but MS misbehaves against the XML recommendation. You can resolve it in some ways, but easiest is: use Silverlight libs to solve the issue. See http://blogs.msdn.com/xmlteam/archive/2008/08/14/introducing-the-xmlpreloadedresolver.aspx – Abel Nov 11 '09 at 17:35
  • This doesn't handle arbitrary sections of HTML, adding some dummy root tags either side of the input string, i.e. String.Format("{0}", inputString), then just before final output (above the Debug tag) add: navigator.MoveToFirstChild(); which will give you back the basic HTML block. Cracking stuff though +1 – Lazarus Apr 15 '10 at 22:40
  • @Lazarus: in review, I agree that the implementation above is a bit limited, but that's perhaps the nature of such short snippets. Glad you like it :) – Abel Apr 18 '10 at 19:39
  • 2 years on still helpful :) many thanks. Couple of notes... if you're looking to format the text from an AJAX HtmlEditorExtender, make sure you perform the following: _content = _content.Replace(@"
    ", @"
    "); _content = _content.Replace(@"
    ", @"
    "); _content = @"" + _content + @""; xml.LoadXml(_content);
    – tutts Aug 24 '11 at 14:41
  • @rocky: glad it is still of help! Your replace won't be necessary when the content is valid XHTML to begin with, a matter of good coding and the correct doctypes. – Abel Aug 24 '11 at 16:33
  • yep I hear you, although this is the ajax control thats formatting the text, not my code. So you need to validate it before using it. Probably should be something to be passed onto the .net ajax team :) – tutts Aug 25 '11 at 13:24
4

If you want to maintain the html tags you can use this gist which I have recently published. https://gist.github.com/2413598

It uses XmlReader/XmlWriter. It is not production ready, i.e. you'd probably want SgmlReader or HtmlAgilityPack AND you'd want try-catches and choose some fallback...

Jaap
2

Ok. This should work (dirty code alert):

        string blah = "hoi <strong>dit <em>is test bla meer tekst</em></strong>";
        int aantalChars = 10;


        bool inTag = false;
        int cntr = 0;
        int cntrContent = 0;
        foreach (Char c in blah)
        {
            if (cntrContent == aantalChars) break;



            cntr++;
            if (c == '<')
            {
                inTag = true;
                continue;
            }
            else if (c == '>')
            {
                inTag = false;
                continue;
            }

            if (!inTag) cntrContent++;
        }

        string substr = blah.Substring(0, cntr);

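        // close whatever is still open, innermost first
        // (needs System.Collections.Generic and System.Text.RegularExpressions)
        Stack<string> open = new Stack<string>();
        foreach (Match tag in new Regex(@"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>").Matches(substr))
        {
            string name = tag.Groups[2].Value;
            if (tag.Groups[1].Value == "/")
            {
                // closing tag: pop only when it matches the innermost open element
                if (open.Count > 0 && open.Peek() == name) open.Pop();
            }
            else if (tag.Groups[3].Value != "/" && name != "br")
            {
                open.Push(name);   // push the bare name, so attributes never end up in a closing tag
            }
        }
        while (open.Count > 0) substr += "</" + open.Pop() + ">";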
Jan Jongboom
0

This is complicated and, as far as I can see, none of the PHP solutions is perfect. What if the text is:

substr("Hello, my <strong>name is <em>Sam</em>. I&acute;m a 
  web developer.  And this text is very long and all the text 
  is inside the sam html tag..</strong>",0,26)."..."

You will actually have to iterate through the whole text to find the end of the starting strong-tag.

My advice to you is to strip all HTML in the summary. Remember to sanitize the HTML if you are showing users' own HTML code!
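
A minimal sketch of that strip-everything approach in C#, for illustration only. `PlainSummary` and `Make` are hypothetical names, and the tag regex is a rough assumption that does not cope with '>' inside attribute values, scripts or comments.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainSummary
{
    // Returns at most maxChars characters of plain text, followed by "..." if it was cut.
    public static string Make(string html, int maxChars)
    {
        // Drop anything that looks like a tag. Text in adjacent block elements
        // ends up joined without a space, which is a known limitation of this sketch.
        string text = Regex.Replace(html, "<[^>]*>", "");
        text = WebUtility.HtmlDecode(text);                  // turn entities into single characters
        text = Regex.Replace(text, @"\s+", " ").Trim();      // collapse runs of whitespace
        if (text.Length <= maxChars) return text;
        return text.Substring(0, maxChars).TrimEnd() + "...";
    }
}
```

The result is plain text, so it should be passed through something like `WebUtility.HtmlEncode` before being written back into a page.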

Good luck :)

Community
  • Stripping HTML is definitely easiest. But using XML + XPath (for XHTML, or sanitized HTML) to do the job makes this rather trivial. Though the bulk of the work has been getting "removing the rest" right, complex or hard is not the word I'd choose. But, doing the same with text parsing techniques is way harder (which is what PHP uses). – Abel Nov 11 '09 at 16:14