Correctly removing html entities from a string

Question

I have a problem with removing HTML entities from strings. I tried System.Web.HttpUtility.HtmlDecode and would like to see &nbsp; replaced with a regular space. Instead, a weird hex code is returned. I have read the following two topics and learned that this is most probably an encoding issue, but I can't find a way to solve it.

Removing HTML entities in strings

How do I remove all HTML tags from a string without knowing which tags are in it? ("I realize that...", Thierry_S)

The source string that should be stripped of HTML tags and entities is stored in a database with the collation SQL_Latin1_General_CP1_CI_AI. For my unit test, however, I simply created a test string in Visual Studio, whose encoding is not necessarily the same as that of the data stored in the database.

My unit test fails with 'Not Equal' because the &nbsp; is not replaced with a regular space. Initially, it returned 2C, but after lots of testing and trying to convert from one encoding to another, it now returns A0, even though I have removed all encoding-changing code from my function.

My question is two-fold:

1. How can I make my unit test pass?
2. Am I testing correctly, since the database encoding could be different from the text I have manually typed in my unit test?

My function:

public static string StripHtml(string text)
{
    // Remove html entities like &nbsp;
    text = System.Net.WebUtility.HtmlDecode(text);

    // Init Html Agility Pack
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(text);

    // Return without html tags
    return htmlDoc.DocumentNode.InnerText;
}

My unit test:

public void StripHtmlTest()
{
    // arrange
    string html = "<p>This is&nbsp;a very <b>fat, <i>italic</i> and <u>underlined</u> text,<!-- foo bar --> sigh.</p> And 6 < 9 but > 3.";
    string actual;
    string expected = "This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.";

    // act
    actual = StaticRepository.StripHtml(html);

    // assert
    Assert.AreEqual(expected, actual);
}

Test result:

Message: Assert.AreEqual failed. Expected:<This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.>. Actual:<This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.>.

Test result in HEX: (image not included)

vasil oreshenski · Accepted Answer · 2020-01-09T16:07:11.373

Well, &nbsp; is not a 'regular' space. System.Net.WebUtility.HtmlDecode returns the character that the named HTML entity stands for, which is the non-breaking space (U+00A0). It looks like a regular space, but it is a different character with a different meaning. Its code point is 160 in decimal, which is A0 in hex, so your decoding is working correctly; the test fails only because the expected string contains a regular space (32 in decimal, 20 in hex) at that position.
If you want to replace &nbsp; with a regular space, you have several options. The easiest is a simple string replace before decoding:

// where the second argument is whitespace char with decimal representation 32
text = text.Replace("&nbsp;", " ");
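Another option, sketched here in case the input might also contain the numeric forms &#160; or &#xA0; (which the string replace above would not catch), is to decode first and then swap the resulting U+00A0 character for an ordinary space. The class and method names are made up for illustration; only WebUtility.HtmlDecode and string.Replace are real framework calls.

```csharp
using System;
using System.Net;

public static class NbspExample
{
    // Hypothetical helper: decodes HTML entities, then turns every
    // non-breaking space (U+00A0) into an ordinary space (U+0020).
    public static string DecodeWithRegularSpaces(string text)
    {
        if (text == null)
        {
            return null;
        }

        // &nbsp;, &#160; and &#xA0; all decode to the single char U+00A0.
        string decoded = WebUtility.HtmlDecode(text);

        return decoded.Replace('\u00A0', ' ');
    }
}
```

Whether you want this depends on the use case: it also flattens non-breaking spaces that were deliberately typed into the text, not only the ones that came from entities.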

About the initial run: the hex value 2C is 44 in decimal, which is the ',' (comma) character. Is it possible that you were looking at the wrong character?

About the SQL collation: the Latin1 General collation is capable of storing the nbsp character, so I don't think this is a problem.
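One possible explanation for seeing two different hex values, assuming the first one was really C2 rather than 2C: the same U+00A0 character is two bytes (C2 A0) in UTF-8 but a single byte (A0) in ISO-8859-1. As far as I know, code page 1252, which the SQL_Latin1_General_CP1 collations use for non-Unicode columns, has it at that same position. The small sketch below makes this visible; the class and method names are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text;

public static class NbspBytes
{
    // Hypothetical helper: decodes &nbsp; and returns the bytes of the
    // resulting character in the given encoding as hex, e.g. "C2-A0".
    public static string DecodedNbspHex(Encoding encoding)
    {
        string nbsp = WebUtility.HtmlDecode("&nbsp;"); // the single char U+00A0
        return BitConverter.ToString(encoding.GetBytes(nbsp));
    }
}
```

DecodedNbspHex(Encoding.UTF8) gives "C2-A0", while DecodedNbspHex(Encoding.GetEncoding("ISO-8859-1")) gives "A0". So which hex value you see depends on how the string was turned into bytes before it was dumped, not on the decoding itself.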

Thank you for your reply. I mistyped 2C; it was actually C2, so not a comma. Nevertheless, your reply has helped me by clearing up the difference between a regular space and a non-breaking space. - Leonard, Jan 10 '20 at 07:48
