I have a problem with removing html entities from strings. I try System.Web.HttpUtility.HtmlDecode
, and would like to see
being replaced with a regular space. Instead, a weird hex code is returned. I have read the following two topics and learned that this is most probably an encoding issue, but I can't find a way to solve it.
Removing HTML entities in strings
How do I remove all HTML tags from a string without knowing which tags are in it? ("I realize that...", Thierry_S)
The source string that should be stripped from html codes and entities is saved in a database with SQL_Latin1_General_CP1_CI_AI
as collation, but for my unit test, I simply created a test string in Visual Studio, of which the encoding is not necessarily the same as the encoding of the data that is stored in the database.
My unit test asserts 'Not Equal' since the
is not replaced with a regular space. Initially, it returned 2C
, but after lots of testing and trying to convert from some encoding to another, it now returns A0
even though I have removed all encoding changing code from my function.
My question is two-fold:
- How can I make my unit test pass?
- Am I testing correctly, since the database encoding could be different from the text I have manually typed in my unit test?
My function:
public static string StripHtml(string text)
{
// Remove html entities like
text = System.Net.WebUtility.HtmlDecode(text);
// Init Html Agility Pack
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(text);
// Return without html tags
return htmlDoc.DocumentNode.InnerText;
}
My unit test:
public void StripHtmlTest()
{
// arrange
string html = "<p>This is a very <b>fat, <i>italic</i> and <u>underlined</u> text,<!-- foo bar --> sigh.</p> And 6 < 9 but > 3.";
string actual;
string expected = "This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.";
// act
actual = StaticRepository.StripHtml(html);
// assert
Assert.AreEqual(expected, actual);
}
Test result:
Message: Assert.AreEqual failed. Expected:<This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.>. Actual:<This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.>.
Test result in HEX: