My program receives text like this:

K&#253; Sinh Tr&#249;ng &nbsp; - (2019)

This is wrong; it should look like this:

Ký Sinh Trùng  - (2019)

I tried the following code, but it had no effect:

byte[] bytes = Encoding.Default.GetBytes(nodes.InnerText);
var myString = Encoding.UTF8.GetString(bytes);

How can I fix this?

Update: Full Code:

HtmlWeb Webget = new HtmlWeb();

var docx = await Webget.LoadFromWebAsync(@"https://isubtitles.org/search?kwd=parasite");

var items = docx.DocumentNode.SelectNodes("//div[@class='movie-list-info']");

foreach (var node in items)
{
    var name = node?.SelectSingleNode(".//div/div[2]/h3/a");
    var xxxx = name?.InnerText;

    byte[] bytes = Encoding.UTF8.GetBytes(xxxx);
    var myString = Encoding.UTF8.GetString(bytes);
    Debug.WriteLine(myString);
    return;
}
karma
  • Those characters appear to be HTML encoded. I would double check whether the source has these HTML encoded characters to determine whether HtmlAgilityPack is to blame. – phuzi Sep 03 '21 at 09:11
  • There's no problem at all. `&#253;` is the HTML-encoded form of `ý`, not UTF8. Browsers will display it just fine. This page is UTF8, which is why it can display `Ký Sinh Trùng` or `Αυτό Εδώ` without requiring explicit ... HTML encoding – Panagiotis Kanavos Sep 03 '21 at 09:37
  • UTF8 specifies how text is converted to bytes. It doesn't specify any kind of escape sequences. It's no different to ASCII/Latin1 in that regard. Almost all web sites use UTF8. – Panagiotis Kanavos Sep 03 '21 at 09:42

1 Answer

That's just HTML-encoded text, and it's fine. If you need to decode it, use:

System.Net.WebUtility.HtmlDecode(theHtmlEncodedString)

https://learn.microsoft.com/en-us/dotnet/api/system.net.webutility.htmldecode?view=net-5.0

or (if you have System.Web loaded):

System.Web.HttpUtility.HtmlDecode(theHtmlEncodedString)

https://learn.microsoft.com/en-us/dotnet/api/system.web.httputility.htmldecode?view=net-5.0
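Applied to the title from the question, a minimal sketch might look like the following. It assumes the raw `InnerText` contains numeric entities such as `&#253;` (the exact entities the site emits may differ), and the `TitleDecoder` class and `Decode` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Net;

public static class TitleDecoder
{
    // Hypothetical helper: converts HTML entities in scraped text to plain characters.
    public static string Decode(string rawInnerText)
    {
        // Guard against a missing node, in which case InnerText would be null.
        return WebUtility.HtmlDecode(rawInnerText ?? string.Empty);
    }

    public static void Main()
    {
        // Assumed sample input, modelled on the text in the question.
        string raw = "K&#253; Sinh Tr&#249;ng - (2019)";
        Console.WriteLine(Decode(raw));
    }
}
```

In the question's loop, this would replace the `GetBytes`/`GetString` round trip, which converts a string to bytes and back without touching the entities.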

spender
  • Thank you, it works now. One more question: if the text is not encoded, will decoding it cause a problem? – karma Sep 03 '21 at 09:43
  • @karma Only if it coincidentally contains escape sequences such as `&nbsp;`, `&lt;` etc. This seems unlikely. – spender Sep 03 '21 at 09:45
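
The point in the last comment can be checked with a short sketch. The sample strings and the `DecodeCheck` class name below are made up for illustration; only `WebUtility.HtmlDecode` comes from the answer.

```csharp
using System;
using System.Net;

public static class DecodeCheck
{
    public static void Main()
    {
        // Plain text without any entities comes back unchanged.
        string plain = "Parasite - (2019)";
        Console.WriteLine(WebUtility.HtmlDecode(plain) == plain); // True

        // Text that happens to contain a literal escape sequence is altered.
        string literal = "Use &lt;br&gt; for a line break";
        Console.WriteLine(WebUtility.HtmlDecode(literal)); // Use <br> for a line break
    }
}
```

So decoding text that was never encoded is harmless unless it contains something that looks like an entity.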