I am using CsQuery to parse a website in Arabic. When I call Text(), it returns the text as is, but when I call Html(), the output is HTML-encoded. For example, given this HTML tag:

<div>تعلن عن إرسالها مركبة فضائية للمريخ قريباً جداً</div>

When I use:

dom["div"].Text();

It returns: "تعلن عن إرسالها مركبة فضائية للمريخ قريباً جداً". However, when I use:

dom["div"].Html();

It returns:

&#1578;&#1593;&#1604;&#1606; &#1593;&#1606; &#1573;&#1585;&#1587;&#1575;&#1604;&#1607;&#1575; &#1605;&#1585;&#1603;&#1576;&#1577; &#1601;&#1590;&#1575;&#1574;&#1610;&#1577; &#1604;&#1604;&#1605;&#1585;&#1610;&#1582; &#1602;&#1585;&#1610;&#1576;&#1575;&#1611; &#1580;&#1583;&#1575;&#1611;

How can I use Html() while preserving the actual text, without the encoding? I need Html() because it also returns any tags nested inside the selected element.

Edit: here is the Content-Type declaration from the original HTML page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Lamar

2 Answers

I ended up using System.Net.WebUtility.HtmlDecode() to decode the output of the Html() method.
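
A minimal sketch of that workaround, for illustration only. The class name HtmlDecodeSketch and the helper DecodeEntities are hypothetical, and the hard-coded string stands in for whatever dom["div"].Html() returns; only System.Net.WebUtility.HtmlDecode() comes from the answer itself.

```csharp
using System;
using System.Net;

public static class HtmlDecodeSketch
{
    // Hypothetical helper: turns numeric character references such as
    // &#1578; back into the characters they stand for.
    public static string DecodeEntities(string encodedHtml)
    {
        return WebUtility.HtmlDecode(encodedHtml);
    }

    public static void Main()
    {
        // Assumption: this literal stands in for the result of dom["div"].Html().
        string encoded = "<span>&#1578;&#1593;&#1604;&#1606;</span> &#1593;&#1606;";

        // The tags pass through unchanged; only the entities are decoded.
        Console.WriteLine(DecodeEntities(encoded));
    }
}
```

One caveat with this approach: HtmlDecode() decodes every entity in the string, so text that was deliberately escaped (for example &lt; or &amp; inside the content) is unescaped too, which may matter if the result is parsed as HTML again.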

Lamar

If you're scraping the HTML page with WebClient (as I was), setting the client's encoding to UTF-8 before downloading should help:

var client = new WebClient();
// Decode downloaded strings as UTF-8 rather than the client's default encoding.
client.Encoding = System.Text.Encoding.UTF8;
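
To see why the encoding setting matters, here is a small offline sketch (no network access) that decodes the same UTF-8 bytes two ways. The names EncodingSketch and Decode are hypothetical, and ISO-8859-1 is just an assumed stand-in for a non-UTF-8 default; the actual default depends on the runtime and system, and the server's response headers may also influence how WebClient decodes the body.

```csharp
using System;
using System.Text;

public static class EncodingSketch
{
    // Hypothetical helper: interprets raw bytes using the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        string original = "\u062A\u0639\u0644\u0646"; // the Arabic word "تعلن"
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // Decoding with the matching encoding round-trips the text.
        string right = Decode(utf8Bytes, Encoding.UTF8);

        // Decoding the same bytes as ISO-8859-1 (an assumed wrong default)
        // maps every byte to a separate character, producing mojibake.
        string wrong = Decode(utf8Bytes, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(right == original);
        Console.WriteLine(wrong == original);
        Console.WriteLine(wrong.Length);
    }
}
```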