CsQuery: parsing non-English text

Question

I am using CsQuery to parse a website in Arabic. When I use the Text() function, it returns the text as is; however, when I use the Html() function, it returns the text HTML-encoded as numeric character references. For example, this is my HTML tag:

<div>تعلن عن إرسالها مركبة فضائية للمريخ قريباً جداً</div>

When I use:

dom["div"].Text();

It returns: "تعلن عن إرسالها مركبة فضائية للمريخ قريباً جداً". However, when I use:

dom["div"].Html();

It returns:

&#1578;&#1593;&#1604;&#1606; &#1593;&#1606; &#1573;&#1585;&#1587;&#1575;&#1604;&#1607;&#1575; &#1605;&#1585;&#1603;&#1576;&#1577; &#1601;&#1590;&#1575;&#1574;&#1610;&#1577; &#1604;&#1604;&#1605;&#1585;&#1610;&#1582; &#1602;&#1585;&#1610;&#1576;&#1575;&#1611; &#1580;&#1583;&#1575;&#1611;

The question is: how can I use Html() while preserving the actual text without encoding? I need Html() because it also returns any tags nested inside the selected element.

Edit: here's the content type declared by the original HTML page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Why don't I see any difference between the return values of `Text()` and `Html()`? - Khalil Khalaf, May 27 '16 at 13:55
@FirstStep: Because your browser automatically encodes it since it was not formatted correctly. :) - Visual Vincent, May 27 '16 at 14:02
Why isn't it encoded correctly? Isn't this basic UTF-8 encoding? - Lamar, May 27 '16 at 14:14
FYI: CsQuery is no longer maintained. The maintainer recommends AngleSharp as a replacement. https://github.com/jamietre/CsQuery - ShuberFu, May 27 '16 at 14:15
I don't have access to the full code at the moment, but this is pretty much it. I create a dom object from the HTML source string. - Lamar, May 27 '16 at 14:42

score 0 · Accepted Answer · answered May 28 '16 at 09:14

I ended up using System.Net.WebUtility.HtmlDecode() to decode the output of the Html() function.
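A minimal sketch of that workaround, which should run without CsQuery installed: the input string stands in for what Html() returned in the question, and the `<b>` wrapper, the class name `HtmlDecodeSketch`, and the helper `Decode` are hypothetical names added for illustration.

```csharp
using System;
using System.Net;

static class HtmlDecodeSketch
{
    // Hypothetical helper: turns numeric character references such as
    // &#1578; back into the characters they stand for. Tags are left as-is.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);

    static void Main()
    {
        // Stand-in for the output of Html(): the first word of the sample
        // text, wrapped in an illustrative <b> tag.
        string encoded = "<b>&#1578;&#1593;&#1604;&#1606;</b>";

        string decoded = Decode(encoded);

        // Compare against the expected characters (U+062A U+0639 U+0644 U+0646)
        // rather than printing them, since console encodings vary.
        Console.WriteLine(decoded == "<b>\u062A\u0639\u0644\u0646</b>");
    }
}
```

Note that HtmlDecode() also decodes references such as `&lt;` and `&amp;` in the text content, so text that was deliberately escaped comes back as literal characters. That may matter if the decoded string is parsed as HTML again.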

Lamar

score 0 · Answer 2 · answered May 09 '19 at 12:43

If you're scraping an HTML page with WebClient (as I was), setting the client's encoding to UTF-8 should help:

using System.Net;
var client = new WebClient();
// Use UTF-8 when converting downloaded data to strings.
client.Encoding = System.Text.Encoding.UTF8;
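To illustrate why the encoding setting matters, here is a small sketch that decodes the same UTF-8 bytes twice, once correctly and once with a mismatched encoding. It assumes the page bytes are UTF-8, as the question's meta tag states. ISO-8859-1 is only a stand-in for "some wrong encoding" and is not a claim about what WebClient uses by default. `EncodingSketch` and `Decode` are hypothetical names.

```csharp
using System;
using System.Text;

static class EncodingSketch
{
    // Hypothetical helper: convert raw response bytes to a string,
    // as a download-to-string step has to do with some encoding.
    public static string Decode(byte[] raw, Encoding encoding) => encoding.GetString(raw);

    static void Main()
    {
        // The first word of the sample text, as UTF-8 bytes (2 bytes per letter).
        byte[] raw = Encoding.UTF8.GetBytes("\u062A\u0639\u0644\u0646");

        string right = Decode(raw, Encoding.UTF8);
        string wrong = Decode(raw, Encoding.GetEncoding("ISO-8859-1"));

        Console.WriteLine(right == "\u062A\u0639\u0644\u0646"); // True
        Console.WriteLine(wrong.Length); // 8: each byte became one Latin-1 character
    }
}
```

If the downloaded string is already garbled like `wrong` above, no later step in the parser can recover the Arabic text, which is why the encoding is set before downloading.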

Mohamed Anas Ben Othman
