Issues decoding strings from Xml

Question

I have been given a large number of XML files from which I need to pull out parts of the text elements and reuse them for other purposes. (I am using XDocument to read the XML data.)

But, how do I decode the text contained in the elements? What is even the formatting used here? A few examples:

"What is the meaning of this&amp;reg; asks Sonny."
"The big centre cost 1&amp;#190; million pounds"
"... lost it. &amp;#174; The next ..."

I have tried HttpUtility.HtmlDecode, but that did not do the trick. If I decode twice, the "&amp;reg;" turns into a ®, which is obviously not right.

Looks like the &amp;#174; are line breaks. The &amp;reg; are probably question marks. The 190 one, I don't even know. Perhaps a dot or comma?

Any ideas would be welcome.

score 0 · Accepted Answer · answered Apr 06 '12 at 10:20

It does appear that the strings you show have been HTML encoded, and then XML encoded (or HTML again).

It is correct that &amp;reg; -> &reg; -> ® (the registered trademark symbol) per the ISO Latin-1 entities. &amp;#174; should behave the same way.

Similarly, &amp;#190; would turn into ¾, the fraction representing three quarters.
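
To make the two decoding steps concrete, here is a minimal sketch. It assumes WebUtility.HtmlDecode from System.Net (rather than the HttpUtility.HtmlDecode mentioned in the question), and the class and method names are made up for illustration. It simply applies the HTML decoder a fixed number of times:

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of any library.
public static class EntityDecoding
{
    // Applies HTML decoding `passes` times in a row.
    // One pass turns "&amp;#190;" into "&#190;"; a second pass turns that into "¾".
    public static string DecodeTimes(string text, int passes)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        for (int i = 0; i < passes; i++)
        {
            text = WebUtility.HtmlDecode(text);
        }
        return text;
    }
}
```

For example, EntityDecoding.DecodeTimes("The big centre cost 1&amp;#190; million pounds", 2) should give "The big centre cost 1¾ million pounds". Whether you need one pass or two depends on whether the XML parser has already unescaped the outer &amp; for you, so it is worth checking a sample value first.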

answered Apr 06 '12 at 10:20

Rowland Shaw

The issue is then that the result makes no sense once I decode the text. When decoded twice, it becomes obvious that the ® symbols really should be question marks. – BlueVoodoo Apr 06 '12 at 10:27
But looking through the results, it may be that this is the only one that is not working. Decoding twice on the others seems to work. Will test a bit more. – BlueVoodoo Apr 06 '12 at 10:32
Yep, everything else works. Will accept this as an answer and do a .Replace on that symbol. – BlueVoodoo Apr 06 '12 at 10:43
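
A possible variation on the .Replace idea, as a rough sketch only: after a full double decode, &amp;reg; and &amp;#174; both end up as the same ® character, so if they really mean different things they have to be swapped out between the two decoding passes. The mapping below (&reg; to a question mark, &#174; to a line break) is just the guess from the question, not something confirmed, and the class and method names are hypothetical. WebUtility.HtmlDecode from System.Net is assumed here as well.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of any library.
public static class EntityCleanup
{
    public static string DecodeWithOverrides(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        // First pass: "&amp;reg;" -> "&reg;" and "&amp;#174;" -> "&#174;".
        string once = WebUtility.HtmlDecode(text);

        // Assumed mapping, taken from the asker's guesses about the data:
        // the named form stands for "?" and the numeric form for a line break.
        string patched = once
            .Replace("&reg;", "?")
            .Replace("&#174;", "\n");

        // Second pass: decode whatever entities are left, e.g. "&#190;" -> "¾".
        return WebUtility.HtmlDecode(patched);
    }
}
```

With this, "What is the meaning of this&amp;reg; asks Sonny." should come out as "What is the meaning of this? asks Sonny.", while other entities are still decoded normally.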