I have been given a large number of XML files from which I need to pull out parts of the text elements and reuse them for other purposes. (I am using XDocument to read the XML data.)

But how do I decode the text contained in the elements? What encoding is even being used here? A few examples:

"What is the meaning of this&amp;#174; asks Sonny."
"The big centre cost 1&amp;#190; million pounds"
"... lost it. ® The next ..."

I have tried HttpUtility.HtmlDecode, but that did not do the trick. If I decode twice, the "&amp;#174;" turns into a ®, which is obviously not right.

Looks like ® are line breaks. The ® are probably question marks. The 190 one, I don't even know. Perhaps a dot or comma?

Any ideas would be welcome.

animuson
BlueVoodoo

1 Answer

It does appear that the strings you show have been HTML encoded, and then XML encoded (or HTML again).

It is correct that &amp;#174; -> &#174; -> ® (the registered trademark symbol), per the ISO Latin-1 entities. &reg; should behave the same way.

Similarly, &amp;#190; would turn into ¾, the fraction representing three quarters.
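
As a rough illustration of decoding one layer at a time, here is a minimal sketch. It assumes the text really is encoded exactly twice, and it uses System.Net.WebUtility.HtmlDecode instead of HttpUtility.HtmlDecode so that no reference to System.Web is needed. The class and method names (EntityDecoder, DecodeLayers) are made up for this example.

```csharp
using System.Net;

public static class EntityDecoder
{
    // Hypothetical helper: runs HTML decoding once per layer of encoding.
    // Two passes is an assumption based on the strings shown in the question.
    public static string DecodeLayers(string text, int passes = 2)
    {
        string result = text;
        for (int i = 0; i < passes; i++)
        {
            // Pass 1 turns "&amp;#174;" into "&#174;", pass 2 turns that into "®".
            result = WebUtility.HtmlDecode(result);
        }
        return result;
    }
}
```

Calling EntityDecoder.DecodeLayers on the second example string should give "The big centre cost 1¾ million pounds". If some files turn out to be encoded only once, the extra pass is normally harmless, because text with no entities left in it comes back unchanged.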

Rowland Shaw
  • The issue is then that the result makes no sense once I decode the text. When decoded twice, it becomes obvious that the ® symbols really should be question marks. – BlueVoodoo Apr 06 '12 at 10:27
  • But looking through the results, it may be that this is the only one that is not working. Decoding twice on the others seems to work. Will test a bit more. – BlueVoodoo Apr 06 '12 at 10:32
  • Yep, everything else works. Will accept this as an answer and do a .Replace on that symbol. – BlueVoodoo Apr 06 '12 at 10:43
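
The workaround described in the last comment might look roughly like the sketch below. It is only an illustration under assumptions: the element name passed in is whatever the real files use, the names TextExtractor, ExtractTexts and Clean are invented here, and replacing every ® with a question mark is only safe if the data contains no genuine registered trademark symbols.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static class TextExtractor
{
    // Hypothetical helper: collects the cleaned text of every element
    // with the given name from an XML string.
    public static List<string> ExtractTexts(string xml, string elementName)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants(elementName)
                  .Select(e => Clean(e.Value))
                  .ToList();
    }

    // Decodes twice, then swaps the stray ® for the question mark
    // it appears to stand for in this data set (an assumption).
    public static string Clean(string value)
    {
        string decoded = WebUtility.HtmlDecode(WebUtility.HtmlDecode(value));
        return decoded.Replace("\u00AE", "?");
    }
}
```

Note that XDocument already resolves the XML-level escaping when it exposes e.Value, so the two HtmlDecode calls here apply to whatever entities are still left in that value.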