
I've recently been having trouble with text returned by a web service I consume. The service sends back XML, which is fine, but some of the XML contains ASCII control characters. I wanted to paste an example into this post, but because the characters are invalid I can't even paste them into this textarea.

I spent some time researching what to do in these cases and I found this informative article: http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/. Here is a quote from this article that is relevant:

These aren’t characters that have any business being in XML data; they’re illegal characters that should be removed...

So, following the article's advice, I've written some code that takes the raw output from this service and strips out any control character that is not a space, tab, CR or LF.

Here is that code:

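// Requires: using System.Linq;
System.Net.WebClient client = new System.Net.WebClient();

// Every control character below 0x20 except tab (0x9), LF (0xA) and CR (0xD), plus DEL (0x7F).
byte[] invalidCharacters = { 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB,
                             0xC, 0xE, 0xF, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15,
                             0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D,
                             0x1E, 0x1F, 0x7F };

// Drop the unwanted bytes from the raw response before decoding it.
byte[] sanitizedResponse = (from a in client.DownloadData(url)
                            where !invalidCharacters.Contains(a)
                            select a).ToArray();

// Decode the remaining bytes as UTF-8.
result = System.Text.Encoding.UTF8.GetString(sanitizedResponse);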

This got me thinking, though. If I receive double-byte characters, will I screw up any of the data I'm getting back? Is it valid in some code pages for a double-byte character to be made up of one or two bytes that, on their own, are ASCII control characters? The article's claim that these characters have "no business" being in XML data sounds final, but I want a second opinion.

Appreciate any feedback

omatase

2 Answers


Well, the code you've shown assumes UTF-8, which by design never uses any of those bytes except to encode those very characters. However, I'd encourage a text-driven approach instead of this byte-driven approach: I'd probably use DownloadString instead of DownloadData (and rely on WebClient picking the right encoding), then scrub the data with a regex before parsing it.
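The UTF-8 property mentioned above can be checked directly. The following is only a minimal sketch: the class and method names (`Utf8Check`, `MultiByteSequencesAvoidAscii`) are made up for illustration and are not part of any library. It encodes each character of a string separately and confirms that every byte of a multi-byte sequence is 0x80 or higher, so none of them can collide with an ASCII control byte.

```csharp
using System.Linq;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: returns true if every character that needs more
    // than one UTF-8 byte is encoded using only bytes >= 0x80.
    public static bool MultiByteSequencesAvoidAscii(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // Treat a surrogate pair as a single character (code point).
            int length = char.IsSurrogatePair(text, i) ? 2 : 1;
            byte[] encoded = Encoding.UTF8.GetBytes(text.Substring(i, length));

            // Single-byte results are plain ASCII; only multi-byte ones matter here.
            if (encoded.Length > 1 && encoded.Any(b => b < 0x80))
            {
                return false;
            }

            i += length - 1; // skip the low surrogate if we consumed a pair
        }
        return true;
    }
}
```

This only covers UTF-8. Other multi-byte encodings are a separate matter, so the byte-level filter is best treated as safe only when the response really is UTF-8.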

I'd also contact the web service provider to explain that they're serving duff XML...
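A rough sketch of the text-driven approach described above, assuming the response has already been fetched as a string. The `XmlScrubber` class and its `Scrub` method are illustrative names, and the character class simply mirrors the set of control characters the question strips (everything below 0x20 except tab, LF and CR, plus DEL).

```csharp
using System.Text.RegularExpressions;

public static class XmlScrubber
{
    // Control characters below 0x20 other than tab (\x09), LF (\x0A) and
    // CR (\x0D), plus DEL (\x7F). This mirrors the question's byte list.
    private static readonly Regex ControlChars =
        new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]", RegexOptions.Compiled);

    // Hypothetical helper: removes the unwanted control characters from
    // already-decoded text, leaving all other characters untouched.
    public static string Scrub(string rawXml)
    {
        return ControlChars.Replace(rawXml, string.Empty);
    }
}
```

With this in place, the calling code could pass the result of `client.DownloadString(url)` through `XmlScrubber.Scrub` before handing it to an XML parser. Because the scrubbing happens on decoded text, it does not depend on which encoding the service used.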

Jon Skeet

Try the following:

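// Requires: using System.IO; using System.Text;
// Note: Encoding.ASCII replaces every non-ASCII character with '?'. Control
// characters (0x00-0x1F, 0x7F) are themselves ASCII, so they pass through unchanged.
byte[] byteArray = Encoding.ASCII.GetBytes(test); // 'test' is presumably the raw response text
MemoryStream stream = new MemoryStream(byteArray);
stream.Position = 0; // a new MemoryStream already starts at position 0
StreamReader reader = new StreamReader(stream);
string text = reader.ReadToEnd();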
vegemite4me
Kingjj