0

I am trying to extract £ signs from an XML feed (full version is here > http://mjsiphone.com/scotjobsnet/)

A mini version is here > http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml

I have no control over the source of the feed or the source server or how the feeds are formatted, the headers they use etc or so on.

I just have to build a .NET 4.5 console application that will extract the data and save it on our own sites database.

Also I have to strip all HTML (from job descriptions) and remove any HTML Encoded characters and replace them with their real values.

Therefore I need to save real £ signs in nvarchar datatypes in an MS SQL 2008 DB not £20,000 or £20,000 etc.

When viewing the source of the feed it has UTF-8 at the top of it.

However when viewing the feed in a browsers source I don't see any mention of UTF-8 as a Request/Response header and in the Request Headers (Chrome) I only see:

Accept-Language:en-GB,en-US;q=0.8,en;q=0.6

When I copy and paste the characters from the browser or the console into SQL and check them they return 163 which is the correct ASCII character encoding e.g £

If you view the feed in a browser the pound signs show up fine.

When I just pipe out the content to a Windows Command console they show up fine as £ signs.

However when I try to save them to the DB or pipe the console debug out to a file in EditPlus (with character encoding set to UTF8 or ASCII) I just get squares in front of the numbers instead of the signs e.g in CMD

[path to .exe] > [path to debug.log file]

Either the console cannot pipe across the content to the editor correctly or I need to use the right encoding or pass along more headers or extract the XML differently.

Here is an example of the code I am using for testing this with, using just one field that uses £ signs in it and then breaking.

static void Main(string[] args)
{           

    Console.WriteLine("START");

    XmlDocument xDoc = new XmlDocument();
    string feedURL = "http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml";

    WebClient webClient = new WebClient();
    // need to pass a user-agent > 10 Chars to prevent blocking by OUR servers 403
    webClient.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");

    // piping out to console with this line below shows a £ but to a UTF-8 or ASCII file it's gibberish
    webClient.Headers.Add("Content-Type", "application/xml; charset=utf-8");

    // I tried this but still the console works but piping to an editor in UTF-8 or ASCII shows squares
    webClient.Headers.Add("Accept-Language", "utf-8,en-GB,en-US;q=0.8,en;q=0.6");

    // download as text - is this the problem? Should I be using a different method
    string feedText = webClient.DownloadString(feedURL);

    // load into XML object
    xDoc.LoadXml(feedText);

    if (xDoc != null)
    {
        XmlElement root = xDoc.DocumentElement;
        XmlNodeList xNodelst = root.SelectNodes("job");
        foreach (XmlNode node in xNodelst)
        {
            string salary = node.SelectSingleNode("candidateSalary").InnerText;

            // piped to cmd console the £ signs show but to a UTF-8 file they are just squares
            // I've tried adding the Encoding.UTF8 or Encoding.ASCII still no joy
            // Console.WriteLine("candidateSalary = " + salary,Encoding.UTF8);
            Console.WriteLine("candidateSalary = " + salary);

            break;
        }
    }

    Console.WriteLine("FINISH");
}

Any help will be much appreciated.

I am sure it's just a header I need to pass or maybe an issue with outputting the XML content to the Editor.

As I said before just viewing the output in a Windows console the £ show up fine.

Thanks

MonkeyMagix
  • 677
  • 2
  • 10
  • 30
  • Some of the examples should have shown & # 163; etc but stackoverflow obviously showed the encoded character as it wasn't double encoded in the editor – MonkeyMagix Jun 11 '14 at 13:01
  • Note that `application/xml` is a binary MIME type and you should let an XML parser [detect the encoding](http://www.w3.org/TR/xml/#sec-guessing). Don't trust cmd.exe. The default encoding is an old DOS OEM codepage from the 1980s and the default Windows _char_ encodings are generally "ANSI" which are also obsolete. Console.WriteLine is likely transcoding your strings (which are always UTF-16!) to some non-UTF-8 encoding. See also the CHCP command. Lastly, £ is not in the ASCII range. – McDowell Jun 11 '14 at 13:27
  • I just did a replace on the string to & #163; and then put the string into the XMLDocument. Then replaced back on insert. So you are not saying I should be passing extra headers or encoding somewhere then? – MonkeyMagix Jun 11 '14 at 15:26

1 Answers1

0

I expect the output of this command is not UTF-8:

Console.WriteLine(Console.OutputEncoding);

There are two transcoding operations here:

UTF-8 > UTF-16 string > console encoding

The correct way to detect the XML document encoding is described in the XML specification. XmlDocument will do this for you.

The console encoding can be set to UTF-8 or you could serialize encoded bytes to STDOUT directly.

Console.OutputEncoding = System.Text.Encoding.UTF8;
XmlDocument xDoc = new XmlDocument();
string feedURL = "http://scotjobsnet.co.uk.ni.strategiesuk.net/testfeed.xml";
WebClient client = new WebClient();
client.Headers.Add("user-agent", "Mozilla/5.0 (compatible; Job Feed Importer;)");
byte[] feed = client.DownloadData(feedURL);
xDoc.Load(new MemoryStream(feed));
if (xDoc != null)
{
  XmlElement root = xDoc.DocumentElement;
  XmlNodeList xNodelst = root.SelectNodes("job");
  foreach (XmlNode node in xNodelst)
  {
    string salary = node.SelectSingleNode("candidateSalary").InnerText;
    Console.WriteLine("candidateSalary = " + salary);
    break;
  }
}

More on cmd.exe and Unicode here.

McDowell
  • 107,573
  • 31
  • 204
  • 267
  • Well that code makes the console show some kind of Turkish/Arabic characters instead of pound signs in the console BUT in the debug file which is saved UTF8 it does now show pound signs. So I guess this is what I need. Thanks – MonkeyMagix Jun 13 '14 at 12:50
  • So your saying just passing a URL xDoc.Load(url) or string into Xdoc.LoadXML(string) won't handle the encoding? But if I use a memory stream and load an array of bytes into it then it will detect the correct encoding? A lot of the other answers for loading XML don't do what you show, they use WebClient.DownloadString(url) to get the string then pass it straight into the XmlDocument object OR just pass a url into it. I take it this is the correct way to ensure proper encoding is passed along> – MonkeyMagix Jun 13 '14 at 12:53
  • If you type `CHCP` in _cmd.exe_ you'll probably get [code page 850](http://en.wikipedia.org/wiki/Code_page_850) or similar. In CP850 `£` is the byte value `9C`. In UTF-8 it is the sequence `C2 A3`. When _cmd.exe_ treats this UTF-8 as CP850 it will print `┬ú`. If the console interprets the bytes as some other encoding, you'll get different corruptions. – McDowell Jun 13 '14 at 13:15
  • Regarding the XML encoding, using _XmlDocument.Load_ with a URL would be safe. Using _WebClient.DownloadString_ is potentially unsafe. As the [documentation notes](http://msdn.microsoft.com/en-us/library/fhd1f0sw.aspx): "After it downloads the resource, the method uses the encoding specified in the _Encoding_ property to convert the resource to a String." This would need to match the document encoding. – McDowell Jun 13 '14 at 13:19