0

I'm pulling data using System.Net.WebClient from a web site, and when the data comes back everything parses and looks good except letters with accents. For example, when it returns an é, SQL Server 2008 saves it as é.

Just need to figure out how to convert these UTF-8 characters into something SQL Server can read. I'm storing it in an NVARCHAR(MAX) datatype.

I'm using Linq-to-SQL to insert into the database if you were curious.

Any thoughts on what I could do to convert it to the proper format?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
jamesbar2
  • 614
  • 1
  • 9
  • 20

2 Answers2

4

Figured it out! When using the WebClient class, I was downloading the data as a string.

My Original Configuration...

System.Net.WebClient wc = new WebClient();
string htmlData = wc.DownloadString(myUri);

I tried to convert this data into a UTF-16...from it's current string, but since Microsoft operates in UTF-16, it had handled the conversion on its own.

Instead, I switched my approach to reading the actual byte[] array from the data like so...

System.Net.WebClient wc = new WebClient();
string htmlData = UTFConvert(wc.DownloadData(myUri));

private string UTFConvert(byte[] utfBytes)
{
    byte[] isoBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utfBytes);
    return Encoding.Unicode.GetString(isoBytes);
}

This fixed the problem, and SQL correctly see's the accents in everything now. Yippee.

Cheers all, and thanks for your help!

Community
  • 1
  • 1
jamesbar2
  • 614
  • 1
  • 9
  • 20
  • Every string in .NET is already UTF-16, so "converting" it into UTF-16 is nonsense. For "globalized" (localized!) systems this is also the most sensible choice. Maybe MSSQL should support UTF-8 natively, but I've found using varchar instead of nvarchar with the appropriate collation works well enough for "western language only" systems. – The Dag Nov 18 '15 at 07:32
3

Description of storing UTF-8 data in SQL Server. There is also a discussion of this topic at International Features in Microsoft SQL Server 2005. the gist of it is: SQL Server has no support for UTF-8. Feel free to upvote the request to Add support for storing UTF-8 natively in SQL Server.

As a note though, since you store Unicode string via LINQ, this would point that the problem occurs before writing into SQL Server. Namely your web pulling, does it appropriately convert the data read using an UTF-8 reader? Namely, do you read the WebResponse.GetResponseStream() via a StreamReader constructed with the appropriate UTF8Encoding? That should create the proper Unicode string and then the NVARCHAR storage in the DB (which is UCS-2) should be fine.

Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
  • Even if SQL supported it though, the value will get mangled by C#'s utf-16 strings won't it? That has been my experience with downloading web content in C#. In the end I work through the byte[] stream of the response and replace all the high-order characters with their HTML escape codes. – Chris Sainty Apr 19 '11 at 23:25
  • 1
    @Chris: If you use the properly configured StreamReader no character should be mangled at all. The Encoding with which the reader is constructed must match the actual response (page) encoding. – Remus Rusanu Apr 19 '11 at 23:28
  • Great, this helped. I have exactly the code I used to fix it. I got the response stream via the WebClient (in a byte[] array). Then I just ran an automatic UTF-8 to UTF-16 conversion on the byte[] array to return a proper string. I'll post the exact code tomorrow when I'm allowed to post an 'answer'. Thanks for your help @Remus – jamesbar2 Apr 19 '11 at 23:50