1

Environment: VB.NET, VS 2022 (latest build), SQL Server 2019 (latest patch)

I'm attempting to execute a simple INSERT of varchar data containing certain UTF-8 data, and regularly getting the SqlException "Invalid data for UTF8-encoded characters" when a single character is surrounded by left or right curly quotes (e.g. [left double quote]J[right double quote] -- like "J" with curly double quotes around it).

The error also occurs with curly single quotes, but DOES NOT occur with similar data not consisting of single characters not bounded in such a way (e.g. the contraction "isn[apostrophe]t" -- "isn't" with a curly apostrophe).

My code is similar to:

Dim SQL As String = "INSERT INTO MyTable ([MyField]) VALUES (@MyField)"
Dim curName As String

Using cmd As New SqlCommand(SQL, conn)
    cmd.Parameters.Add("@MyField", SqlDbType.VarChar, 80)

    While ...
       ...loop through filereader taking in UTF-8 encoded text data 

       curName = Await oRead.ReadLineAsync()
       cmd.Parameters("@MyField").Value = curName
       Try
          Await cmd.ExecuteNonQueryAsync()
       Catch ex As Exception
          Debug.Print("Error: " & ex.Message)  ' <--- Shows error here
       End Try
  End While
End Using

The column is set to varchar, with the database collation set to Latin1_General_100_CI_AS_SC_UTF8.

Any idea what's going, or if a workaround is possible?

Michael Liu
  • 52,147
  • 13
  • 117
  • 150
pbickford
  • 59
  • 1
  • 14
  • *"regularly getting the above error "* what error? – Thom A Jun 23 '23 at 17:37
  • 1
    [What driver](https://learn.microsoft.com/en-us/sql/connect/oledb/features/utf-8-support-in-oledb-driver-for-sql-server?view=sql-server-ver16) are you using in your connection string? – GSerg Jun 23 '23 at 17:38
  • "Invalid Data for UTF-8 Encoded Characters" – pbickford Jun 23 '23 at 18:44
  • Using SqlClient – pbickford Jun 23 '23 at 18:45
  • SqlClient is not a driver. – GSerg Jun 23 '23 at 20:43
  • Full Connectionstring: server=; database=; User ID=; Password= providerName="System.Data.SqlClient" – pbickford Jun 23 '23 at 22:09
  • Try [MSOLEDBSQL](https://www.connectionstrings.com/ole-db-driver-for-sql-server/). – GSerg Jun 23 '23 at 22:49
  • Try `Microsoft.Data.SqlClient` Nuget package – Charlieface Jun 25 '23 at 10:52
  • Hi Charlieface -- Tried with latest SQLclient nuget package--no change in result. – pbickford Jun 29 '23 at 20:21
  • Are you talking about [Smart quotes](https://support.microsoft.com/en-us/office/smart-quotes-in-word-702fc92e-b723-4e3d-b2cc-71dedaf2f343)? The following may be of interest: [201C](https://www.compart.com/en/unicode/U+201C), [201D](https://www.compart.com/en/unicode/U+201D), [2018](https://www.compart.com/en/unicode/U+2018), and [2019](https://www.compart.com/en/unicode/U+2019). – Tu deschizi eu inchid Jul 01 '23 at 16:45
  • In order to use Unicode data, the database column needs to be `nvarchar`. You may consider, replacing smart quotes with normal quotes or change the database column data type. – Tu deschizi eu inchid Jul 01 '23 at 16:51
  • @Tudeschizieuinchid that's completely incorrect. An `nvarchar` column in a UTF-8 database would contain UTF-16 data. UTF-8 specifically affects `varchar` columns. – Thom A Jul 03 '23 at 14:39
  • As evidence, notice that the UTF16 and USC2 columns have the same value in this [db<>fiddle](https://dbfiddle.uk/PGYKBvfJ) (*The UTF8 column must first be a `nvarchar` and then `CAST`/`CONVERT`ed to a `varcahr` for UTF-8 due to the database in DB Fiddle not being in a UTF-8 collation. This is not required in a [UTF-8 database](https://i.stack.imgur.com/X50o2.png).*) – Thom A Jul 03 '23 at 14:55
  • @ThomA: According to [UTF-16 and UTF-8 Encoding – SQL Server](https://sqlrebel.org/2021/07/29/utf-16-and-utf-8-encoding-sql-server/), it seems that I'm only _somewhat_ incorrect - unless I misunderstood the article. – Tu deschizi eu inchid Jul 03 '23 at 15:34
  • 1
    The linked article states that an `nvarchar` column would be unchanged, @Tudeschizieuinchid *"The UTF-16 character encoding used by nchar and nvarchar will remain unchanged."* So an `nvarchar` would be an UTF-16 value still, not UTF-8. The difference in a UTF-8 collation is that *both* `(var)char` and `n(var)char` can store Unicode data, just UTF-8 and UTF-16 respectively. – Thom A Jul 03 '23 at 15:41

3 Answers3

1

I don’t fully understand the underlying issue — my best guess is that System.Data.SqlClient is old and simply doesn’t support UTF-8 collations — but I found a couple different workarounds:

  1. Declare your SqlParameter as NVarChar instead of VarChar:

    cmd.Parameters.Add("@MyField", SqlDbType.NVarChar, 80)
    

    Now the string will be successfully passed to SQL Server as UCS-2/UTF-16. SQL Server will then automatically convert the string to UTF-8 when it inserts the string into the table.

  2. Upgrade from System.Data.SqlClient to the Microsoft.Data.SqlClient NuGet package. (The Release Notes mention UTF-8 support.) If you go with this option, you must import SQL-related classes like SqlConnection and SqlCommand from the Microsoft.Data.SqlClient namespace instead of System.Data.SqlClient:

    'Imports System.Data.SqlClient
    Imports Microsoft.Data.SqlClient
    

Analysis

In the .NET Framework, the internal TdsParser class is responsible for sending queries to SQL Server in binary format. The class contains the following lines of code (2410-2411):

_defaultCodePage = Int32.Parse(stringCodePage, NumberStyles.Integer, CultureInfo.InvariantCulture);
_defaultEncoding = System.Text.Encoding.GetEncoding(_defaultCodePage);

This suggests that for all Latin1_General collations, TdsParser encodes strings using code page 1252, which is wrong for UTF-8 collations and produces garbage bytes.

Michael Liu
  • 52,147
  • 13
  • 117
  • 150
  • The full solution was: - Change the connection string to drop the "Provider=System.Data.SQLClient" part - Add the latest Microsoft.Data.SQLClient package to the project - Replace all references to System.Data.SQLClient with Microsoft.Data.SQLClient - Remove any code which relied on EnterpriseLibrary.Data (not in above example, but relevant to using UTF-8 in general for the project). The tricky part is that System.Data.SQLClient _ALMOST_ works with UTF-8 -- in fact it works in all cases I've found other than the one where single characters are surrounded by single or double smart quotes – pbickford Jul 07 '23 at 18:02
0

Using a Stored Procedure is far safer and faster than using commands within the code.

Anyway, try it:

Using cmd As New SqlCommand(SQL, conn)
    cmd.CommandType = "INSERT INTO MyTable(MyField1, MyField2, ...) VALUES (@Data1, @Data2,...)"
    cmd.Parameters.AddWithValue("@MyField1", Data1)
    cmd.Parameters.AddWithValue("@MyField2", Data2)
    Result = cmd.ExecuteScalar()
    <close your connection>
End Using
David BS
  • 1,822
  • 1
  • 19
  • 35
-1

you need to use a procedure for that the basic problem here that is causing the problem is that having ("') causes sql injection in simple terms infect the sql query so the best way to prevent from the same issue is using procedure instead of trying to fit it in sql query

  • 1
    This answer is wrong. The OP is using parameters instead of string concatenation, so SQL injection is impossible. (It is a myth that the only way to avoid SQL injection is to use procedures.) – Michael Liu Jul 06 '23 at 20:39