UnicodeDecodeError in pymssql library of Python

Question

I'm using pymssql to get some data from the SQL server and store the results in a pandas dataframe. When I try to select a column that contains utf-8 (Farsi) characters, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte

But everything is fine with other columns in the database.

This is my code snippet and I'm running the code with python3.6:

import pymssql
import pandas as pd

conn = pymssql.connect(server, username, password, database)
cursor = conn.cursor(as_dict=True)

cursor.execute("""
 SELECT id, title
 FROM products
""")


df = pd.DataFrame(columns=['id', 'title'])

for row in cursor:
    df = df.append(row, ignore_index=True)


conn.close()

Is [title] the column with the Farsi characters? If so, does `SELECT DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME='products' AND COLUMN_NAME='title'` return 'nvarchar'? — Gord Thompson, Jun 26 '18 at 12:52
@GordThompson Yes, the [title] contains the Farsi characters. The query you mentioned returns empty result. — Saeed Esmaili, Jun 27 '18 at 05:19
Okay, can you use SSMS to verify that the [title] column is indeed of type `nvarchar`? — Gord Thompson, Jun 27 '18 at 20:26
Also, are the table and column names really 'products' and 'title' or are they something more like N'محصولات' and N'عنوان'? — Gord Thompson, Jun 27 '18 at 23:24
@GordThompson I'm using Datagrip on Ubuntu to access the database. The column names are like "orderId", "UserId" and so on. They are English. — Saeed Esmaili, Jul 02 '18 at 10:16
[This page](https://www.jetbrains.com/datagrip/features/dbobjects.html) indicates that DataGrip can display the column names and types for a given table. What does it show for the particular column in your table? — Gord Thompson, Jul 03 '18 at 00:09

score 2 · Answer 1 · answered Jun 27 '18 at 13:02

Are you 100% certain the data is stored in UTF-8? Running the command SELECT SERVERPROPERTY('Collation'); should help you determine how data is encoded in the database.

I think the default encoding is Latin-1 and that would mean 0xCA is "capital E circumflex (Ê)".

You can configure pymssql to access the database using that encoding by adding the argument charset="ISO-8859-1" to your connection arguments.

score 2 · Answer 2 · answered Feb 23 '22 at 14:20

First of all on connection define charset, if it is not UTF-8, next, nvarchar and varchar. nvarchar stored utf-8, varchar not.

In my case, (i have this and this on table) I connect:

conn = pymssql.connect(host='xxx', user='xxx',
                   password='xxx', database='xxx',charset='CP1250')
cur = conn.cursor(as_dict=True)

and in query must convert nvarchar to varchar:

cur.execute(
        """SELECT convert(varchar, column) as column 
         FROM table 
         """)

score 0 · Answer 3 · answered Nov 14 '22 at 20:32

0

The problem is in pymysql.

I connenct by pyodbc and all problems disappeared

answered Nov 14 '22 at 20:32

Kordian Zadrożny

45
7

UnicodeDecodeError in pymssql library of Python

3 Answers3