-1

I'm using pymssql to get some data from the SQL server and store the results in a pandas dataframe. When I try to select a column that contains utf-8 (Farsi) characters, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte

But everything is fine with other columns in the database.

This is my code snippet and I'm running the code with python3.6:

import pymssql
import pandas as pd

conn = pymssql.connect(server, username, password, database)
cursor = conn.cursor(as_dict=True)

cursor.execute("""
 SELECT id, title
 FROM products
""")


df = pd.DataFrame(columns=['id', 'title'])

for row in cursor:
    df = df.append(row, ignore_index=True)


conn.close()
Saeed Esmaili
  • 764
  • 3
  • 12
  • 34
  • Is [title] the column with the Farsi characters? If so, does `SELECT DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME='products' AND COLUMN_NAME='title'` return 'nvarchar'? – Gord Thompson Jun 26 '18 at 12:52
  • @GordThompson Yes, the [title] contains the Farsi characters. The query you mentioned returns empty result. – Saeed Esmaili Jun 27 '18 at 05:19
  • Okay, can you use SSMS to verify that the [title] column is indeed of type `nvarchar`? – Gord Thompson Jun 27 '18 at 20:26
  • Also, are the table and column names really 'products' and 'title' or are they something more like N'محصولات' and N'عنوان'? – Gord Thompson Jun 27 '18 at 23:24
  • @GordThompson I'm using Datagrip on Ubuntu to access the database. The column names are like "orderId", "UserId" and so on. They are English. – Saeed Esmaili Jul 02 '18 at 10:16
  • [This page](https://www.jetbrains.com/datagrip/features/dbobjects.html) indicates that DataGrip can display the column names and types for a given table. What does it show for the particular column in your table? – Gord Thompson Jul 03 '18 at 00:09

3 Answers3

2

Are you 100% certain the data is stored in UTF-8? Running the command SELECT SERVERPROPERTY('Collation'); should help you determine how data is encoded in the database.

I think the default encoding is Latin-1 and that would mean 0xCA is "capital E circumflex (Ê)".

You can configure pymssql to access the database using that encoding by adding the argument charset="ISO-8859-1" to your connection arguments.

IronFarm
  • 407
  • 4
  • 5
2

First of all on connection define charset, if it is not UTF-8, next, nvarchar and varchar. nvarchar stored utf-8, varchar not.

In my case, (i have this and this on table) I connect:

conn = pymssql.connect(host='xxx', user='xxx',
                   password='xxx', database='xxx',charset='CP1250')
cur = conn.cursor(as_dict=True)

and in query must convert nvarchar to varchar:

cur.execute(
        """SELECT convert(varchar, column) as column 
         FROM table 
         """)
0

The problem is in pymysql.

I connenct by pyodbc and all problems disappeared