2

I'm using the below Python script to import the CSV file which is placed in the server to a PostgreSQL table.

But i'm getting the below error.

Error while fetching data from PostgreSQL COPY from stdin failed: error in .read() call: UnicodeDecodeError 'utf-8' codec can't decode byte 0xdf in position 1237: invalid continuation byte

The CSV file is in "ufl.csv: ISO-8859 text, with very long lines" and my server is in UTF encoding, So can anyone suggest or help me to modify the below script without explicitly converting the CSV file to UTF encoding.Can this be done by code?

The below code is working fine, if i convert the encoding of the CSV file to UTF.

import csv
import psycopg2
import time
import os
from datetime import datetime
import shutil

# File path.
filePath='''/Users/linu/Downloads/ufl.csv'''
dirName = '/Users/linu/Downloads/ufl_old_files/'

try:
  conn = psycopg2.connect(host="localhost", database="postgres", user="postgres", password="postgres", port="5432")

  print('DB connected')

except (Exception, psycopg2.Error) as error:
        # Confirm unsuccessful connection and stop program execution.
        print ("Error while fetching data from PostgreSQL", error)
        print("Database connection unsuccessful.")
        quit()

# Check if the CSV file exists.
if os.path.isfile(filePath):
 try:
     print('Entered loop')   
     sql = "COPY %s FROM STDIN WITH DELIMITER AS ';'  csv header"
     file = open('/Users/linu/Downloads/ufl.csv', "r")
     table = 'staging.ufl_tracking_details'

     with conn.cursor() as cur:
        cur.execute("truncate " + table + ";")
        print('truncated the table')
        cur.copy_expert(sql=sql % table, file=file)
        print('Data loaded')
        conn.commit()
        cur.close()
        conn.close()

 except (Exception, psycopg2.Error) as error:
        print ("Error while fetching data from PostgreSQL", error)
        print("Error adding  information.")
        quit()

 if not os.path.exists(dirName):
    os.mkdir(dirName)
    print("Directory " , dirName ,  " Created ")
 else:    
    print("Directory " , dirName ,  " already exists")
 tstamp = os.path.getmtime(path)   
 timestamp_name=str(time.time())
 os.rename(filePath,dirName + timestamp_name+'.csv')

else:
    # Message stating CSV file could not be located.
    print("Could not locate the CSV file.")
    quit()

Have gone through posts and used "copy_expert" as mentioned in few and tried some other resolutions as well, but none of them worked out . Any hint or suggestions will be of great help.

Note : requirement to port the CSV file and once porting is done move the copied CSV file to a folder and renaming it to name + timestamp.

Thanks in advance

Linu
  • 589
  • 1
  • 10
  • 23
  • 1
    Have you tried opening the file with the correct encoding? `file = open(filePath, encoding='latin_1')`. Note, ISO-8859 is rather a series of standards with part one (ISO-8859-1) being most widely used. In Python, ISO-8859-1 is aliased to latin-1, hence I used it here. – shmee Jul 02 '19 at 06:21
  • @shmee It worked, Thanks a lot :) can you post this as an answer? So that i can mark the same. – Linu Jul 02 '19 at 07:53
  • @shmee Can you post this as an answer please. – Linu Jul 02 '19 at 15:09
  • Sorry, I wanted to do some more research before posting the answer and didn't find the time earlier :) – shmee Jul 02 '19 at 18:28

1 Answers1

2

The UnicodeDecodeError raised in the cursor indicates an encoding mismatch. Apparently the file contains at least one German sharp s (ß). In Latin-1, (ISO-8859-1) and other encodings, e.g. Cp1252, this is encoded as 0xdf while in UTF-8 it is encoded as 0xc3 0x9f, hence UTF-8 can't decode the Latin-1-encoded character.

print(b'\xc3\x9f-'.decode("utf-8"))
# ß-
print(b'\xdf-'.decode("utf-8"))
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 0: invalid continuation byte

Note: The hyphen (-) was added to force the invalid continuation byte error. Without it, the second print would raise the UnicodeDecodeError for unexpected end of data.

These errors raise because UTF-8's first byte is exhausted at 0x7f and 0xdf is in the range for two-byte encoded characters. UTF-8 expects one more byte to decode a character in this range.
See also this Q&A.

If you don't provide an encoding to the open() call, the encoding is determined via locale.getpreferredencoding(False), which appears to return UTF-8 for you.

You have to pass the file's encoding to the open() call:

file = open(filePath, encoding="latin_1") 
shmee
  • 4,721
  • 2
  • 18
  • 27