
I am accessing a dataset that lives on an FTP server. After I download the data, I use pandas to read it as CSV, but I get an encoding error. The file has a .csv extension, but when I open it with MS Excel the data shows up as Unicode Text. I want to convert this dataset from Unicode Text to normal CSV. How can I make this happen? Any idea how to get this done?

my attempt:

from ftplib import FTP
import os

def mydef():
    defaultIP = ''
    username = 'cat'
    password = 'cat'

    # connect and log in to the FTP server
    ftp = FTP(defaultIP, user=username, passwd=password)
    ftp.dir()

    # list the files in the current remote directory
    filenames = ftp.nlst()

    for filename in filenames:
        local_filename = os.path.join('C:\\Users\\me', filename)
        # download each file in binary mode
        with open(local_filename, 'wb') as file:
            ftp.retrbinary('RETR ' + filename, file.write)

    ftp.quit()

Then I tried this to get the correct encoding:

mydef.encode('utf-8').splitlines()

but this does not work for me either. I used this solution.

Here is an output snippet of the above code:

b'\xff\xfeF\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00\t'

Expected output:

The expected output should be normal CSV data (common trade data), but the encoding conversion does not work for me.

I tried different encodings to get a correct conversion to CSV format, but none of them works for me. How can I make this work? Any idea how to get this done? Thanks.

  • If it is a CSV file then open it in a normal text editor to see what you have. It doesn't look like a CSV file. Or maybe it doesn't use `utf-8` but another encoding, e.g. `utf-16`. `utf-16` is sometimes used on Windows. – furas Jan 14 '20 at 21:18

1 Answer


EDIT: I had to change this - now I remove 2 bytes at the beginning (the BOM) and one byte at the end, because the data is incomplete (every character needs 2 bytes).


It seems it is not `utf-8` but `utf-16` with a BOM.

If I remove the first two bytes (the BOM - Byte Order Mark) and the last byte at the end, because it is incomplete (every character needs two bytes), and use `decode('utf-16-le')`

b'F\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00'.decode('utf-16-le')

then I get

'FLOW\tCTY_RPT\tREPORTER\tCTY_PTN\tPARTNER\tCOMMODITY\tDESCRIPTION'
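
For the full file you shouldn't need to strip bytes by hand - Python's `utf-16` codec detects and skips the BOM on its own. A minimal sketch, assuming the file was saved as `C:\Users\me\data.csv` (the filename here is only a placeholder):

# read the raw bytes of the downloaded file
with open('C:\\Users\\me\\data.csv', 'rb') as f:
    raw = f.read()

# the 'utf-16' codec reads the BOM and picks the right endianness itself
text = raw.decode('utf-16')
print(text.splitlines()[0])   # 'FLOW\tCTY_RPT\tREPORTER\t...'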

EDIT: meanwhile I found also Python - Decode UTF-16 file with BOM
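
Since the goal is to load it with pandas anyway, `read_csv` can also do the decoding directly. A rough sketch, assuming the columns are tab-separated as in the snippet above (the path is again only a placeholder):

import pandas as pd

# encoding='utf-16' consumes the BOM; the data is tab-separated, not comma-separated
df = pd.read_csv('C:\\Users\\me\\data.csv', encoding='utf-16', sep='\t')
print(df.head())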

  • BOM in UTF-16 is 2 bytes, not 3. – Mark Ransom Jan 14 '20 at 21:31
  • I couldn't decode it, so I removed the third byte - but then I found the problem - the text is incomplete and I had to remove the last byte; after that I can remove the 2-byte BOM at the beginning and decode. – furas Jan 14 '20 at 21:33
  • @MarkRansom I changed it - I had to remove the last byte instead of the third byte at the beginning. – furas Jan 14 '20 at 21:40
  • @furas is there any more consistent solution instead of removing the third byte manually? The actual output is a lot bigger than the snippet. – Jerry07 Jan 14 '20 at 22:43
  • read my current answer - now I remove only two bytes at the beginning, because the BOM is 2 bytes. If you use the full data you don't have to remove a third byte - I only had to remove it because you gave incomplete data and it didn't work with the last byte. In UTF-16 every character uses 2 bytes, so the data needs an even number of bytes to decode. – furas Jan 14 '20 at 22:55