
I am accessing a dataset that lives on an FTP server. After I download the data, I use pandas to read it as CSV, but I get an encoding error. The file has a .csv extension, but when I open it with MS Excel the data shows up as Unicode Text. I want to convert this dataset from Unicode Text to normal CSV. How can I make this happen? Any idea how to get this done?

my attempt:

from ftplib import FTP
import os

def mydef():
    defaultIP = ''
    username = 'cat'
    password = 'cat'

    # connect and log in to the FTP server
    ftp = FTP(defaultIP, user=username, passwd=password)
    ftp.dir()

    # list the files in the current remote directory
    filenames = ftp.nlst()

    for filename in filenames:
        local_filename = os.path.join('C:\\Users\\me', filename)
        # download each file in binary mode
        with open(local_filename, 'wb') as file:
            ftp.retrbinary('RETR ' + filename, file.write)

    ftp.quit()

Then I tried this to get the correct encoding:

mydef.encode('utf-8').splitlines()

but this does not work for me either. I used this solution.

Here is an output snippet of the above code:

b'\xff\xfeF\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00\t'

Expected output:

The expected output should be normal CSV data (common trade data), but the encoding conversion does not work for me.

I tried different encodings to get a correct conversion to CSV format, but none of them works for me. How can I make this work? Any idea how to get this done? Thanks.

  • If it is a CSV file then open it in a normal text editor to see what you have. It doesn't look like a CSV file. Or maybe it doesn't use `utf-8` but another encoding, e.g. `utf-16`. `utf-16` is sometimes used on Windows. – furas Jan 14 '20 at 21:18

1 Answer


EDIT: I had to change this - now I remove 2 bytes at the beginning (the BOM) and one byte at the end, because the data is incomplete (every character needs 2 bytes).


It seems it is not `utf-8` but `utf-16` with a BOM.

If I remove the first two bytes (the BOM - Byte Order Mark) and the last byte at the end, because it is incomplete (every character needs two bytes), and use `decode('utf-16-le')`

b'F\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00'.decode('utf-16-le')

then I get

'FLOW\tCTY_RPT\tREPORTER\tCTY_PTN\tPARTNER\tCOMMODITY\tDESCRIPTION'
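
For the full file you shouldn't need to strip bytes by hand - Python's `utf-16` codec detects and skips the BOM on its own. A minimal sketch, assuming the file was saved as `C:\Users\me\data.csv` (the filename here is only a placeholder):

# read the raw bytes of the downloaded file
with open('C:\\Users\\me\\data.csv', 'rb') as f:
    raw = f.read()

# the 'utf-16' codec reads the BOM and picks the right endianness itself
text = raw.decode('utf-16')
print(text.splitlines()[0])   # 'FLOW\tCTY_RPT\tREPORTER\t...'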

EDIT: meanwhile I found also Python - Decode UTF-16 file with BOM
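
Since the goal is to load it with pandas anyway, `read_csv` can also do the decoding directly. A rough sketch, assuming the columns are tab-separated as in the snippet above (the path is again only a placeholder):

import pandas as pd

# encoding='utf-16' consumes the BOM; the data is tab-separated, not comma-separated
df = pd.read_csv('C:\\Users\\me\\data.csv', encoding='utf-16', sep='\t')
print(df.head())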

  • BOM in UTF-16 is 2 bytes, not 3. – Mark Ransom Jan 14 '20 at 21:31
  • I couldn't decode it, so I removed the third byte - but then I found the problem - the text is incomplete and I had to remove the last byte; after that I can remove the 2-byte BOM at the beginning and decode. – furas Jan 14 '20 at 21:33
  • @MarkRansom I changed it - I had to remove the last byte instead of the third byte at the beginning. – furas Jan 14 '20 at 21:40
  • @furas is there any more consistent solution instead of removing the third byte manually? The actual output is a lot bigger than the snippet. – Jerry07 Jan 14 '20 at 22:43
  • read my current answer - now I remove only two bytes at the beginning, because the BOM is 2 bytes. If you use the full data you don't have to remove a third byte - I only had to remove it because you gave incomplete data and it didn't work with the last byte. In UTF-16 every character uses 2 bytes, so the data needs an even number of bytes to decode. – furas Jan 14 '20 at 22:55