12

I have trouble reading the csv file by python. My csv file has Korean and numbers.

Below is my python code.

import csv
import codecs
csvreader = csv.reader(codecs.open('1.csv', 'rU', 'utf-16'))
for row in csvreader:
    print(row)

First, there was a UnicodeDecodeError when I enter "for row in csvreader" line in the above code.

So I used the code below then the problem seemed to be solved

csvreader = csv.reader(codecs.open('1.csv', 'rU', 'utf-16'))

Then I ran into NULL byte error. Then I can't figure out what's wrong with the csv file.

[update] I don't think I changed anything from the previous code but my program shows "UnicodeError: UTF-16 stream does not start with BOM"

When I open the csv by excel I can see the table in proper format (image attached at the botton) but when I open it in sublime Text, below is a snippet of what I get.

504b 0304 1400 0600 0800 0000 2100 6322
f979 7701 0000 d405 0000 1300 0802 5b43
6f6e 7465 6e74 5f54 7970 6573 5d2e 786d
6c20 a204 0228 a000 0200 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

If you need more information about my file, let me know!

I appreciate your help. Thanks in advance :)

csv file shown in excel

enter image description here

csv file shown in sublime text enter image description here

Pirate X
  • 3,023
  • 5
  • 33
  • 60
Py11
  • 159
  • 1
  • 2
  • 8
  • 1
    It'd be useful if you can provide the actual file you are working with (or at least a small sample of the rows) – jorgeh Mar 19 '18 at 20:35
  • @jorgeh You mean the code that I am struggling to write? – Py11 Mar 19 '18 at 20:39
  • Is that the start of the file? If so, it doesn’t have a BOM (byte order mark), so you need to explicitly specify utf-16-le or -be to let codecs decode it properly. – abarnert Mar 19 '18 at 21:27
  • Also, don’t just say “I ran into NUL byte error”, paste the actual error into your question. – abarnert Mar 19 '18 at 21:28
  • @Py11 Yeah, both the code (to see what you've tried) and also the file you are trying to read would be very helpful for people to figure out what the issue might be – jorgeh Mar 19 '18 at 21:37
  • @abarnert Thanks for the advice. (I have changed the title) Could you tell me how I know if the file is utf-16-le or -be ? – Py11 Mar 19 '18 at 21:59
  • 1
    Now that you’ve posted the whole thing—I don’t think that’s a CSV file at all; is that an old-style binary .xls or something? – abarnert Mar 19 '18 at 22:11

4 Answers4

25

The problem is that your input file apparently doesn’t start with a BOM (a special character that gets recognizably encoded differently for little-endian vs. big-endian utf-16), so you can’t just use “utf-16” as the encoding, you have to explicitly use “utf-16-le” or “utf-16-be”.

If you don’t do that, codecs will guess, and if it guesses wrong, it’ll try to read each code point backward and get illegal values.

If your posted sample starts at an even offset and contains a bunch of ASCII, it’s little-ending, so use the -le version. (But of course it’s better to look at what it actually is than to guess.)

Harsha Biyani
  • 7,049
  • 9
  • 37
  • 61
abarnert
  • 354,177
  • 51
  • 601
  • 671
  • Thanks! Is there a way to find out if my code is utf-16-le or utf-16-be? – Py11 Mar 19 '18 at 22:03
  • 1
    You’re exporting it from Excel, right? You should be able to see what you’re exporting to in the export dialog. – abarnert Mar 19 '18 at 22:10
5

Now that you’ve included more of the file in your question, that isn’t a CSV file at all. My guess is that it’s an old-style binary XLS file, but that’s just a guess. If you’re just renaming spam.xls to spam.csv, you can’t do that; you need to export it to CSV format. (If you need help with that, ask on another site that offers help with Excel instead of with programming.)

If you can’t do that for some reason, there are libraries on PyPI to parse XLS files—but if you wanted CSV, and you can export CSV, that’s a better idea.

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • Oh I see! That's why... this was a test code sent by my friend and I think he made a mistake on that... I will try with a new csv file then. Thanks a lot! – Py11 Mar 19 '18 at 22:55
2

After hours of struggling with such an issue, I came to learn that Excel exports data in Multiple CSV formats.

From Excel, please make sure to use 'CSV UTF-8 (Comma delimited)' option while exporting. (You often may want to use this type than the other CSV options).

Once you are sure of the UTF-type, in this case, 'UTF-8', go back to your python script and change encoding to 'UTF-8', though I found skipping this parameter also works.

with open('schools_dataset.csv', encoding='utf-8') as csv_file:
# continue opening the file
Denmau
  • 231
  • 2
  • 5
1

The file begins with a PKZIP signature so it is actually an XLSX file.

This is great because instead of a CSV file, where you would have to know the character encoding, headers, column types, delimiter, text quoting and escape rules, and line endings, you can just open it and programs can see the structure of the data.

Tom Blodget
  • 20,260
  • 3
  • 39
  • 72