
As output from another service, I have received many Excel files that have a ".csv" filename extension but are saved as "Unicode Text (*.txt)".

To my knowledge they don't contain any Unicode characters, so I am not worried about data loss, and if I were producing the data I would not have saved it this way. However, I need to process hundreds of these files, and I have been unable to import them using Python (specifically Python 3).

I have tried many different options, such as the csv module, pandas.read_csv(), and pyexcel.get_sheet(), to import the Excel file directly, without any success. The attempts often fail with errors such as

"... can't decode byte 0xff in position 0: invalid start byte".

I can manually save the file in Excel with a ".csv" extension and the "CSV (Comma delimited) (*.csv)" file type, which can then be imported (e.g., with pyexcel.get_sheet()), but I can't figure out how to do this programmatically.

I can also manually open the original file in Notepad and save it as a text file with a ".txt" extension and ANSI encoding, which allows me to import the data using numpy.loadtxt(). This isn't ideal because it is still manual. Additionally, I don't know why I need to convert to ANSI encoding and can't read the file as UTF-8 using

numpy.loadtxt(file_name, encoding="UTF-8"),

which results in an error such as

"... ValueError: could not convert string to float: '\ufeff ..."

and the following error when using just numpy.loadtxt(file_name):

"... ValueError: could not convert string to float: ' ..."

In summary, my main goal is to find a programmatic way to change the Excel file type/encoding to something I can import into Python 3 using existing packages with CSV support. Additionally, if someone has any idea why numpy.loadtxt can't import a text file with UTF-8 encoding (but can with ANSI encoding), that would be great too! Any knowledge that helps me understand the problem (or my misconception of it) is appreciated.

jpantina
  • `loadtxt` (and `genfromtxt`) defaults to parsing the data as floats - it tries to convert each string to a number. That isn't a unicode or bytestring issue. You need to specify the appropriate `dtype`. For a start I'd suggest `dtype=None`, which lets it deduce the right dtype for each column. You may also need to handle the header lines. – hpaulj Apr 18 '18 at 21:59
  • @hpaulj: I appreciate the suggestion; however, this did not solve the problem when trying to load the text files with UTF-8 encoding. The same ValueError occurs when using dtype=None. The data in the text file does not have any headers and all values are floats. – jpantina Apr 18 '18 at 22:24
  • Without a sample file, it'll be hard to help you. If, for example, you posted a few lines of the file, I could copy-n-paste that and make my own file to test. – hpaulj Apr 18 '18 at 22:29
  • A recent utf8 csv example: https://stackoverflow.com/questions/49859957/importing-csv-embedding-special-character-with-numpy-genfromtxt – hpaulj Apr 18 '18 at 22:37
  • @hpaulj: I just made a text file with "1\t2\n3\t4\n" and saved it as .txt with UTF-8 encoding, and I get the same error as with my other data. After using Save As to change the file to ANSI encoding, it loads with no problems using np.loadtxt. This doesn't really solve my problem anyway, since I was doing it manually. – jpantina Apr 18 '18 at 22:51
  • `genfromtxt` puts `nan` in the slots that have text that can't be decoded as floats. – hpaulj Apr 18 '18 at 23:11

3 Answers


Your file has a UTF-8 BOM at the front. Python can strip this automatically with the utf_8_sig codec:

numpy.loadtxt(file_name, encoding="utf_8_sig")
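
A minimal sketch to verify this behavior (the demo file name is hypothetical): write a small file with a UTF-8 BOM, the way Notepad's "UTF-8" option does, then read it back with the BOM-stripping codec:

import numpy as np

# Write two tab-separated rows preceded by a UTF-8 BOM, mimicking
# Notepad's "UTF-8" save option (hypothetical demo file).
with open('demo.txt', 'w', encoding='utf_8_sig') as f:
    f.write('1\t2\n3\t4\n')

# Plain encoding="UTF-8" would leave '\ufeff' glued to the first field;
# utf_8_sig strips the BOM before numpy parses the floats.
print(np.loadtxt('demo.txt', encoding='utf_8_sig'))
# [[1. 2.]
#  [3. 4.]]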
Mark Ransom
  • OP got FF as the first byte, so `utf16`. UTF-8 BOM starts with EF. – Mark Tolonen Apr 19 '18 at 01:50
  • @roeland No, "can't decode byte 0xff in position 0: invalid start byte". The first byte read from the file was FF. That's a UTF-16 BOM. See my answer which reads a Unicode Text-saved Excel file. – Mark Tolonen Apr 19 '18 at 02:25
  • @MarkTolonen I'm looking at the two error messages at the end, one of which clearly is the BOM as a single Unicode character and the other which is the 3 byte UTF-8 equivalent. There's nothing in the question that indicates these files originated in Excel. It will be interesting to see which of us is right. – Mark Ransom Apr 19 '18 at 02:30
  • The first two lines of the question: `I have received many Excel files that have a ".csv" filename extension but are saved as "Unicode Text (*.txt)".` That last part is a save option in Excel, which I used in my answer to generate the file. Also, the last half of the question states the OP attempted "manually save the file in Excel with a ".csv" extension and a "CSV (Comma delimited)(*.csv)" which resulted in the later errors. – Mark Tolonen Apr 19 '18 at 02:43
  • @MarkTolonen How did I miss that? Must be time to give up for the night. – Mark Ransom Apr 19 '18 at 02:58
  • Yes, the OP should remove the last 50% of the question. The first part is a clear description of the problem and the last part is an XY problem :^) – Mark Tolonen Apr 19 '18 at 03:01
  • @MarkRansom: Thanks for the answer to the second part of the question. Sorry for the confusion about multiple questions in the same post; however, the responses that resulted aided my understanding of the problem. – jpantina Apr 19 '18 at 16:24

Excel's "Unicode Text" is UTF-16-encoded and tab-delimited. With the csv module, use:

import csv

# Excel's "Unicode Text" output is UTF-16 with tab delimiters
with open('book1.txt', encoding='utf16') as f:
    r = csv.reader(f, delimiter='\t')
    for row in r:
        print(row)

With the following Excel sheet saved as "Unicode Text" (Excel 2016):

[Image: Excel table of English/Chinese]

This produces (Python 3.6):

['English', 'Chinese']
['Mark', '马克']
['American', '美国人']

Pandas also works (though oddly it needed the hyphenated name utf-16 rather than utf16):

>>> import pandas as pd
>>> pd.read_csv('book1.txt',delimiter='\t',encoding='utf-16')
    English Chinese
0      Mark      马克
1  American     美国人
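
Since the main goal in the question is a programmatic conversion of hundreds of files, here is a minimal batch sketch along the same lines (the directory and glob pattern are assumptions): it rewrites each UTF-16, tab-delimited file as a plain UTF-8, comma-delimited CSV that any of the readers the question mentions can then import:

import csv
import glob

# Hypothetical source pattern; adjust to wherever the files live.
for src in glob.glob('incoming/*.csv'):
    dst = src + '.converted.csv'
    # Excel's "Unicode Text" is UTF-16 with tabs; rewrite as UTF-8 with commas.
    with open(src, encoding='utf16', newline='') as fin, \
         open(dst, 'w', encoding='utf8', newline='') as fout:
        writer = csv.writer(fout)
        writer.writerows(csv.reader(fin, delimiter='\t'))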
Mark Tolonen

The default dtype is float. genfromtxt puts nan wherever it can't parse the string as a number:

In [8]: np.genfromtxt(['one, 2, 3'], delimiter=',',encoding=None)
Out[8]: array([nan,  2.,  3.])

loadtxt raises an error:

In [9]: np.loadtxt(['one, 2, 3'], delimiter=',',encoding=None)   
....
ValueError: could not convert string to float: 'one'
In [10]: np.loadtxt(['1, 2, 3'], delimiter=',',encoding=None)
Out[10]: array([1., 2., 3.])

genfromtxt with dtype=None:

In [12]: np.genfromtxt(['one, 2, 3'], delimiter=',',encoding=None,dtype=None)
Out[12]: array(('one', 2, 3), dtype=[('f0', '<U3'), ('f1', '<i8'), ('f2', '<i8')])

From your comment:

In [13]: np.genfromtxt('1\t2\n3\t4\n'.splitlines(), delimiter='\t', encoding=None)
Out[13]: 
array([[1., 2.],
       [3., 4.]])

In [21]: print('1\t2\n3\t4\n')
1   2
3   4

Using encoding='utf8' works as well.
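
Note this assumes a BOM-free file: a UTF-8 file written with a BOM, as Notepad's "UTF-8" option produces, still fails with the '\ufeff' error from the question until the BOM is stripped. A minimal sketch reproducing that (the file name is hypothetical):

import numpy as np

# Write the same data the way Notepad's "UTF-8" option does: with a BOM.
with open('bom.txt', 'w', encoding='utf-8-sig') as f:
    f.write('1\t2\n3\t4\n')

try:
    np.loadtxt('bom.txt', encoding='utf-8')
except ValueError as e:
    print(e)  # e.g. could not convert string to float: '\ufeff1'

# Stripping the BOM with the utf-8-sig codec loads cleanly.
print(np.loadtxt('bom.txt', encoding='utf-8-sig'))
# [[1. 2.]
#  [3. 4.]]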

hpaulj
  • While useful knowledge, I respectfully can't accept this as an answer. It doesn't address my original question regarding file types. Additionally, your genfromtxt example isn't the same: the "\t" and "\n" were just shorthand to represent the data in the file, which should be saved as a text file with UTF-8 encoding to replicate my issue. – jpantina Apr 18 '18 at 23:39
  • I demonstrated this in a Py3 IPython session, so the strings are unicode. In earlier numpy versions I would have used bytestrings, `b'1,2'`. A utf8 file with the same values works the same. – hpaulj Apr 19 '18 at 00:25