As output from another service, I have received many Excel files that have a
".csv" filename extension but are saved as "Unicode Text (*.txt)".
To my knowledge they don't contain any Unicode characters, so I am not worried about data loss, and if I were producing the data I would not have saved it this way. However, I need to process hundreds of these files, and I have been unable to import them using Python (specifically Python 3).
I have tried many different options, such as the
csv module, pandas.read_csv(), and pyexcel.get_sheet(),
to import the files directly, without any success. These attempts often fail with errors such as
"... can't decode byte 0xff in position 0: invalid start byte".
I can manually save the file in Excel with a ".csv" extension and the "CSV (Comma delimited) (*.csv)" file type, which can then be imported (e.g. with pyexcel.get_sheet()), but I can't figure out how to do this conversion programmatically.
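To illustrate what I am after, here is a sketch of the conversion I would like to automate. The "utf-16" encoding and the tab delimiter are guesses on my part (the leading 0xff byte made me wonder about a byte-order mark), not something I have confirmed:

    import csv

    def convert_to_plain_csv(src_path, dst_path, src_encoding="utf-16"):
        # Guess: Excel's "Unicode Text" output is tab-delimited UTF-16.
        # Rewrite it as a plain comma-delimited UTF-8 CSV.
        with open(src_path, newline="", encoding=src_encoding) as src, \
             open(dst_path, "w", newline="", encoding="utf-8") as dst:
            rows = csv.reader(src, delimiter="\t")
            csv.writer(dst).writerows(rows)

If something along these lines is the right idea, I would still like to know how to determine the source encoding reliably.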
I can also manually open the original file in Notepad and save it as a text file with a ".txt" extension and ANSI encoding, which then lets me import the data using numpy.loadtxt(). This isn't ideal because it is also manual. Additionally, I don't understand why I need to convert to ANSI encoding and can't use UTF-8, i.e. why reading the UTF-8 copy with
numpy.loadtxt(file_name, encoding="UTF-8")
results in an error such as
"... ValueError: could not convert string to float: '\ufeff ..."
while plain numpy.loadtxt(file_name) on the same file gives
"... ValueError: could not convert string to float: ' ..."
In summary, my main goal is to find a programmatic way to change the Excel file type/encoding to something I can import into Python 3 using existing packages with CSV support. Additionally, if someone has any idea why numpy.loadtxt() can't import a text file with UTF-8 encoding (but can with ANSI encoding), that would be great too! Any knowledge that helps me understand the problem (or my misconception of the problem) is appreciated.