File type from pandas.DataFrame.to_excel is "Zip archive data, at least v2.0 to extract"

Question

I notice that the file type reported for an Excel file generated by pandas.DataFrame.to_excel is "Zip archive data, at least v2.0 to extract". The content type, however, is fine: content_type is application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.

In my Django project, I validate the file type before processing an uploaded file. Although the file generated by pandas.DataFrame.to_excel is a valid Excel file, the validation module rejects it because its file type is "Zip archive data, at least v2.0 to extract" instead of "Microsoft Excel 2007+".

Please let me know how I can get around this validation.

Here is the code I used to reproduce the issue, i.e. to create an Excel file whose file type is "Zip archive data, at least v2.0 to extract":

import os

import magic  # python-magic, a wrapper around libmagic
import pandas as pd

# Placeholder path of the uploaded file; the new file is written next to it.
uploaded_file_path = r'somepath'
path, filename = os.path.split(uploaded_file_path)
base_name, _extension = os.path.splitext(filename)
new_file_name = os.path.join(path, base_name + '_TESTING_BLAH_1.xlsx')

df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
                   index=['row 1', 'row 2'],
                   columns=['col 1', 'col 2'])

# Write the DataFrame to an .xlsx file.
df1.to_excel(new_file_name)

# Ask libmagic for the file type of the file that was just written.
file_type = magic.from_file(new_file_name)
print(file_type)

Maybe a libmagic issue. Have you checked if native Excel files are identified correctly and what about LibreOffice files saved as xlsx? — Chris, Jan 30 '20 at 17:23
The native files are indeed identified correctly. In fact, opening the Excel file created using Pandas and saving it once solved this problem, and that was my workaround. However, the issue continues to be that the file type created by Pandas is still "Zip archive data, at least v2.0 to extract." — Premlal Premkumar, Jan 31 '20 at 12:23
Yes indeed, I could reproduce the behaviour. The same happens with files created with OpenOffice and saved as xlsx files. Probably Pandas and OO are using an OpenSource lib which creates Excel files which are not exactly the same as native MS Excel. Currently looking for a workaround — Chris, Jan 31 '20 at 16:24

score 1 · Answer 1 · answered Jan 31 '20 at 17:39

As suspected, the behaviour seems to be related to the way the Excel files are created. Every xlsx file is a ZIP archive, and the xlsx files created by open source libraries apparently differ from those created by MS Excel in the bytes that libmagic inspects to tell an Excel file from a plain ZIP archive. A similar issue can be found here. The default database that libmagic uses evidently does not recognize those files as Excel files.

The post also describes a possible solution: you can add custom definitions to the file /etc/magic, and there is a definitions file you can copy and paste that seems to work.

So copy the contents of this msooxml file into the file /etc/magic on your computer. After doing that, the files were identified as Excel 2007 on my machine.
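
If changing the libmagic database is not an option (for example on Windows, where there is no /etc/magic), one possible workaround is to inspect the archive itself whenever libmagic only reports a generic ZIP. The following is a minimal sketch and does not come from the linked post: `looks_like_xlsx` and `REQUIRED_XLSX_PARTS` are hypothetical names, and the assumption that a workbook always contains `[Content_Types].xml` and `xl/workbook.xml` is based on the usual Office Open XML package layout.

```python
import io
import zipfile

# Parts an .xlsx package is assumed to contain (assumption based on the
# usual Office Open XML layout, not something stated in this thread).
REQUIRED_XLSX_PARTS = ("[Content_Types].xml", "xl/workbook.xml")


def looks_like_xlsx(source):
    """Return True if source is a ZIP archive containing the core xlsx parts.

    source may be a filesystem path or a seekable binary file object,
    such as an uploaded file.
    """
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
    # Rewind file objects so that later processing can read from the start.
    if hasattr(source, "seek"):
        source.seek(0)
    return all(part in names for part in REQUIRED_XLSX_PARTS)
```

In a Django view you could call `looks_like_xlsx(uploaded_file)` when the libmagic result is only "Zip archive data" and accept the upload if it returns True. Alternatively, python-magic's `magic.Magic(magic_file=...)` takes a path to a custom magic file, which might let you load the msooxml definitions without /etc/magic; that route is untested here.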

answered Jan 31 '20 at 17:39

Chris


Thank you. I am, however, unable to find `/etc/magic` on my computer. Just so you know, I had installed **python-magic** using `pip install python-magic`. The folder structure is as follows: `\venv\Lib\site-packages\magic\magic.py`, `\venv\Lib\site-packages\magic\__init__.py`, `\venv\Lib\site-packages\magic\libmagic\libmagic.dll`, `\venv\Lib\site-packages\magic\libmagic\magic.mgc`, `\venv\Lib\site-packages\magic\__pycache__\magic.cpython-37.pyc`, `\venv\Lib\site-packages\magic\__pycache__\__init__.cpython-37.pyc` – Premlal Premkumar Feb 03 '20 at 12:10
I see, you're on windows. I'm on Linux. Let me see if I find the windows equivalent. – Chris Feb 03 '20 at 13:23
Hey, Chris. I am wondering if you had a chance to find the Windows equivalent. – Premlal Premkumar Feb 05 '20 at 09:00
