Iterate over pathlib paths and python-docx: zipfile.BadZipFile

Question

My Python skills are a bit rusty, since I have recently been working mostly in R. I ran into the following problem: I want to recursively iterate over all .docx files in a directory and change some of their core attributes with the python-docx package.

For the loop, I first created a list of files with pathlib and glob:

from docx import Document
from docx.shared import Inches
import pathlib

# Reading the stats dir
root_dir = pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files

Output of files looks fine.

[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
 WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]

When I now try to read a document from the list, I get a zip error (full traceback below):

document = Document(files[1])
Traceback (most recent call last):
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-482c5438fa33>", line 1, in <module>
    document = Document(files[1])
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
    self._zipf = ZipFile(pkg_file, 'r')
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

However, running the same line of code without the list works fine (the only difference is the path separator, / versus \, which I thought should not matter because the list contains pathlib.Path objects).

document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
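
A minimal sketch of why the separator should not matter, assuming nothing beyond the standard library (`PureWindowsPath` is used so that it also runs on non-Windows systems):

```python
from pathlib import PureWindowsPath

# The same location written with backslashes and with forward slashes.
backslashes = PureWindowsPath(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx")
forward = PureWindowsPath("C:/Users/Björn/PycharmProjects/mre_docx/test1.docx")

# pathlib normalises the separator, so the two objects compare equal.
same_path = backslashes == forward
print(same_path)
```

Note that the call above opens test1.docx, while files[1] in the failing call is test2.docx, so the two calls do not read the same file.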

Edit to Comment

I created a total of 4 new Word files for this MRE. I entered text in two of them and left the other two empty. To my surprise, the empty ones are the ones that produce the error.

for file in files:
    try:
        document = Document(file)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print(f"The file: {file} appears to be corrupted")

Output:

The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted

Semi Solution to Future Readers

Add a try and except block around the call to Document("Path/to/file.docx"), and print the file for which the call failed. In my case there were just a few, which I could easily edit manually.
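
A minimal sketch of that approach, catching only `zipfile.BadZipFile` (the exception in the traceback) instead of everything. The helper name `find_unreadable` and its `opener` parameter are made up for illustration. `zipfile.ZipFile` stands in for `Document` so the sketch runs without python-docx installed, on the assumption that `Document` lets the same exception propagate, as the traceback suggests.

```python
import zipfile
from pathlib import Path


def find_unreadable(files, opener=zipfile.ZipFile):
    """Return the paths for which opener raises BadZipFile.

    find_unreadable and opener are hypothetical names; pass
    docx.Document as opener to check with python-docx itself.
    """
    failed = []
    for file in files:
        try:
            opened = opener(file)
        except zipfile.BadZipFile:
            print(f"The file: {file} appears to be corrupted")
            failed.append(Path(file))
        else:
            # ZipFile objects hold an open handle; close it if there is one.
            close = getattr(opened, "close", None)
            if close is not None:
                close()
    return failed
```

With the list from the question, `find_unreadable(files, opener=Document)` would presumably return the empty files.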

Please check if [this question](https://stackoverflow.com/questions/47719485/read-docx-file-error) is the same problem. — Morgan Nicholson, Mar 11 '22 at 00:17
I created the files just for the purpose of the MRE, how could they be corrupted? They are empty. However I manually created a few more, and tested and will edit the output above. Somehow it works only for one file?! — Björn, Mar 11 '22 at 00:53
However with the addition of `try` and `except` I found that only a very small minority of files (3) are failing, which is acceptable I just can edit these manually. — Björn, Mar 11 '22 at 01:04

score 1 · Accepted Answer · answered Mar 11 '22 at 01:15

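You are not doing anything wrong. You get this error because the documents are empty: a .docx file is a ZIP archive, and an empty file is not a valid one, so zipfile raises BadZipFile. If you open those files in Word, type something, and save them, you will not get any error.

According to the python-docx documentation (https://python-docx.readthedocs.io/en/latest/user/documents.html), you can open Word documents in several ways.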

First, create a new blank document and save it under that path (this overwrites the existing file):

document = Document()
document.save(files[1])

Second, open an existing document and save it back (opening still fails for the empty files):

document = Document(files[1])
document.save(files[1])

Also, according to the docs, you can open them from a file-like object:

with open(files[1], 'rb') as f:
    document = Document(f)
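
The empty files can also be spotted before python-docx is involved, because a .docx file is a ZIP container. The following is a rough sketch using only the standard library. The function name `split_docx_files` is hypothetical, and it assumes the failing files are either zero bytes long or otherwise not ZIP archives:

```python
import zipfile
from pathlib import Path


def split_docx_files(root_dir):
    """Sort the .docx files below root_dir into (readable, empty, not_zip)."""
    readable, empty, not_zip = [], [], []
    for path in sorted(Path(root_dir).glob("**/*.docx")):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            empty.append(path)       # zero bytes: nothing to unzip
        elif zipfile.is_zipfile(path):
            readable.append(path)    # looks like a ZIP archive
        else:
            not_zip.append(path)     # has content, but is not a ZIP archive
    return readable, empty, not_zip
```

Passing the `is_zipfile` check only means the file looks like a ZIP archive; it does not guarantee that python-docx can parse it as a Word document.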

Hey, just double checked and yes you are right. In my use case (outside of the MRE) of >500 word files, the three for which I got the error are all empty. — Björn, Mar 11 '22 at 01:21
