python-pptx not able to extract text from certain Powerpoint Presentations, but others work fine

Question

I'm trying to extract the text from a large directory of .pptx files. The script below works perfectly for some PowerPoint presentations:

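import glob

from pptx import Presentation

# Append the text of every shape in every .pptx file in the current directory.
with open("Scraped PPTX Data.txt", "a", encoding="utf-8") as f:
    for eachfile in glob.glob("*.pptx"):
        prs = Presentation(eachfile)
        for slide in prs.slides:
            for shape in slide.shapes:
                # Pictures and some other shape types have no .text attribute.
                if hasattr(shape, "text"):
                    f.write(shape.text)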

Yet on many others (seemingly the very large ones) I get this long traceback:

  File "C:\Users\GLD-POS3\Desktop\SIGNS\PPT_Scraper.py", line 9, in <module>
    prs = Presentation(eachfile)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 37, in from_file
    phys_reader, pkg_srels, content_types
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 70, in _load_serialized_parts
    for partname, blob, srels in part_walker:
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 106, in _walk_phys_parts
    phys_reader, part_srels, visited_partnames
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 106, in _walk_phys_parts
    phys_reader, part_srels, visited_partnames
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\pkgreader.py", line 103, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pptx\opc\phys_pkg.py", line 111, in blob_for
    return self._zipf.read(pack_uri.membername)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 1432, in read
    return fp.read()
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 885, in read
    buf += self._read1(self.MAX_N)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 989, in _read1
    self._update_crc(data)
  File "C:\Users\GLD-POS3\AppData\Local\Programs\Python\Python37-32\lib\zipfile.py", line 917, in _update_crc
    raise BadZipFile("Bad CRC-32 for file %r" % self.name)
zipfile.BadZipFile: Bad CRC-32 for file 'ppt/media/image170.jpeg'

Try just opening that file with `python-pptx` and see if you get the same error. I'm betting you do. Sounds like a corrupted .pptx file. Check opening (and perhaps repairing) it with PowerPoint and see if that doesn't fix-up the file. — scanny, Apr 03 '20 at 21:11
Opened one of the pptx's with python-pptx, and it was allowed. Some of the pptx do appear to be corrupt, but I just received the same error with one that is confirmed not corrupt. — ITboi49, Apr 03 '20 at 21:30

score 1 · Answer 1 · answered Apr 03 '20 at 21:35

When you see a Python exception, you should generally check the end first. In this case it says:

Bad CRC-32 for file 'ppt/media/image170.jpeg'

The thing to know here is that a pptx file is basically just a zip-file with a fancy name.

Try running python -m zipfile -l filename.pptx. That should list the contents of the .pptx file. In general, a .pptx file contains a bunch of XML files plus images and other media files.

From the error message you can see that the checksum (CRC = cyclic redundancy check) calculated for image170.jpeg doesn't match the value stored in the zipfile.
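To find out which members of a given file fail that check, something like the following might help. It is only a sketch: find_bad_members is a made-up helper name, and it relies on the fact that zipfile verifies the CRC once a member has been read to the end.

```python
import zipfile
import zlib


def find_bad_members(path):
    """Return the names of archive members that cannot be read cleanly."""
    bad = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as member:
                    # Read to the end in 1 MiB chunks; the CRC is checked at EOF.
                    while member.read(1 << 20):
                        pass
            except (zipfile.BadZipFile, zlib.error):
                # BadZipFile covers CRC mismatches; zlib.error covers
                # compressed data that is itself damaged.
                bad.append(info.filename)
    return bad
```

If the list only ever contains entries under ppt/media/, the slide XML is probably still readable even though python-pptx refuses to load the package.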

AFAICT, there is no way of telling a ZipFile to ignore CRC errors.

The thing is, when extracting text, you probably only need to read the ppt/slides/slideN.xml files inside the zip file. You don't need to access the images at all.

Try opening the invalid files using zipfile.ZipFile and manually extracting the text from the XML files in ppt/slides.
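A rough sketch of that approach, using only the standard library. The names slide_texts, A_NS and SLIDE_RE are my own, not part of python-pptx. It assumes the usual DrawingML namespace, treats every a:t element as text, and orders slides by the number in the part name, which is not guaranteed to match the order shown in PowerPoint.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# DrawingML namespace used for text paragraphs (a:p) and text runs (a:t).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_texts(path):
    """Yield (slide_number, text) per slide part without reading any media."""
    with zipfile.ZipFile(path) as zf:
        numbered = []
        for name in zf.namelist():
            match = SLIDE_RE.fullmatch(name)
            if match:
                numbered.append((int(match.group(1)), name))
        # Sort numerically so that slide10 comes after slide2.
        for number, name in sorted(numbered):
            root = ET.fromstring(zf.read(name))
            paragraphs = []
            for para in root.iter(A_NS + "p"):
                runs = [t.text or "" for t in para.iter(A_NS + "t")]
                paragraphs.append("".join(runs))
            yield number, "\n".join(paragraphs)
```

Since only the slide parts are read, a bad CRC on something like ppt/media/image170.jpeg never gets checked. You could loop over glob.glob("*.pptx"), call slide_texts on each file and write the text out, the same way the original script does.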

So what does this mean is wrong with my PowerPoints? I have hundreds of these that I need to strip data from. — ITboi49, Apr 13 '20 at 21:31
@ITboi49 Either that or there is something wrong with the way Python calculates the CRC. — Roland Smith, Apr 13 '20 at 21:39
