How to extract ALL IMAGES and text from all pptx file slides using python?

Question

I can read images from a pptx file, but not all of them. I'm unable to extract images from slides that also contain a title or other text. Here is my code; please help.

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE
import glob
import os
import codecs
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = '/usr/local/Cellar/tesseract/4.1.1/bin/tesseract'
from pytesseract import image_to_string

n=0
def write_image(shape):
    global n
    image = shape.image
    # get the raw image bytes
    image_bytes = image.blob
    # assigning file name, e.g. 'image.jpg'
    image_filename = fname[:-5]+'{:03d}.{}'.format(n, image.ext)
    n += 1
    print(image_filename)
    os.chdir("directory_path/readpptx/images")
    with open(image_filename, 'wb') as f:
        f.write(image_bytes)
    os.chdir("directory_path/readpptx")    


def visitor(shape):
    if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
        write_image(shape)

def iter_picture_shapes(prs1):
    for slide in prs1.slides:
        for shape in slide.shapes:
            visitor(shape)


file = open("directory_path/MyFile.txt","a+")
for each_file in glob.glob("directory_path/*.pptx"):
    fname = os.path.basename(each_file)
    file.write("-------------------"+fname+"----------------------\n")
    prs = Presentation(each_file)
    print("---------------"+fname+"-------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                file.write(shape.text+"\n")
    iter_picture_shapes(prs)

file.close()

The code above extracts images from slides that have no text or title, but not from slides that do have text or a title.

scanny · Accepted Answer · 2020-03-17T19:02:44.903

Try also iterating over slide masters and slide layouts. If there are "background" images, that's where they will be. The same `for shape in slide.shapes:` mechanism works on slide masters and slide layouts; they are a variant of the polymorphic Slide object with the same shape-access semantics.
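A minimal sketch of that traversal, assuming `prs` is a python-pptx `Presentation` object. The helper names `iter_shape_collections` and `iter_pictures` are made up for illustration; the picture check is the same `shape_type` test used in the question.

```python
def iter_shape_collections(prs):
    """Yield (label, shapes) for every slide, slide master, and slide layout."""
    for s_idx, slide in enumerate(prs.slides, start=1):
        yield f"slide {s_idx}", slide.shapes
    for m_idx, master in enumerate(prs.slide_masters, start=1):
        yield f"master {m_idx}", master.shapes
        # Each master carries its own collection of layouts.
        for l_idx, layout in enumerate(master.slide_layouts, start=1):
            yield f"master {m_idx}, layout {l_idx}", layout.shapes


def iter_pictures(prs):
    """Yield (label, shape) for picture shapes found anywhere in the deck."""
    # Imported here so the traversal helper above has no hard dependency.
    from pptx.enum.shapes import MSO_SHAPE_TYPE  # requires python-pptx

    for label, shapes in iter_shape_collections(prs):
        for shape in shapes:
            if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                yield label, shape
```

With the question's code, something like `for label, shape in iter_pictures(prs): write_image(shape)` would then also pick up pictures that sit on a layout or master. A layout image is reported once, not once per slide that uses the layout.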

I don't think your problem is strictly related to the presence of a title or text on the slide. Perhaps those particular slides use a layout that includes some background images. If you open the slide and clicking on the image does not select it (give it a bounding box), that indicates it is a background image and resides on the slide layout or possibly the slide master. This is how logos are commonly implemented to show up on every slide.

You may also want to consider iterating over the notes slide of each slide that has one, if it contains text and/or images you are interested in. It is uncommon to find images in the slide notes, but PowerPoint supports it.
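A small sketch of the notes-slide pass, again assuming `prs` is a python-pptx `Presentation`. The function name `iter_notes_text` is hypothetical. It checks `has_notes_slide` first because reading `slide.notes_slide` on a slide without notes creates a new, empty notes slide.

```python
def iter_notes_text(prs):
    """Yield (slide_number, notes_text) for slides with non-empty notes."""
    for idx, slide in enumerate(prs.slides, start=1):
        # Skip slides without notes so none are created as a side effect.
        if not slide.has_notes_slide:
            continue
        text_frame = slide.notes_slide.notes_text_frame
        # notes_text_frame is None when the notes slide has no
        # notes placeholder.
        if text_frame is not None and text_frame.text:
            yield idx, text_frame.text
```

For images in the notes, `slide.notes_slide.shapes` can be walked with the same picture check that is applied to `slide.shapes`.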

Another approach is to traverse the underlying .pptx package (as a Zip archive) and extract the images that way.
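A rough sketch of the Zip route using only the standard library. It assumes media files sit under `ppt/media/` inside the package, which is the conventional location but is an assumption here; the function name `extract_media` and the output directory are illustrative.

```python
import zipfile
from pathlib import Path, PurePosixPath

# Assumed location of media parts inside a .pptx package.
MEDIA_PREFIX = "ppt/media/"


def extract_media(pptx_path, out_dir):
    """Copy every file under ppt/media/ into out_dir; return written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            # Skip non-media parts and directory entries.
            if not name.startswith(MEDIA_PREFIX) or name.endswith("/"):
                continue
            # Keep only the base name, e.g. 'image1.png'.
            target = out_dir / PurePosixPath(name).name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

This returns every media part regardless of which slide, layout, or master uses it, so the link between an image and its slide is lost. The folder can also hold non-image media such as audio or video.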

edited Mar 17 '20 at 19:02

answered Mar 15 '20 at 21:08


Hi Scanny, I tried zipping my .pptx file to find the media folder, but with the following code the resulting .zip file contains only one file, i.e. my .pptx file. `import zipfile` `my_zip = zipfile.ZipFile('C:/Users/sk42550/Downloads/samplepptx1.zip', 'w')` `my_zip.write('C:/Users/sk42550/Downloads/samplepptx1.pptx', compress_type=zipfile.ZIP_DEFLATED)` `my_zip.close()` Could you please help me with this issue? – shivakumar kasha Mar 16 '20 at 06:00
A `.pptx` file is _already_ a Zip archive. Just unzip it to see what's inside. – scanny Mar 16 '20 at 16:00
Hi Scanny, could you please help to extract data from autoshapes in pptx files using python. I've tried this way. `if shape.auto_shape_type==MSO_SHAPE.RECTANGLE: TextData.append(shape.text) if shape.auto_shape_type==MSO_SHAPE.ISOSCELES_TRIANGLE: TextData.append(shape.text)` – shivakumar kasha Apr 01 '20 at 11:07
@shivakumarkasha that's a separate question. If you create a new question post for it and tag it with "python-pptx" I'll have a look. – scanny Apr 01 '20 at 17:43
