10

I am trying to find a way to look in a folder and search the contents of all of the powerpoint documents within that folder for specific strings, preferably using Python. When those strings are found, I want to report out the text after that string as well as what document it was found in. I would like to compile the information and report it in a CSV file.

So far I've only come across the olefile package, https://bitbucket.org/decalage/olefileio_pl/wiki/Home. This provides all of the text contained in a specific document, which is not what I am looking to do. Please help.

kacey
  • 4
    Hi kacey! Welcome to Stack Overflow! Here at Stack Overflow, we help people fix and sometimes rewrite their existing code so that it works correctly. I'm afraid your question is a bit off-topic for the SO site. Here's how: what you're basically asking is, "How can I write some code to perform x, then y, then z?" While those types of questions can be appropriate, you should show what **you** have tried. Make an attempt at solving your problem before asking here. Who knows, you may figure it out yourself! If what you tried didn't work, we'll be more than happy to help you fix it. Good luck! – Christian Dean Sep 09 '16 at 21:57
  • Files with type ".pptx" are zip files. – Marichyasana Sep 09 '16 at 22:13
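The point about .pptx files being zip archives can be sketched with the standard library alone. This is a minimal sketch, assuming the slide text lives in `<a:t>` elements inside `ppt/slides/slideN.xml`; `slide_texts` is a made-up helper name, not part of any library.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used by the <a:t> text-run elements in slide XML.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Matches slide parts only, not the relationship files under _rels/.
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def slide_texts(pptx_path):
    """Return {slide part number: [text runs]} read straight from the archive.

    The number comes from the part's file name, which need not match the
    on-screen slide order.
    """
    texts = {}
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            match = SLIDE_PART.match(name)
            if match is None:
                continue
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter(A_NS + "t") if node.text]
            texts[int(match.group(1))] = runs
    return texts
```

This avoids any third-party dependency, at the cost of working with raw text runs rather than whole shapes or paragraphs.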

5 Answers

12

Actually working

If you want to extract text:

  • import Presentation from pptx (pip install python-pptx)
  • loop over each .pptx file in the directory (using the glob module)
  • look at every slide, and at every shape in each slide
  • if a shape has a text attribute, print shape.text

from pptx import Presentation
import glob

for eachfile in glob.glob("*.pptx"):
    prs = Presentation(eachfile)
    print(eachfile)
    print("----------------------")
    for slide in prs.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
PythonProgrammi
  • 1
    Also, if `PackageNotFoundError` is thrown, it can be fixed by passing a file object instead: `f = open(, "rb")` and then `prs = Presentation(f)`. – Viseshini Reddy Jan 08 '19 at 10:28
  • The os.listdir() command in Python 2.7 won't work unless it reads something like `os.listdir('.')`. Other than that, it worked well for me. – Tensigh Jun 02 '19 at 21:32
  • Yes, in python 2.7 you have to use os.listdir('.'). I am gonna change the code. – PythonProgrammi Jun 03 '19 at 05:06
  • 1
    This solution worked for me. The only remark is that python package is called python-pptx, so the installation command should be "pip install python-pptx". – mskoryk Nov 21 '19 at 07:38
5

tika-python

tika-python is a Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1,500 file formats.

Note: It also works charmingly with pyinstaller

Install with pip:

pip install tika

Sample:

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])  # the file's metadata
print(parsed["content"])   # the text content of the file

Link to official GitHub
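To apply this to a whole folder while skipping other file types, something along these lines might work. It is only a sketch: `powerpoint_files` and `extract_all` are made-up helper names, and the tika import sits inside the function because the package also needs a Java runtime to be available.

```python
import os

POWERPOINT_SUFFIXES = (".pptx", ".pptm")


def powerpoint_files(folder):
    """Return the PowerPoint files in folder, sorted, ignoring other types."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(POWERPOINT_SUFFIXES)
    )


def extract_all(folder):
    """Map each PowerPoint path in folder to the text Tika extracts from it."""
    from tika import parser  # pip install tika

    texts = {}
    for path in powerpoint_files(folder):
        parsed = parser.from_file(path)
        # "content" may be None when no text was extracted.
        texts[path] = parsed.get("content") or ""
    return texts
```

Filtering by extension first keeps PDFs and other documents in the same folder out of the results.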

  • This worked like a charm, thanks! I forgot to filter to pptx and it included PDFs. It read them perfectly from what I can tell so far. – Jeremy Giaco May 15 '19 at 14:09
4

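python-pptx can be used to do what you propose. At a high level, you would do something like this (a minimal sketch of the overall approach, not a complete solution):

import glob

from pptx import Presentation

# Assumes the .pptx files are in the current directory.
for pptx_filename in glob.glob("*.pptx"):
    prs = Presentation(pptx_filename)
    for slide in prs.slides:
        for shape in slide.shapes:
            # Shapes such as pictures have no text frame.
            if shape.has_text_frame:
                print(shape.text_frame.text)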

You'd need to add the bits about searching shape text for key strings and adding them to a CSV file or whatever, but this general approach should work just fine. I'll leave it to you to work out the finer points :)
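For what it's worth, the searching and CSV parts could look roughly like the following. This is a sketch under assumptions, not a definitive implementation: `text_after`, `find_matches` and `write_report` are made-up helper names, "text after the string" is taken to mean the rest of the shape's text following the first occurrence, and the CSV column names are arbitrary.

```python
import csv
import glob
import os


def text_after(text, keyword):
    """Return what follows the first occurrence of keyword, or None if absent."""
    index = text.find(keyword)
    if index == -1:
        return None
    return text[index + len(keyword):].strip()


def find_matches(folder, keywords):
    """Yield (filename, slide number, keyword, following text) for each hit."""
    from pptx import Presentation  # pip install python-pptx

    for path in sorted(glob.glob(os.path.join(folder, "*.pptx"))):
        prs = Presentation(path)
        for number, slide in enumerate(prs.slides, start=1):
            for shape in slide.shapes:
                if not shape.has_text_frame:
                    continue  # e.g. pictures; tables and groups are not visited
                for keyword in keywords:
                    found = text_after(shape.text_frame.text, keyword)
                    if found is not None:
                        yield os.path.basename(path), number, keyword, found


def write_report(rows, csv_path):
    """Write the matches to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "slide", "keyword", "text_after"])
        writer.writerows(rows)
```

With placeholder values, `write_report(find_matches(".", ["Owner:"]), "report.csv")` would scan the current directory and write one row per match.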

scanny
0

Textract-Plus

Use textract-plus, which can extract text from most document formats, including pptx and pptm. Refer to the docs.

Install-

pip install textract-plus

Sample-

import textractplus as tp
text=tp.process('path/to/yourfile.pptx')

For your case-

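import os
import pandas as pd
import textractplus as tp

files_csv = []
your_dir = '.'
for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        # Build the full path so this also works for other directories.
        text = tp.process(os.path.join(your_dir, f))
        files_csv.append([f, text])
# index=False keeps the filename as the first CSV column.
pd.DataFrame(files_csv, columns=['filename', 'text']).to_csv('your_csv.csv', index=False)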

This code fetches all the pptx and pptm files from the directory and creates a CSV with the filename in the first column and the text extracted from that file in the second.

vhx.ai
0
import os
import textract

your_dir = '.'

for f in os.listdir(your_dir):
    if f.endswith('pptx') or f.endswith('pptm'):
        # Extract and print the text of each matching file.
        text = textract.process(os.path.join(your_dir, f))
        print(text)
        
  • New answers to old, well-answered questions should contain ample explanation on how they complement the other answers. – Gert Arnold Mar 05 '22 at 08:56