Python script to iterate through PDFs in a directory and find a matching line

Question

Currently I get all my reports delivered to me via email, attached as PDFs. I have set Outlook to automatically download those files to a certain directory every day. Sometimes those PDFs don't have any data in them and contain only the line "There is no data to present that matches the selection criteria". I would like to create a Python program that iterates through every PDF file in that directory, opens it, and looks for those words. If a file contains that phrase, the program should delete that PDF; if not, it should do nothing. With help from Reddit I have pieced together the code below:

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open("{}/{}".format(directory,file), 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)

I have tested with 3 files, one of them containing the matching phrase. No matter how the files are named or what order they are in, it fails. I have also tested it with a single file in the directory, named 3.pdf. Below is the error I get.

FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'

This would reduce my workload dramatically and be a great learning example for me as a newbie. All help and criticism welcome.

File path manipulation using string replacement generally results in typos like this. Try using `os.path.join(path, *paths)`, documented here: https://docs.python.org/2/library/os.path.html – jsmiao, Jun 14 '17 at 19:35
Here is my new code: https://repl.it/Ilkx/0. It gives a new error message, which could be progress. The error is 'TypeError: expected str, bytes or os.PathLike object, not module', which I am certain is because I have no idea what I am doing. – user3487244, Jun 14 '17 at 20:00
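
To illustrate the `os.path.join` suggestion above: `os.listdir` returns bare file names, not full paths, so a bare name such as '3.pdf' is looked up relative to the current working directory rather than the reports folder. Here is a minimal sketch of the difference. The temporary directory and the placeholder file are stand-ins invented for the example, not the real reports folder.

```python
import os
import tempfile

# Hypothetical stand-in for the reports folder, so the example runs anywhere.
directory = tempfile.mkdtemp()
with open(os.path.join(directory, "3.pdf"), "wb") as f:
    f.write(b"placeholder")  # not a real PDF, only a file to list

# os.listdir gives back names only, with no directory part.
name = os.listdir(directory)[0]

# Joining the directory and the name yields a path that works from anywhere.
full_path = os.path.join(directory, name)

# The joined path points at the file; the bare name would be resolved
# against the current working directory instead.
full_path_found = os.path.exists(full_path)
```

Any call that touches the file (`open`, `os.remove`, and so on) needs `full_path`, not `name`, unless the script happens to run from inside that directory.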

score 2 · Answer 1 · answered Jun 14 '17 at 20:04

See below:

import PyPDF2
import os

directory = 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\'
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open(os.path.join(directory,file), 'rb') as pdfFileObj:  # Changes here
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        pageObj = pdfReader.getPage(0)
        if "There is no data to present that matches the selection criteria" in pageObj.extractText():
            print("{} was removed.".format(file))
            os.remove(file)

jsmiao

That produced the error "FileNotFoundError: [WinError 2] The system cannot find the file specified: '3.pdf'" – user3487244 Jun 14 '17 at 20:14
Looks like you need to specify the full filepath for `os.remove(file)`. Try `os.remove(os.path.join(directory,file))` and see if it works. – jsmiao Jun 14 '17 at 20:35
Getting closer! "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\jmoorehead\\Desktop\\A2IReports\\3.pdf'" – user3487244 Jun 14 '17 at 20:48
Close the other process that's using it. It's probably open. – jsmiao Jun 14 '17 at 20:51
Interestingly it said "2.pdf was removed." but it wasn't. Then at the end of the trace it said "FileNotFoundError: [WinError 2] The system cannot find the file specified: '2.pdf'" – user3487244 Jun 14 '17 at 21:08
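
Pulling the comment thread together, there seem to be two separate problems. First, `os.remove(file)` receives a bare file name and so needs the joined path. Second, the delete runs inside the `with` block, while the script itself still holds the PDF open, and Windows refuses to delete an open file. That is the most likely source of the PermissionError. The "was removed." message is also printed before the delete is attempted, which would explain why it appeared for a file that was still there.

Below is a rough sketch of one way to restructure the script, not a tested drop-in fix. The names `PHRASE`, `first_page_text` and `remove_empty_reports` are made up for illustration. The PyPDF2 calls are the ones used in the answer; newer PyPDF2 releases renamed them (`PdfReader`, `reader.pages[0]`, `extract_text()`), so adjust for the installed version.

```python
import os

PHRASE = "There is no data to present that matches the selection criteria"


def first_page_text(path):
    """Return the text of the first page, using the PyPDF2 calls from the answer."""
    import PyPDF2  # imported lazily so the directory logic works without it

    # The file is closed as soon as this with block exits, so the caller
    # can delete it afterwards without hitting WinError 32.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return reader.getPage(0).extractText()


def remove_empty_reports(directory, extract_text=first_page_text, phrase=PHRASE):
    """Delete each PDF in `directory` whose extracted text contains `phrase`.

    Returns the names of the files that were actually deleted.
    """
    removed = []
    for name in os.listdir(directory):
        if not name.endswith(".pdf"):
            continue
        full_path = os.path.join(directory, name)
        # extract_text has returned (and closed the file) before os.remove runs.
        if phrase in extract_text(full_path):
            os.remove(full_path)  # full path, not the bare name
            removed.append(name)  # recorded only after the delete succeeded
    return removed
```

With the directory from the question, this would be called as `remove_empty_reports('C:\\Users\\jmoorehead\\Desktop\\A2IReports\\')`, printing the returned names if a log of deletions is wanted. Passing the text extractor in as a parameter is only a convenience for trying the directory logic on dummy files. If the phrase can appear on a page other than the first, the extractor would need to loop over all pages.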
