I've updated the question to include the bulk of the code, as I suspect some parts of it may be blocking each other. It can be tested by simply adding a PDF file or two to your C:\temp folder (on Windows). I've only just started with Python, so I may be missing something basic.

import glob
from datetime import datetime
from pathlib import Path
import PyPDF4
from pdfrw import PdfReader, PdfWriter


def safe_open_pdf(pdf):
    pdf_reader = None
    result = True

    file = open(pdf, 'rb')
    try:
        pdf_reader = PyPDF4.PdfFileReader(file)
        result = True
    except:
        # some older PDF files on my disk raise a missing EOF error, which cannot be handled by PyPDF4
        print(pdf.split('\\')[-1] + " needs to be fixed")
        result = False

    if not result:
        # if file had EOF error, I "rebuild" it with PdfReader and PdfWriter
        x = PdfReader(pdf)
        y = PdfWriter()
        y.addpages(x.pages)
        y.write(pdf)
        pdf_reader = PyPDF4.PdfFileReader(file)

    return pdf_reader


def move_processed_pdf(source_file):
    Path(new_path).mkdir(parents=True, exist_ok=True)

    print("Copying to " + new_path + new_file)

    # use the source_file argument rather than the global loop variable
    f = open(source_file, 'rb')
    x = PdfReader(f)
    y = PdfWriter()
    y.addpages(x.pages)
    y.write(new_path + new_file)

    f.close()
    # time.sleep(5)
    Path(source_file).unlink()


if __name__ == '__main__':

    relevant_path = 'C:\\temp\\'
    file_count = 0
    new_path = 'C:\\temp\\processed\\'

    for PDFFile in glob.iglob(relevant_path + '*.pdf', recursive=True):

        new_file = datetime.today().strftime('%Y-%m-%d') + PDFFile.split('\\')[-1]

        print('Processing File: ' + PDFFile.split('\\')[-1])

        pdfReader = safe_open_pdf(PDFFile)
        file_count += 1
        num_pages = pdfReader.numPages

        print(num_pages)

        page_count = 0
        text = ''

        while page_count < num_pages:
            pageObj = pdfReader.getPage(page_count)
            page_count += 1
            text += pageObj.extractText()

        # Main processing occurs here

        move_processed_pdf(PDFFile)

The issue I get is `PermissionError: [WinError 32] The process cannot access the file because it is being used by another process.`

The folders and files exist.

Any ideas?

Spiffo
  • The latest pdfrw might read directly from a path with `PdfReader(PDFFile)`, so manual open/close may not be needed. Did you try calling `unlink` on the first line to check that you have the rights? – frost-nzcr4 Apr 22 '20 at 16:18
  • Hi, thanks for responding. Yes, I have checked the rights. This is just a snippet of the whole procedure; could the issue instead lie with `for PDFFile in glob.iglob(relevant_path + '*.pdf', recursive=True):`? – Spiffo Apr 22 '20 at 16:33
  • Make sure the file exists before unlinking, and put it in a try/except. – Eddwin Paz Apr 22 '20 at 18:12
  • Hi, I have tested both with no luck. If I put `unlink` before the `print("Copying to ...")` line, it throws the same error. And the file does exist, as it is being picked up in the loop. – Spiffo May 03 '20 at 13:29
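
One possible cause, not confirmed anywhere in this thread: `safe_open_pdf` opens the file with `open(pdf, 'rb')` and never closes it, and the returned `PdfFileReader` keeps using that handle. On Windows an open handle normally blocks deletion, which would be consistent with `WinError 32` on `unlink()`. Below is a minimal sketch of the general pattern, assuming the reading can be finished inside a `with` block before the file is moved. The names `process_and_move` and `extract` are hypothetical, and `shutil.move` is swapped in for the pdfrw copy-then-unlink step in the question.

```python
import shutil
from datetime import datetime
from pathlib import Path


def process_and_move(pdf_path, dest_dir, extract):
    """Read a file inside a `with` block, then move it once the handle is closed.

    `extract` is a hypothetical callable: it receives the open binary file
    object and returns whatever the main processing needs (for example, text
    pulled out via PyPDF4.PdfFileReader(fh)). It must finish reading before
    it returns, because the handle is closed right afterwards.
    """
    pdf_path = Path(pdf_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # The handle is closed when this block exits, even if extract() raises.
    with open(pdf_path, 'rb') as fh:
        result = extract(fh)

    # No handle from this function is open any more, so the move
    # (a rename or copy-and-delete) is not blocked by our own process.
    target = dest_dir / (datetime.today().strftime('%Y-%m-%d') + pdf_path.name)
    shutil.move(str(pdf_path), str(target))
    return result, target
```

In the question's loop this would stand in for both `safe_open_pdf` and `move_processed_pdf`, with the page-by-page `extractText()` calls moved into the function passed as `extract`. Any other process holding the file (a PDF viewer, for instance) would still cause the same error.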