I want to gather all the PDF files on my computer and extract the text from each one. The two functions I have now do that, but some PDF files give me this error:

raise PDFPasswordIncorrect 
pdfminer.pdfdocument.PDFPasswordIncorrect

I raised the error in the function that opens and reads the PDF files. That seemed to work in terms of ignoring the error, but now it's ignoring all the PDF files, including the good ones that were not an issue before.

How can I make it ignore only the PDF files that give me this error, and not every single PDF?

def pdfparser(x):
    try:
        raise PDFPasswordIncorrect(pdfminer.pdfdocument.PDFPasswordIncorrect)
        fp = open(x, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
    except (RuntimeError, TypeError, NameError,ValueError,IOError,IndexError,PermissionError):
         print("Error processing {}".format(name))

    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
        data =  retstr.getvalue()

    return(data)

def pdfs(files):
    for name in files:
        try:
            IP_list = (pdfparser(name))
            keyword = re.findall(inp, IP_list)
            file_dict['keyword'].append(keyword)
            file_dict['name'].append(name.name[0:])
            file_dict['created'].append(time.ctime(name.stat().st_ctime))
            file_dict['modified'].append(time.ctime(name.stat().st_mtime))
            file_dict['path'].append(name)
            file_dict["content"].append(IP_list)
        except (RuntimeError, TypeError, NameError, ValueError, IOError, IndexError, PermissionError):
            print("Error processing {}".format(name))
        #print(file_dict)
    return(file_dict)

pdfs(files)
EzLo
Cald0002

1 Answer


Why are you manually raising an error that would only occur if you opened a password-protected PDF without supplying the correct password?

This error is raised by your code every time!
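To make that concrete, here is a minimal sketch of why an unconditional `raise` at the top of a `try` block skips every file. It does not use pdfminer: `PasswordError`, `open_pdf`, and both `parse_*` functions are hypothetical stand-ins invented for illustration.

```python
class PasswordError(Exception):
    """Hypothetical stand-in for pdfminer's PDFPasswordIncorrect."""


def open_pdf(path, protected):
    """Hypothetical opener: raises only for paths listed as protected."""
    if path in protected:
        raise PasswordError(path)
    return "handle for " + path


def parse_with_manual_raise(path, protected):
    steps = []
    try:
        raise PasswordError(path)        # runs for every file, good or bad
        open_pdf(path, protected)        # never reached
        steps.append("opened")           # never reached
    except PasswordError:
        steps.append("skipped")
    return steps


def parse_with_except_only(path, protected):
    steps = []
    try:
        open_pdf(path, protected)        # raises only for protected files
        steps.append("opened")
    except PasswordError:
        steps.append("skipped")
    return steps
```

With the manual `raise`, every call ends in the `except` branch; without it, only the protected file does.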

Instead, you need to catch the error if it happens and skip that file. See the corrected code:

def pdfparser(x):
    try:
        # try to open your pdf here - do not raise the error yourself!
        # if it happens, catch and handle it as well
        fp = open(x, 'rb')

    except PDFPasswordIncorrect as e:      # catch PDFPasswordIncorrect
        # (list the other errors you want to handle here as well)
        print("Error processing {}: {}".format(x, e))
        # no sense in doing anything else if you got an error up to here
        return None

    # do something with your pdf and collect data
    data = []

    return data


def pdfs(files):
    for name in files:
        try:
            IP_list = pdfparser(name)

            if IP_list is None:             # unable to read for whatever reason
                continue                    # process next file

            # do stuff with your data if you got some

        # most of these errors are already handled inside pdfparser
        except (RuntimeError, TypeError, NameError, ValueError,
                IOError, IndexError, PermissionError):
            print("Error processing {}".format(name))

    return file_dict

pdfs(files)

The second try/except, in def pdfs(files):, can be shrunk down: all the file-related errors happen inside def pdfparser(x): and are handled there. The rest of your code is incomplete and references things I do not know about:

file_dict
inp
name # used as filehandle for .stat() but is a string etc
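Since the original code cannot be run as posted, the following is only a sketch of the overall shape under some assumptions: `extract_text`, `collect`, and `fake_reader` are hypothetical names, the reader is passed in as a parameter so the example runs without any real PDFs, and a fallback exception class is defined in case pdfminer is not installed.

```python
try:
    # Module path taken from the traceback in the question.
    from pdfminer.pdfdocument import PDFPasswordIncorrect
except ImportError:
    class PDFPasswordIncorrect(Exception):
        """Fallback so this sketch runs without pdfminer installed."""


def extract_text(path, reader):
    """Return the text of one PDF, or None if it is password protected."""
    try:
        return reader(path)
    except PDFPasswordIncorrect as e:
        print("Skipping {}: password protected ({})".format(path, e))
        return None


def collect(paths, reader):
    """Gather text from every readable PDF, skipping the protected ones."""
    results = {"path": [], "content": []}
    for path in paths:
        text = extract_text(path, reader)
        if text is None:          # pdf could not be read, go to the next one
            continue
        results["path"].append(path)
        results["content"].append(text)
    return results


def fake_reader(path):
    """Hypothetical reader that simulates a protected file by its name."""
    if "locked" in path:
        raise PDFPasswordIncorrect(path)
    return "text of " + path
```

Calling `collect(["a.pdf", "locked.pdf", "b.pdf"], fake_reader)` keeps the two readable files and skips the locked one. In real use, the reader would be the pdfminer-based parsing code, and the loop needs no second `try`/`except` for errors the parser already handles.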
Patrick Artner
  • Hi, the PDFPasswordIncorrect in the except is giving me an error saying that it is an undefined variable. Should I define it somewhere? – Cald0002 May 04 '19 at 14:57
  • @Cald no, try `except pdfminer.pdfdocument.PDFPasswordIncorrect as e:` instead. It is probably hidden inside namespaces – Patrick Artner May 04 '19 at 14:59
  • I put pdfminer.pdfdocument.PDFPasswordIncorrect in the except inside the PDF function instead of the PDFparser function and it worked! Thank you so much! – Cald0002 May 04 '19 at 15:29
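The "undefined variable" issue from the comments is about how the exception class is named in the `except` clause. Here is a small sketch of the two equivalent spellings, using the standard library's `json` module as a stand-in so it runs anywhere. For pdfminer, the analogous import would presumably be `from pdfminer.pdfdocument import PDFPasswordIncorrect`, going by the module path in the traceback.

```python
import json
from json import JSONDecodeError


def parse_qualified(text):
    """Name the exception through its module, as in the comment above."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def parse_imported(text):
    """Name the exception directly; this needs the explicit import above."""
    try:
        return json.loads(text)
    except JSONDecodeError:
        return None
```

Both functions catch the same class. The bare name only works if it has been imported into the current module; otherwise Python reports it as undefined.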