I have written a function that converts each PDF in a directory to text, and I want to save the converted text from each PDF as a .txt file. My code raises "TypeError: expected str, bytes or os.PathLike object, not tuple". Can anyone please help me with this? The code is attached here:

import io
import os
import os.path
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

    def extract_text_from_pdf(pdf_path):
        resource_manager = PDFResourceManager()
        fake_file_handle = io.BytesIO()
        converter = TextConverter(resource_manager, fake_file_handle)
        page_interpreter = PDFPageInterpreter(resource_manager, converter)

        with open(pdf_path, 'rb') as fh:
            for page in PDFPage.get_pages(fh, 
                                          caching=True,
                                          check_extractable=True):
                page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()

        # close open handles
        converter.close()
        fake_file_handle.close()

        if text:
            return text

    def save_to_txt(lst):
            for i, ele in enumerate(lst): 
                txtfile = "{}.txt".format(i)
                files = extract_text_from_pdf(ele)
                with open(txtfile, "w") as textfile:
                    textfile.write(files) 

    if __name__ == '__main__':
        pdf_path = 'C:\\Users\\Lenovo\\.spyder-py3\\OCR'
        for root, _, files in os.walk(pdf_path):
            for filename in files:
                filepath = os.path.join(root, filename)
                extract_text_from_pdf(filepath)

        for f in filepath:
            save_to_txt(f)

The error is as follows:

runfile('C:/Users/Lenovo/.spyder-py3/updatedpy.py', wdir='C:/Users/Lenovo/.spyder-py3')
Traceback (most recent call last):

  File "<ipython-input-17-f6b3bb00c382>", line 1, in <module>
    runfile('C:/Users/Lenovo/.spyder-py3/updatedpy.py', wdir='C:/Users/Lenovo/.spyder-py3')

  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/Lenovo/.spyder-py3/updatedpy.py", line 47, in <module>
    extract_text_from_pdf(file)

  File "C:/Users/Lenovo/.spyder-py3/updatedpy.py", line 22, in extract_text_from_pdf
    with open(pdf_path, 'rb') as fh:

TypeError: expected str, bytes or os.PathLike object, not tuple
Swordsman
  • Which line is getting that error? It's best to post the full traceback. – Barmar Jan 21 '19 at 09:39
  • Thanks for replying. I have attached the full traceback with the code. – Swordsman Jan 21 '19 at 09:44
  • [`os.walk()`](https://docs.python.org/3/library/os.html#os.walk) returns a tuple of root, directories, and files, not individual paths. You need to do a bit more work with the results of `os.walk()` to get what you want. If you're using a new enough version of Python, you might want to consider [`os.scandir()`](https://docs.python.org/3/library/os.html#os.scandir). – John Szakmeister Jan 21 '19 at 09:51
  • You're calling `extract_text_from_pdf()` from both the `for file in os.walk` loop and also inside `save_to_text()`. Do you really need to do both? Also `for f in file:` doesn't seem right. `file` is not a list of anything. And `save_to_text()` expects the argument to be a list. – Barmar Jan 21 '19 at 09:54
  • Thanks. Can you please suggest the code changes, as I am new to Python? – Swordsman Jan 21 '19 at 10:01
  • Also, how do I iterate over the nt.DirEntry objects so that I get the converted txt files? – Swordsman Jan 21 '19 at 10:22

1 Answer

The error comes from the use of os.walk in your main section: iterating over it does not give you file names, but (root, dirs, files) tuples. See the os documentation for more details.
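
To make the cause concrete, here is a small sketch that reproduces the problem in isolation. The temporary directory and the file name `sample.pdf` are made up for illustration and are not part of the original code:

```python
import os
import tempfile

# Build a throwaway directory with one empty file (illustrative names only).
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "sample.pdf"), "wb") as fh:
    fh.write(b"")

# Each item yielded by os.walk() is a (root, dirs, files) tuple, not a path.
entries = list(os.walk(tmp_dir))
first = entries[0]
print(type(first))

# Passing that tuple to open() raises a TypeError like the one in the question.
try:
    open(first, "rb")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

So any loop of the form `for item in os.walk(path)` hands a tuple, not a path, to whatever function it calls.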

Edit: You could use the os.walk method like this:

for root, _, files in os.walk(pdf_path):
    for filename in files:
        filepath = os.path.join(root, filename)
        extract_text_from_pdf(filepath)

Or you could use the path.py library and its walkfiles method. That way you could do:

from path import Path

pdf_path = Path('C:\\dev') 

for file in pdf_path.walkfiles():
    extract_text_from_pdf(file)
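
To also get the converted text written out as .txt files, one possible sketch is shown below. The helper name `convert_pdfs_to_txt` and its `extract` parameter are my own hypothetical names, and writing each .txt next to its PDF and decoding bytes as UTF-8 are assumptions; adapt them to your setup:

```python
import os


def convert_pdfs_to_txt(pdf_dir, extract):
    """Walk pdf_dir and write one .txt file next to each PDF found.

    `extract` is any callable that takes a file path and returns the text
    (str or bytes), for example extract_text_from_pdf from the question.
    Returns the list of .txt paths that were written.
    """
    written = []
    for root, _, files in os.walk(pdf_dir):
        for filename in files:
            # Skip anything that is not a PDF.
            if not filename.lower().endswith(".pdf"):
                continue
            pdf_file = os.path.join(root, filename)
            text = extract(pdf_file)
            if not text:
                # Nothing was extracted, so no output file is created.
                continue
            if isinstance(text, bytes):
                # Assumption: the extracted bytes are UTF-8 encoded.
                text = text.decode("utf-8", errors="replace")
            txt_file = os.path.splitext(pdf_file)[0] + ".txt"
            with open(txt_file, "w", encoding="utf-8") as out:
                out.write(text)
            written.append(txt_file)
    return written
```

With the function from the question, this could be called as `convert_pdfs_to_txt(pdf_path, extract_text_from_pdf)`. Passing the extractor in as an argument keeps the directory walking separate from the PDF handling.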
olinox14
  • Thanks, I have edited my code with the inputs you provided, but I am now getting an error: 'PDFTextExtractionNotAllowed: Text extraction is not allowed: <_io.BufferedReader name='C:'. Can you please let me know the changes I need to make? – Swordsman Jan 21 '19 at 12:30
  • I don't think this bug is related to the way you open your file. It seems like your pdf is protected or encrypted. Take a look at [this post](https://stackoverflow.com/questions/39981980/pdfminer-pdftextextractionnotallowed-error?noredirect=1) – olinox14 Jan 21 '19 at 12:46