
I am trying to iterate over many PDF files, extract their text, and write it to an Excel file. With pdfminer3 I can do this for a single PDF file, but I am having trouble looping over multiple PDF files.

from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
import io
import os
import pandas as pd


pm=[]

directory='location of folder with PDF files'
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    with open(os.path.join(directory,file), 'rb') as fh:

        for page in PDFPage.get_pages(fh,
                                    caching=True,
                                    check_extractable=True):
            page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()
            new = text.replace("\n"," ")
            new_text=new.replace("\x0c"," ")
            pm.append(new_text)
    converter.close()
    fake_file_handle.close()
# close open handles
Leeee
  • So what is your problem? What isn't working? You are opening each file in the directory that ends with '.pdf', so pm should contain the result of appending new_text from each file. What else is missing? – itprorh66 May 22 '21 at 19:34
  • Does `pm` contain the text for all PDFs or for all pages? If it contains the text for all pages, your code should work just fine. If you want to separate the text of individual files, you need to introduce an inner array. Then `pm` turns into an array of arrays, and each inner array contains the text of the pages of one file. – Roy May 22 '21 at 19:41
  • The error I keep getting is ValueError: I/O operation on closed file – Leeee May 23 '21 at 19:02

1 Answer


You are getting the I/O error because you keep using `fake_file_handle` after closing it. At the end of the first PDF, both `fake_file_handle` and `converter` are closed. The loop then tries to reuse them for the second PDF file without initialising them again, which causes the error.
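The same failure can be reproduced without any PDF library. The sketch below uses only `io.StringIO`; the function name `reuse_closed_buffer` and the sample strings are made up for illustration and are not part of the question's code.

```python
import io


def reuse_closed_buffer():
    """Write to a StringIO, close it, then try to write again."""
    buffer = io.StringIO()
    buffer.write("text from the first PDF")
    buffer.close()  # this is what happens at the end of the first loop iteration

    try:
        # Second loop iteration: the same, already closed, buffer is reused.
        buffer.write("text from the second PDF")
    except ValueError as exc:
        return str(exc)
    return None


message = reuse_closed_buffer()
print(message)
```

The printed message should match the `ValueError` reported in the comments, which is why a new buffer is needed for every file.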

Put these three lines inside the outer `for` loop. That way they are initialised for each PDF and closed once that PDF has been read.

fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

Your final code should look something like this.

from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
import io
import os
import pandas as pd

pm = []

directory = 'location of folder with PDF files'

resource_manager = PDFResourceManager()

for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    # Create a fresh buffer, converter and interpreter for each PDF
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(os.path.join(directory, file), 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

    # getvalue() returns everything written so far, so read it once per file;
    # calling it inside the page loop would append cumulative text for each page
    text = fake_file_handle.getvalue()
    new = text.replace("\n", " ")
    # Assumed: the character replaced here was a form feed that was lost in the post
    new_text = new.replace("\x0c", " ")
    pm.append(new_text)

    # close open handles
    converter.close()
    fake_file_handle.close()

print(pm)
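The two chained `replace` calls could also be folded into one small helper. This is only a sketch: `clean_text` is a hypothetical name, and it assumes the goal is to turn every run of whitespace into a single space, including newlines and the form feed that pdfminer-style text converters usually write after each page.

```python
def clean_text(raw: str) -> str:
    """Collapse newlines, form feeds and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace,
    # which includes "\n" and the form feed "\x0c".
    return " ".join(raw.split())


sample = "First line\nsecond line\x0cNext page\n"
cleaned = clean_text(sample)
print(cleaned)
```

With a helper like this, the loop body would reduce to something like `pm.append(clean_text(fake_file_handle.getvalue()))`.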
Roy