
I am trying to extract all of the text from a PDF file using Python. The PDFs are fetched from URLs, and they include tables. The code below works, but when it reaches a table, the table's text is printed column by column instead of row by row, which scrambles my data. Is there a way to have tables read by rows without having to process each table separately? I still need all of the text from the PDF to come out together.

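# Imports assume the pdfminer / pdfminer.six package.
import io
import urllib.request

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def getTextFromPDF(url):
    # Download the PDF into memory; the name avoids shadowing the built-in open().
    pdf_bytes = urllib.request.urlopen(url).read()
    memoryFile = io.BytesIO(pdf_bytes)

    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with memoryFile as fh:
        # Feed every page through the text converter.
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()
    return text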

1 Answer


This answer is for anyone who encounters PDFs containing images and needs to use OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

Here are the steps I found to work:

1. Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the PDF into images.
2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
3. Use OpenCV to find and extract tables.
4. Use OpenCV to find and extract each cell from the table.
5. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
6. Use Tesseract to OCR each cell.
7. Combine the extracted text of each cell into the format you need.
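
For the last step, combining the cell text, here is a minimal sketch that assumes the target format is CSV and that the OCR results are already arranged as a list of rows, each a list of cell strings. The function name `cells_to_csv` and the sample data are made up for illustration; they are not taken from the package.

```python
import csv
import io


def cells_to_csv(rows):
    """Join OCR'd cell text into CSV, one table row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # OCR output often carries stray whitespace and newlines; collapse them.
        writer.writerow([" ".join(cell.split()) for cell in row])
    return buffer.getvalue()


# Hypothetical OCR results: outer list is rows (top to bottom),
# inner lists are cells (left to right).
ocr_cells = [
    ["Name ", "Qty\n"],
    ["Widget, large", " 3"],
]
csv_text = cells_to_csv(ocr_cells)
print(csv_text)
```

Using the `csv` module rather than joining with commas by hand means cells that themselves contain commas or quotes are escaped correctly.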

I wrote a Python package with modules that can help with those steps.

Repo: https://github.com/eihli/image-table-ocr

Docs & Source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Some of the steps don't require code; they take advantage of external tools like pdfimages and tesseract. I'll provide brief examples for a couple of the steps that do require code.
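
If you would rather drive the external tools from Python than from a shell, something like the following sketch could work. It assumes `pdfimages`, `tesseract` (with its orientation-detection data installed), and `mogrify` are on your PATH, and that Tesseract's `Rotate:` value can be passed straight to `mogrify -rotate`; check that on a sample page before trusting it. The helper names are hypothetical and not part of the package.

```python
import re
import subprocess


def pdfimages_command(pdf_path, prefix):
    # Extracts the images embedded in the PDF as PNG files named after `prefix`.
    return ["pdfimages", "-png", pdf_path, prefix]


def osd_command(image_path):
    # --psm 0 asks Tesseract for orientation and script detection only.
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_rotation(osd_output):
    """Return the value of the 'Rotate: N' line, or 0 if there is none."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def mogrify_command(image_path, degrees):
    # mogrify rewrites the image file in place.
    return ["mogrify", "-rotate", str(degrees), image_path]


def fix_rotation(image_path):
    """Detect rotation with Tesseract and correct it in place with mogrify."""
    osd = subprocess.run(
        osd_command(image_path), capture_output=True, text=True, check=True
    )
    degrees = parse_rotation(osd.stdout)
    if degrees:
        subprocess.run(mogrify_command(image_path, degrees), check=True)
    return degrees
```

Keeping the command construction and the output parsing in separate small functions makes them easy to check without having the tools installed.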

Finding tables: This link was a good reference while I was figuring out how to find tables: https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/