
I have a PDF file of over 100 pages. It contains boxes and columns of text. When I extract the text using PyPDF2 and the tika parser, I get a string of data that is out of order. In many cases it is ordered by column, and in other cases it skips around the document. Is it possible to read the PDF starting from the top and moving left to right until the bottom? I want to read the text in the columns and boxes, but I want each line of text output as it would be read left to right.

I've tried: PyPDF2 - the only tool is extractText(). Fast, but it does not give gaps between the elements. Results are jumbled.

pdfminer - PDFPageInterpreter with LAParams. This works well but is slow: at least 2 seconds per page, and I've got 200 pages.

pdfrw - this only tells me the number of pages.

tabula-py - only gives me the first page. Maybe I'm not looping it correctly.

tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.

from tkinter import filedialog
import os
from tika import parser
import re

# Select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(), filetypes=[("PDF files", "*.pdf")])
print(file_path)  # print that path
file_data = parser.from_file(file_path)  # parse data from the file
text = file_data['content']  # get the file's text content
# Split the document into pages on a string that always appears at the top of each page
by_page = text.split('... Information')

for i in range(1, len(by_page)):  # loop page by page, skipping the text before the first header
    info = by_page[i]  # get one page's worth of data from the PDF
    reformatted = info.replace("\n", "&")  # replace newlines with "&" to make the output more readable
    print("Page: ", i)  # print the page number
    print(reformatted, "\n\n")  # print the text string from the PDF

This produces output of a sort, but it is not ordered the way I would like. I want the PDF to be read left to right. A pure-Python solution would be a bonus: I don't want my end users to be forced to install Java (I think the tika and tabula-py methods depend on Java).

  • 4
    This is probably not straightforward, as you would need to do a layout analysis of the PDF and sort the text accordingly. In a PDF, the text appearing at the end of the page can be placed at the beginning of the content stream. So the results are most likely not jumbled; that is just the order in which they appear in the PDF file. How they are represented on the screen is a different thing altogether. – Philipp Jun 14 '19 at 07:03
  • 1
    Check https://pypi.org/project/pdfminer/ – Shanavas M Jun 14 '19 at 07:04
  • As Philipp said, extracting text in natural reading order from a PDF is a lot more difficult than you might expect. Why is extracting the text in reading order important for you? What do you do with the output? – Ryan Jun 14 '19 at 22:04
  • The order in which the data appears gives me information in itself. In the pdf the placement of the data tells me what it is. I lose that information in the jumble of the pdf conversion. I have been pretty resourceful extracting the needed information from the converted string, but I'm not sure I can get all the information I need. To answer your question, I'm using the data to create a csv file. That csv file will be read by my python program and installed into a database. – Thomas Weeks Jun 15 '19 at 01:38
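
A minimal sketch of the layout-based sorting Philipp describes, assuming each text fragment already comes with page coordinates (pdfminer's layout objects expose bounding boxes, for instance). The `(x0, y0, text)` tuple format, the `reading_order` name and the `y_tolerance` value are hypothetical, chosen for illustration and not part of any library. Fragments whose vertical positions fall within the tolerance are treated as one visual line; lines are then ordered top to bottom and fragments within a line left to right.

```python
from typing import List, Tuple

# Hypothetical fragment format: (x0, y0, text), with the PDF convention
# that the origin is bottom-left, so a larger y0 is higher on the page.
Fragment = Tuple[float, float, str]


def reading_order(fragments: List[Fragment], y_tolerance: float = 3.0) -> List[str]:
    """Group fragments into visual lines and return one string per line."""
    lines: List[Tuple[float, List[Fragment]]] = []
    # Visit fragments from the top of the page down.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Same visual line if close enough to the line's first (highest) fragment.
        if lines and abs(lines[-1][0] - frag[1]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag[1], [frag]))
    # Within each line, order fragments left to right.
    return [" ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
            for _, line in lines]


if __name__ == "__main__":
    # Two columns stored column by column, as a content stream might do.
    page = [
        (50, 700, "Name:"), (50, 680, "Date:"),
        (300, 701, "Thomas"), (300, 679, "2019-06-14"),
    ]
    for line in reading_order(page):
        print(line)
```

Note that this deliberately reads straight across columns, which is what the question asks for; text that is meant to be read column by column would need a different grouping step. The tolerance would likely have to be tuned to the font sizes in the actual document.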

1 Answer

-1

I did this for .docx with this code, where txt is the text of the .docx. Hope this helps.

import re
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')
new = re.sub(pttrn, r'\1\2\n\n', txt)

print(new)
Syafiqur__
  • That created some breaks in the text. The result is still in the same (jumbled) order, just with some added line breaks. The code you offered does not read the PDF page line by line. – Thomas Weeks Jun 17 '19 at 06:43
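
To illustrate the point in this comment, here is a small sketch that applies the answer's substitution to a made-up sample string. The `add_breaks` wrapper name and the sample text are hypothetical. The substitution only inserts a blank line after sentence-ending punctuation, so the words come out in exactly the order they went in.

```python
import re

# Pattern from the answer: sentence-ending punctuation, an optional quote, then whitespace.
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')


def add_breaks(txt: str) -> str:
    """Insert a blank line after each sentence; the word order is untouched."""
    # On Python 3.5+ an unmatched optional group (\2) is replaced by an empty string.
    return re.sub(pttrn, r'\1\2\n\n', txt)


if __name__ == "__main__":
    sample = "Hello world. How are you? Fine."  # made-up sample text
    print(add_breaks(sample))
```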