I have a pdf file over 100 pages. There are boxes and columns of text. When I extract the text using PyPdf2 and tika parser, I get a string of of data which is out of order. It is ordered by columns in many cases and skips around the document in other cases. Is it possible to read the pdf file starting from the top, moving left to right until the bottom? I want to read the text in the columns and boxes, but I want the line of text displayed as it would be read left to right.
I've tried: PyPDF2 - the only tool is extracttext(). Fast but does not give gaps in the elements. Results are jumbled.
Pdfminer - PDFPageInterpeter() method with LAParams. This works well but is slow. At least 2 seconds per page and I've got 200 pages.
pdfrw - this only tells me the number of pages.
tabula_py - only gives me the first page. Maybe I'm not looping it correctly.
tika - what I'm currently working with. Fast and more readable, but the content is still jumbled.
from tkinter import filedialog
import os
from tika import parser
import re
# select the file you want
file_path = filedialog.askopenfilename(initialdir=os.getcwd(),filetypes=[("PDF files", "*.pdf")])
print(file_path) # print that path
file_data = parser.from_file(file_path) # Parse data from file
text = file_data['content'] # Get files text content
by_page = text.split('... Information') # split up the document into pages by string that always appears on the
# top of each page
for i in range(1,len(by_page)): # loop page by page
info = by_page[i] # get one page worth of data from the pdf
reformated = info.replace("\n", "&") # I replace the new lines with "&" to make it more readable
print("Page: ",i) # print page number
print(reformated,"\n\n") # print the text string from the pdf
This provides output of a sort, but it is not ordered in the way I would like. I want the pdf to be read left to right. Also, if I could get a pure python solution, that would be a bonus. I don't want my end users to be forced to install java (I think the tika and tabula-py methods are dependent on java).