I have a PDF linked here. I am trying to extract the text from it block by block so I can keep track of every detail, but each block's data gets mixed with the data from the other columns. I tried PyPDF2, Tabula and Tika, but none of them gave me the right result.
Tabula immediately returns an empty list. I also tried rotating the PDF and then using the "visitor_body" function mentioned in the documentation to extract only a portion of the page so the data would not get mixed up, but PyPDF2 still reads the PDF from top to bottom as if it were the right way up.
import PyPDF2

pdfPath = "rotate_pages.pdf"
reader = PyPDF2.PdfReader(pdfPath)
page = reader.pages[0]

parts = []

def visitor_body(text, cm, tm, fontDict, fontSize):
    # tm[5] is the vertical offset from the text matrix
    y = tm[5]
    # keep only the text whose y coordinate lies in the body region
    if 150 < y < 900:
        parts.append(text)

page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)
print(text_body)
The output is the same as for the un-rotated PDF. At line 9, the right column's data starts immediately after the first column's. I cannot split on the country name or the phone numbers because they are not constant.
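For reference, the kind of column split I am after could be sketched with the same visitor mechanism by bucketing on the horizontal offset (tm[4]) instead of filtering on tm[5]. This is only a rough sketch under assumptions: make_column_visitor and the boundary value 300 are hypothetical names and numbers of mine, and it assumes tm[4] roughly matches the x position on the page, which may not hold for a rotated page or when cm is not close to the identity matrix.

```python
from collections import defaultdict


def make_column_visitor(boundaries):
    """Build a visitor that buckets text fragments by x position.

    boundaries is a list of x values (hypothetical, must be measured for
    the real PDF) marking the left edge of each column after the first.
    Returns the visitor callable and the dict it fills in.
    """
    columns = defaultdict(list)

    def visitor(text, cm, tm, font_dict, font_size):
        # tm[4] is the horizontal offset from the text matrix.
        # Assumption: it approximates the x position on the page.
        x = tm[4]
        # Column index = number of boundaries at or to the left of x.
        col = sum(1 for b in boundaries if x >= b)
        # Fragments stay in the order the extractor visits them.
        columns[col].append(text)

    return visitor, columns


def column_text(columns, index):
    """Join the fragments collected for one column into a single string."""
    return "".join(columns[index])
```

With PyPDF2 this would be used as visitor, columns = make_column_visitor([300]), then page.extract_text(visitor_text=visitor), then column_text(columns, 0) for the left column and column_text(columns, 1) for the right one. Whether tm[4] alone is enough for my rotated file is exactly what I am unsure about.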