
I have a PDF linked here. I am trying to extract its text as separate blocks so I can keep track of every detail, but each block's data gets mixed with the data from the other columns. I tried PyPDF2, Tabula and tika, but none of them gave me the right result.

Tabula immediately returns an empty list. I also tried rotating the PDF and then using a "visitor_body" function, as described in the documentation, to extract only a portion of the page so the data would not get mixed up, but PyPDF2 still reads the page from top to bottom as if it were the right way up.

import PyPDF2

pdf_path = "rotate_pages.pdf"

reader = PyPDF2.PdfReader(pdf_path)
page = reader.pages[0]

parts = []

def visitor_body(text, cm, tm, font_dict, font_size):
    # tm is the text matrix; tm[5] is the vertical position of the fragment
    y = tm[5]
    # keep only fragments inside the vertical band of interest
    if 150 < y < 900:
        parts.append(text)

page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)

Output of the above code, starting from the 3rd line.

The output is the same as for the un-rotated PDF. At line 9, the right column's data starts immediately after the first column's. I cannot split on the country name or the phone number because neither appears consistently.
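For illustration, here is a minimal sketch of one possible way to keep the columns apart: record each fragment's horizontal position as well as its vertical one, then group fragments by column before joining them. It is untested against this PDF and rests on several assumptions: that tm[4] and tm[5] approximate the x and y position of each fragment (this ignores the cm matrix and may not hold on a rotated page), and that the columns are separated by fixed x boundaries. The helper names make_visitor and split_columns and the boundary value 300 are hypothetical.

```python
from collections import defaultdict


def make_visitor(fragments):
    """Return a visitor_text-style callback that records (x, y, text)."""
    def visitor(text, cm, tm, font_dict, font_size):
        # Assumption: tm[4] and tm[5] approximate the fragment's x and y.
        if text.strip():  # skip whitespace-only fragments
            fragments.append((tm[4], tm[5], text))
    return visitor


def split_columns(fragments, boundaries):
    """Group fragments into columns by x position.

    boundaries is a sorted list of x values separating adjacent columns.
    These are hypothetical and would have to be measured for the real PDF.
    """
    columns = defaultdict(list)
    for x, y, text in fragments:
        # Column index = number of boundaries at or left of this fragment.
        index = sum(x >= b for b in boundaries)
        columns[index].append((y, x, text))

    result = []
    for index in range(len(boundaries) + 1):
        # PDF y grows upward, so descending y gives top-to-bottom order.
        ordered = sorted(columns[index], key=lambda item: (-item[0], item[1]))
        result.append("".join(text for _, _, text in ordered))
    return result
```

With PyPDF2 this would be used as fragments = [], then page.extract_text(visitor_text=make_visitor(fragments)), then split_columns(fragments, [300]), which returns one string per column. Splitting each column further into individual address blocks (for example on larger vertical gaps) is not covered by this sketch.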

  • What is your desired result? Do you want each address block separated by a space for example? – Nick Nov 16 '22 at 14:21
  • A list or something similar, so that each entry is separated from the others, because the entries do not end with anything constant such as a website or phone number. – Sarim Bin Waseem Nov 16 '22 at 14:32
