By "asynchronously" I mean that, as you can see in the second screenshot, the address and phone details are getting mixed together.

I have a task to parse a PDF file with a Python script and extract some specific attributes: first name, last name, address and email. This is what I have done so far:

from PyPDF2 import PdfFileReader

# Extract the text of the first page (older, pre-3.0 PyPDF2 API).
with open('CV_Smith.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    contents = reader.getPage(0).extractText()

print(contents)

But I am running into a problem: the text comes out "asynchronously" (in a jumbled order), which makes it difficult to process.

Screenshots of the given PDF: [image 1] [image 2]

Thank you in advance.

  • What do you mean by coming "asynchronously"? – Sush Mar 22 '17 at 05:40
  • This looks like part of a CV. First of all, this is not an easy job. Your attempt does not yet try to extract any of the fields. If the documents all follow a single format, the fields are easy to identify. Otherwise you have to use techniques such as `regex` for the `email` and so on. It is also better to work with formatted text instead of plain text, because formatted text holds more information. – Rahul K P Mar 22 '17 at 06:01
  • Exactly, the text is not formatted and that is the issue. – Nileema Gaykwad Mar 22 '17 at 06:26
  • You can use `regex`, but again it's risky because your data is unstructured. – akash karothiya Mar 22 '17 at 06:30

1 Answer

pypdf (and also PyPDF2) have improved a lot since you asked the question. It might now work as you want.
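For reference, here is a minimal sketch of the same first-page extraction with the current pypdf API (`PdfReader`, `pages`, `extract_text`). The helper name `first_page_text` is only an illustration, and whether the output order is good enough for your CV is something you would have to check on your own file.

```python
def first_page_text(path):
    """Return the text pypdf extracts from the first page of a PDF.

    `path` may be a file path or a binary file object.
    """
    # Imported inside the function so this snippet still loads where
    # pypdf is not installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(path)
    # `pages` behaves like a list of page objects; index 0 is the first page.
    return reader.pages[0].extract_text()
```

Assuming `CV_Smith.pdf` is in the working directory, `print(first_page_text("CV_Smith.pdf"))` would print the extracted text.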

However, what you want might not be possible without heuristics. You want a semantic extraction ("boxing") of text fragments, and this information is not stored in the PDF. In the worst case, every single letter is absolutely positioned on the page, without any hint as to which letters belong to the same word, let alone to the same "block" of text.
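As the comments suggest, regular expressions are one such heuristic for the fields that have a recognisable shape. The sketch below assumes the page text is already available as a string; the patterns and the helper name `find_contact_fields` are illustrative assumptions, deliberately simple, and will need tuning for real CVs. Names and addresses have no fixed pattern, so they are not covered here.

```python
import re

# Deliberately simple patterns: good enough for a demo, not for every CV.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least six digits/spaces/brackets/dots/dashes, then a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def find_contact_fields(text):
    """Return the first email and phone-like string in `text` (or None)."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


# Made-up sample in which address, phone and email run together on one line.
sample = ("John Smith 12 Example Road Springfield "
          "+1 555 010 9999 john.smith@example.com")
print(find_contact_fields(sample))
```

Because the patterns search the whole string, they still find the email and phone number when those are mixed in with the address, which is the situation shown in the second screenshot.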

Martin Thoma