
I tried to extract and parse text from a PDF with PyPDF2, using the following code:

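import PyPDF2
import re

# Open the PDF in binary mode; the context manager closes the file afterwards.
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # getPage() needs a zero-based page index; 0 selects the first page.
    rawText = pdfReader.getPage(0).extractText()

# Split the extracted text on newlines and tabs.
extractedText = re.split('\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")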

Case 1: When I parse the PDF text, I cannot reproduce it exactly as it appears in the PDF. For example:

[Screenshot of the passage as it appears in the PDF]

In this case, no line break or newline appears in either rawText or extractedText, and the result looks like this:

    input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
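
One possible post-processing workaround for Case 1 is sketched below. It rests on an assumption that may not hold for every document: a lost line break shows up as sentence-ending punctuation that directly touches a capital letter (as in `it.Another` above). The helper name `restore_breaks` is hypothetical and not part of PyPDF2.

```python
import re

def restore_breaks(text):
    """Insert a newline where '.', '!' or '?' preceded by a lowercase
    letter is immediately followed by a capital letter (assumed lost break)."""
    return re.sub(r'(?<=[a-z][.!?])(?=[A-Z])', '\n', text)

sample = "unless you update it.Another common case is asserting"
print(restore_breaks(sample))
```

This is only a heuristic: it cannot recover breaks that fall in the middle of a sentence, and it could split text that legitimately contains a period followed by a capital letter.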

Case 2: Consider the following case:

[Screenshot of the scores as they appear in the PDF]

It gives this result:

2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11

This is even harder to parse, because the individual scores cannot be told apart. Is it possible to parse these cases correctly with PyPDF2 or any other Python library?
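
To illustrate how far plain string handling gets on the Case 2 output, here is a rough sketch. It assumes that every row label starts with a digit, a capital letter, and a period followed by a space (`2B. `, `2C. `, `2D. `), and that the label itself contains no digits. Under those assumptions the rows and their labels can be recovered, but the fused digit run (for example `5710509-11`) stays ambiguous, since nothing in the string marks where one score ends and the next begins.

```python
import re

fused = ("2B. Community Living5710509-11"
         "2C. Lifelong Learning69116310-12"
         "2D. Employment5710509-11")

# Assumed row marker: digit + capital letter + ". "
ROW_START = r'\d[A-Z]\. '

# Take each row lazily up to the next row marker or the end of the string.
rows = re.findall(ROW_START + r'.*?(?=' + ROW_START + r'|$)', fused)

parsed = []
for row in rows:
    # The label runs until the first digit after the marker; the rest is the
    # fused score string, which this sketch does not try to split further.
    m = re.match(r'(' + ROW_START + r'\D+)(.*)', row)
    parsed.append((m.group(1), m.group(2)))

for label, scores in parsed:
    print(label, '|', scores)
```

Splitting the score columns reliably would need positional information from the PDF itself, which this string-only approach does not have.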

  • Check this answer, it may be related: https://stackoverflow.com/questions/11017379/pypdf-ignores-newlines-in-pdf-file – Isma Sep 24 '17 at 10:58
  • Using `extractText(Tj_sep="\n")` puts every visibly separate string or line on its own line, but I still have not managed to preserve the line spacing between paragraphs. Thanks, @Isma – Nawshad Rehan Rasha Sep 25 '17 at 06:47
