I tried to extract and parse text from a PDF with PyPDF2, using the following code:
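import re
import PyPDF2
# Open the PDF in binary mode and read the first page (index 0).
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
rawText = pdfReader.getPage(0).extractText()
# Split the extracted text on newlines and tabs.
extractedText = re.split('\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")
pdfFileObj.close()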
Case 1: When I parse the PDF text, I cannot reproduce it exactly as it appears in the PDF. For example, in this case no line break or newline can be found in either rawText or extractedText, and the result looks like this:
input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
Case 2: For a table of scores, it gives the following result:
2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11
This is even harder to parse, because the individual scores can no longer be told apart. Is it possible to parse these cases correctly with PyPDF2 or any other Python library?
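To make Case 2 concrete, here is a minimal sketch of how far a regular expression gets on that output. It assumes every row starts with a label of the form digit, capital letter, dot (such as "2B."), which is only a guess based on the three rows shown; split_rows is a hypothetical helper name, not part of PyPDF2.

```python
import re

# Output of extractText() for Case 2, copied from the question.
fused = ("2B. Community Living5710509-11"
         "2C. Lifelong Learning69116310-12"
         "2D. Employment5710509-11")


def split_rows(text):
    """Split the fused text into (row name, digit run) pairs.

    Assumption: each row begins with a label like "2B." followed by a space.
    """
    # Zero-width split just before each row label; drop the empty first piece.
    rows = [r for r in re.split(r"(?=\d[A-Z]\. )", text) if r]
    result = []
    for row in rows:
        # Separate the row name from the trailing run of digits and hyphens.
        match = re.match(r"(\d[A-Z]\. [A-Za-z ]+?)([\d-]+)$", row)
        if match:
            result.append(match.groups())
    return result


for name, digits in split_rows(fused):
    print(name, "->", digits)
```

The rows and their names can be recovered this way, but each run of digits (for example "5710509-11") still has no separators, so the boundaries between the individual scores cannot be recovered from the text alone.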