
I am using PyPDF2 to read text from a PDF file. However, it does not return the text line by line; it seems to read it in a haphazard order, inserting newlines where none are present in the PDF.

import PyPDF2

pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
# Open in binary mode; the with-block closes the file when done
with open(pdf_path, 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    page_nos = pdfReader.numPages
    for i in range(page_nos):
        # Creating a page object
        pageObj = pdfReader.getPage(i)
        # Printing the page number
        print("Page No: ", i)
        # Extracting text from the page
        # and splitting it into chunks on double spaces
        text = pageObj.extractText().split("  ")
        # Iterating over the chunks
        for chunk in text:
            # Printing each chunk, followed by a blank line
            print(chunk, end="\n\n")
        print()

This gives me the following output:

Our Ref :
21
1
8
88
1
11
5 
 
Name: 
S
ky Blue
 
 
Ref 1 :
1
2
-
34
-
56789
-
2021/2 
 
Ref 2:
F2021004
444
 

Amount: 
$
1
00
.
11
... 

whereas the expected output was:

Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane

Here is the link to the PDF file: https://pdfhost.io/v/eCiktZR2d_sample2

Himanshu Poddar
  • PDF is not a word-processor format. Its aim is to produce a visually similar document on a range of output devices. To do this, it may issue PostScript positioning commands between characters that appear visually to be lined up. `PyPDF` can't read the file *line by line* because the format doesn't actually have lines, just bunches of characters that happen not to have PostScript code between them. Also: `PyPDF` was last updated in 2010. It has successors. `PyPDF2` was in turn followed by `PyPDF3` and `PyPDF4`. None of them can really do what you expect. They all work best at page level. – BoarGules Jun 14 '22 at 14:33
  • Does this answer your question? [How to get pypdf to read page content line by line?](https://stackoverflow.com/questions/15459802/how-to-get-pypdf-to-read-page-content-line-by-line) – BoarGules Jun 14 '22 at 14:35
  • The PDF is no longer available, and you didn't state which version of PyPDF2 you were using. PyPDF2 has improved a lot in the past month. – Martin Thoma Jun 22 '22 at 14:29
  • @MartinThoma I tried with version `2.3.1` as well; it did not work. – Himanshu Poddar Jun 22 '22 at 14:33
  • @MartinThoma I have updated the link to pdf file – Himanshu Poddar Jun 22 '22 at 14:34
  • Thank you! Do you have the copyright on that PDF? May I upload it in the PyPDF2 repository as a test case? – Martin Thoma Jun 22 '22 at 14:51
  • @MartinThoma Sure, you can use it! Please share the link with me as well. – Himanshu Poddar Jun 22 '22 at 15:00
  • Thank you for the nice words! I'm the current maintainer of PyPDF2, which means that I release it and organize many things. I'm not the original author and I'm certainly not the only developer in this project, though :-) I'm doing quite a bit of community work; keeping an open eye on stackoverflow.com is one part of that :-) – Martin Thoma Jun 22 '22 at 16:38
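
To illustrate the point made in the comments that a PDF stores positioned characters rather than lines, here is a minimal sketch of how a tool might rebuild lines from positions. It is not how PyPDF2 or any other library actually does it. The `(x, y, text)` tuples, the `group_into_lines` function, and the `y_tolerance` parameter are all hypothetical, and `y` is assumed to be measured from the top of the page.

```python
def group_into_lines(chars, y_tolerance=2.0):
    """Group positioned characters into text lines.

    chars: iterable of (x, y, text) tuples (hypothetical layout data).
    Characters whose y differs from the first character of the current
    line by at most y_tolerance are treated as being on the same line.
    """
    lines = []  # each entry: (y of the line's first character, [(x, text), ...])
    # Visit characters top to bottom, then left to right
    for x, y, text in sorted(chars, key=lambda c: (c[1], c[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order the characters by x and join them
    return ["".join(t for _, t in sorted(items)) for _, items in lines]


# Made-up characters, stored out of reading order with slightly uneven baselines
chars = [(20, 10.5, "e"), (10, 10.0, "R"), (30, 9.8, "f"),
         (10, 30.0, "$"), (20, 30.2, "1")]
print(group_into_lines(chars))
```

A real extractor also has to decide where spaces belong, which this sketch ignores.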

1 Answer


I tried a different package called pdfplumber. It was able to read the PDF line by line, exactly the way I wanted.

1. Install the package pdfplumber

pip install pdfplumber

2. Get the text and store it in some container

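import pdfplumber

# pdf_path is the path to the PDF, as defined in the question
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    # extract_text() returns the page text with lines separated by "\n"
    pdf_text = first_page.extract_text()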
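
To go from the extracted string to a list of lines, and to cover every page rather than only the first, something along these lines might work. This is a minimal sketch: `split_into_lines` and `extract_lines` are hypothetical helper names, and it assumes each page's text comes back as a single newline-separated string (or an empty value for a page without text).

```python
def split_into_lines(page_text):
    """Return the non-empty, stripped lines of one page's extracted text."""
    if not page_text:  # covers None or "" for a page without text
        return []
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def extract_lines(pdf_path):
    """Collect text lines from every page of the PDF at pdf_path."""
    import pdfplumber  # third-party; imported here so split_into_lines works without it

    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            lines.extend(split_into_lines(page.extract_text()))
    return lines
```

Calling `extract_lines(pdf_path)` would then return one list entry per visual line, which can be printed or parsed further.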
Himanshu Poddar