I am trying to extract text from a PDF using PyPDF2, but instead of the actual text the output shows something else. What could be the reason?

Here is my code:

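import PyPDF2

# Open the PDF in binary mode; 'filename' is a placeholder for the real path
xfile = open('filename', 'rb')
pdfReader = PyPDF2.PdfFileReader(xfile)
num = pdfReader.numPages          # total number of pages
pageobj = pdfReader.getPage(0)    # first page (zero-based index)

print(pageobj.extractText())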

When I run the above program I get this output. What could be the reason?

!"#$%#&'(%!#)
(((((((((((((((((((((((((((((((((((((((((((((((((!"#$%#&'(%!#)*+,-./0!$1(230
4444444444445674+8,8,9:+*8
4&*)+!,$-.
4,*7;44444444444444444444444444
4$/012/($/3414546(78(,69:/7;7<=(>"#)?@(A2B2/231
(444<(4=&2#4$>4?&@!0$24A>/$>&&@$>/B4?CDEF4+(;8
4,*7,444*B62C;2/0(#B(%69(%9:77;@("1;23D5B
((((?C<GA47,H#B48:(,*I
4,*7*444E2F2:2B(.2G702=2(A10=2;2=2@("1;23D5B
((((?<GA47*H#B4?CDEF46(8
44%'$HH%(!.*($.,&I&%,%


Vadim Kotov

2 Answers

PDF is a file format oriented around page layout, so the text in a PDF can be stored in various ways. It is not guaranteed that your PDF stores its text in a form that PyPDF2 can read.

Moving forward: try extracting text from other PDFs before concluding that there is a fault with your PyPDF2 implementation.
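
When comparing several PDFs, a rough check for whether the extracted text looks like real text can help. This is only a minimal sketch: `letter_ratio`, `looks_garbled`, and the 0.5 threshold are hypothetical helpers for illustration, not part of PyPDF2, and the heuristic assumes the document is mostly alphabetic text.

```python
def letter_ratio(text):
    """Fraction of non-whitespace characters that are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic: treat text as garbled when fewer than `threshold`
    of its visible characters are letters (empty text also counts)."""
    return letter_ratio(text) < threshold


# First line of the output shown in the question: it contains no letters
sample = '!"#$%#&\'(%!#)'
print(looks_garbled(sample))                     # True
print(looks_garbled("Plain English sentence."))  # False
```

If the check flags one file only, the problem is likely that file's encoding rather than the code.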

You can also try OCR with pytesseract and see if your result improves.
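
The OCR route could look roughly like the sketch below. It assumes the third-party packages `pytesseract` and `pdf2image` are installed (along with the Tesseract and Poppler binaries they wrap); `ocr_pages` and `ocr_pdf` are hypothetical helper names. pytesseract works on images, so the PDF pages are rendered to images first.

```python
def ocr_pages(images, ocr=None):
    """Run OCR on each page image and return one string per page.

    `ocr` is any callable taking an image and returning text; by default
    pytesseract.image_to_string is used (imported lazily).
    """
    if ocr is None:
        import pytesseract  # third-party, needs the Tesseract binary
        ocr = pytesseract.image_to_string
    return [ocr(image) for image in images]


def ocr_pdf(path):
    """Render every page of the PDF to an image, then OCR it."""
    from pdf2image import convert_from_path  # third-party, needs Poppler
    return ocr_pages(convert_from_path(path))
```

A call such as `print(ocr_pdf('filename')[0])` would then show the recognized text of the first page. OCR is slower and can misread characters, so it is best treated as a fallback.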

Sai Prashanth

From PyPDF2's documentation:

This works well for some PDF files, but poorly for others, depending on the generator used.

Your PDF might be in the latter category, in which case you are out of luck with PyPDF2.

With PyPDF2 no longer being actively developed (no updates to the PyPI package since 2016), maybe try a more up-to-date package like pdftotext.
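
A minimal sketch of the pdftotext route, assuming the third-party `pdftotext` package (which wraps Poppler) is installed. `join_pages` and `extract_with_pdftotext` are hypothetical helper names; `pdftotext.PDF` is the package's reader class, which yields one string per page.

```python
def join_pages(pages):
    """Join per-page strings into one document, separated by blank lines."""
    return "\n\n".join(page.strip() for page in pages)


def extract_with_pdftotext(path):
    """Extract the text of every page using the pdftotext package."""
    import pdftotext  # third-party: pip install pdftotext (needs Poppler)

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)      # iterable of page strings
        return join_pages(pdf)
```

Calling `print(extract_with_pdftotext('filename'))` would print the whole document. Whether this succeeds where PyPDF2 failed still depends on how the PDF encodes its text.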

PnkFlffyUncrn