I am reading some PDF files and, unfortunately, I can only use PyPDF2.

import PyPDF2

with open(filename1, 'rb') as pdfFileObj:
    # create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj = pdfReader.getPage(0)
    gg = pageObj.extractText()
    print(gg)   # <- print shows readable text
    type(gg)    # <- str
    gg          # <- the repr shows a '\x00' before each character, e.g. '\x00K\x00o\x00n\x00t\x00o\x00a\x00u\x00s\x00z\x00ü\x00g\x00e\n\x00S\x00L\x00'

My issue is that `gg` is not a bytes object but a str that looks like a representation of bytes, so I cannot decode it into text. How can I get at the text that `print` shows, or convert this bytes-like representation to text, so that I can run some regexes on it?

wovano
pRo
  • Not able to reproduce the problem. Note that the newest PyPDF2 versions use different method names (still backward compatible). See [doc](https://pypdf2.readthedocs.io/en/latest) – cards Nov 27 '22 at 17:15

1 Answer

One solution is to remove every \x00 from the string with re.sub('\x00', '', gg), which leaves only the text. If there is a more efficient way, I would like to see it.
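
A minimal sketch of that idea, assuming `gg` looks like the sample in the question (the literal below is a shortened, made-up stand-in, not real PyPDF2 output). For a fixed single-character pattern, `str.replace` should give the same result as `re.sub` without involving the regex engine:

```python
import re

# Shortened, hypothetical stand-in for the string returned by extractText().
gg = '\x00K\x00o\x00n\x00t\x00o\n\x00S\x00L'

# Both calls remove every NUL character from the string.
cleaned_re = re.sub('\x00', '', gg)
cleaned = gg.replace('\x00', '')

print(cleaned)
```

The cleaned value is an ordinary str, so it can be passed straight to `re.search` or `re.findall`.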

pRo
  • This would work indeed if the text is encoded as UTF-16 and only contains Unicode characters in the range 0-255 (no "special" characters), or if the `\x00` bytes are inserted for another (unknown) reason. However, if the text is UTF-16 encoded and contains special characters, this approach will lead to corrupted text. I guess you will just have to try and see which solution works best for your PDF file. – wovano Nov 27 '22 at 22:35
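
The caveat in the comment above can be illustrated with a small sketch. It assumes that each character of the extracted string stands for one byte of UTF-16-BE text and that newlines separate independently encoded segments, which is only a guess based on the sample in the question. The helper name `decode_utf16_segments` and the test string are hypothetical:

```python
def decode_utf16_segments(raw: str) -> str:
    """Decode each newline-separated segment as UTF-16-BE.

    Segments that cannot be decoded fall back to plain NUL stripping.
    """
    decoded = []
    for segment in raw.split('\n'):
        try:
            # Map each character back to one byte, then decode the bytes.
            decoded.append(segment.encode('latin-1').decode('utf-16-be'))
        except (UnicodeEncodeError, UnicodeDecodeError):
            decoded.append(segment.replace('\x00', ''))
    return '\n'.join(decoded)


# Build a test string the way the assumption describes: UTF-16-BE bytes
# shown as characters, with a bare newline between two segments.
raw = ('5€'.encode('utf-16-be').decode('latin-1')
       + '\n'
       + 'ü'.encode('utf-16-be').decode('latin-1'))

stripped = raw.replace('\x00', '')    # the euro sign is corrupted
decoded = decode_utf16_segments(raw)  # the euro sign survives

print(repr(stripped))
print(repr(decoded))
```

Here the euro sign (U+20AC) has a non-zero high byte, so stripping NULs turns it into two unrelated characters, while decoding recovers it. Whether a real PDF matches this assumption has to be checked against the actual output.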