I am reading some PDF files and, unfortunately, I can only use PyPDF2.
import PyPDF2

with open(filename1, 'rb') as pdfFileObj:
    # create a PDF reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    pageObj = pdfReader.getPage(0)
    gg = pageObj.extractText()
print(gg)  # <- first print shows readable text
type(gg)   # <- str
gg  # <- echoing the value shows bytes inside the string, e.g. '\x00K\x00o\x00n\x00t\x00o\x00a\x00u\x00s\x00z\x00ü\x00g\x00e\n\x00S\x00L\x00'
My issue is that gg is not a bytes object but a str that looks like a representation of bytes, so I cannot decode it into text.
How can I get at the text that print shows, or convert this byte-like representation into proper text, so that I can run some regexes on it?
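For reference, here is a minimal sketch of the kind of conversion I am after. It assumes the string is ordinary text with a NUL character ('\x00') in front of each real character, which is what the sample above looks like, and that simply dropping those NULs is safe. That assumption would not hold for characters outside the Latin-1 range. The helper name strip_nul_padding is made up for illustration and is not part of PyPDF2.

```python
import re


def strip_nul_padding(text: str) -> str:
    """Drop the NUL characters interleaved with the real characters.

    Assumption: every '\x00' in the string is padding, not content.
    """
    return text.replace("\x00", "")


# Sample value copied from the echoed string in the question.
gg = "\x00K\x00o\x00n\x00t\x00o\x00a\x00u\x00s\x00z\x00ü\x00g\x00e\n\x00S\x00L\x00"

clean = strip_nul_padding(gg)
print(repr(clean))

# The cleaned string can now be searched with a regular expression.
match = re.search(r"Konto\w+", clean)
print(match.group(0) if match else None)
```

Decoding the whole string as UTF-16 does not look reliable here, because the '\n' in the sample has no NUL in front of it and would shift the byte pairs out of alignment.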