
I am trying to convert a PDF to a text file, but the characters specific to the PDF's language (the ones not used in English) are lost in the conversion.

How can I convert PDF to text without losing any data?

Thanks in advance.

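import PyPDF2

filename = 'Book1.pdf'

# Read every page and collect the extracted text.
# Named `text` rather than `str` to avoid shadowing the built-in.
with open(filename, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    text = ""
    for i in range(pdf_reader.getNumPages()):
        text += pdf_reader.getPage(i).extractText()

# Write the result as UTF-8 so non-ASCII characters can be stored.
with open("text2.txt", 'w', encoding='utf-8') as f:
    f.write(text)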
Anar
  • I think generally it would be enough to let us know what the special characters are, and which language (or character set) you are trying to work with. – dennlinger Jun 12 '21 at 19:50
  • It is Azerbaijani. The characters are "ə, ç, ı, ğ, ö". – Anar Jun 13 '21 at 06:10
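
One way to narrow this down, offered as a sketch rather than a fix: the script already writes with encoding='utf-8', so the characters may be going missing during extraction rather than when the file is written. The check below tests both halves separately. The helper names `missing_chars` and `utf8_round_trip` are hypothetical, and `sample` is a stand-in for the string produced by the extraction loop.

```python
import os
import tempfile

# Characters named in the comment above as being lost.
AZERBAIJANI_CHARS = "əçığö"


def missing_chars(text, expected=AZERBAIJANI_CHARS):
    """Return the expected characters that do not occur in text."""
    return [ch for ch in expected if ch not in text]


def utf8_round_trip(text):
    """Write text to a temporary UTF-8 file and read it back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


# Stand-in for real extracted text (assumption: replace with the
# string built from the PDF pages).
sample = "ə, ç, ı, ğ, ö"

# An empty list means every expected character is present.
print(missing_chars(sample))
# True means the UTF-8 write/read step preserved the text.
print(utf8_round_trip(sample) == sample)
```

If `missing_chars` reports characters as absent from the extracted string itself, the loss happens before the file is written, which would point at the PDF extraction step or at how the PDF encodes its fonts rather than at the output encoding.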

0 Answers