I am trying to extract the text from all the PDFs in a directory and write it to a txt file. This works, but I get an error whenever a PDF contains non-Latin letters. How can I modify the code below to avoid the error at the bottom? I have looked at similar questions and tried many of the suggested solutions, but none worked. Thank you.

import glob
import PyPDF2
pdfs=glob.glob("/private/Documents/*.pdf")

for pdf in pdfs:
    with open(pdf, 'rb') as pdfFileObj:
        
        # creating a pdf reader object
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj,strict=False)
        print(pdfReader.numPages)
        pageObj = pdfReader.getPage(0)
        gg = pageObj.extractText()
        print(gg)
        utxt = str(gg)
        print(utxt)
        stxt = utxt.encode('latin-1', 'ignore')
        print(stxt)

with open('quotes.txt', 'w', encoding='utf-8') as f:
    f.write(utxt)

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0445' in position 0: ordinal not in range(256)

  • `\u0445` is `х` (U+0445, *Cyrillic Small Letter Ha*), hence not in `latin1`. – JosefZ Nov 01 '22 at 14:42
  • Thank you very much. Just wondering: how can I adjust the encoding in my code above so that it does not throw errors for Latin and Cyrillic text? Alternatively, I understand it is possible to either ignore the offending characters or replace them with a question mark. How could I use that here? Even skipping over any errors (omitting the given PDF) would be fine. Thank you. – Babiqowski Nov 02 '22 at 02:03
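
A minimal sketch of the options raised in the comments, under some assumptions. The traceback in the question is cut off, so which line raises is a guess: `encode('latin-1', 'ignore')` does not raise by itself, so a plausible culprit is one of the `print` calls when the console encoding is latin-1. Separately, the final `f.write(utxt)` runs after the loop, so it only ever writes the text of the last PDF. The helper names below (`safe_encode`, `collect_texts`, `write_utf8`, `tolerant_stdout`) are made up for illustration and are not part of PyPDF2.

```python
import sys
from pathlib import Path


def safe_encode(text, encoding="latin-1", errors="replace"):
    """Encode text; characters the codec lacks become '?' (or vanish with 'ignore')."""
    return text.encode(encoding, errors=errors)


def collect_texts(paths, extract):
    """Call extract(path) for every path and skip any path whose extraction raises."""
    texts = []
    for path in paths:
        try:
            texts.append(extract(path))
        except Exception as exc:  # broad on purpose: one bad PDF should not stop the run
            print(f"skipping {path!r}: {exc!r}")
    return texts


def write_utf8(texts, out_path):
    """Write all collected texts to one file; UTF-8 can hold Latin and Cyrillic alike."""
    Path(out_path).write_text("\n".join(texts), encoding="utf-8")


def tolerant_stdout():
    """Ask stdout to substitute unprintable characters instead of raising (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(errors="replace")


sample = "\u0445abc"
print(safe_encode(sample))                   # b'?abc'
print(safe_encode(sample, errors="ignore"))  # b'abc'
```

To plug this into the loop from the question, `extract` could be a small function that opens one PDF and returns `pdfReader.getPage(0).extractText()`, and `paths` would be the `pdfs` list from `glob`. Writing the result with `write_utf8` avoids latin-1 entirely, so no characters are lost in the output file. The `errors` argument only matters if latin-1 bytes are really needed somewhere.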
