Python encoding errors latin-1 PyPDF2

Question

I am trying to extract the text of all the PDFs in my directory and write it to a txt file. This mostly works, but an error occurs for PDFs that contain non-Latin letters, which I run into frequently. Could someone tell me how to modify the code below to avoid the error shown at the bottom? I have looked into similar questions and tried many solutions, but none worked. Thank you.

import glob
import PyPDF2

pdfs = glob.glob("/private/Documents/*.pdf")

# open the output file once so the text of every pdf ends up in it
with open('quotes.txt', 'w', encoding='utf-8') as f:
    for pdf in pdfs:
        with open(pdf, 'rb') as pdfFileObj:
            # creating a pdf reader object
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
            print(pdfReader.numPages)
            # only the first page of each pdf is read
            pageObj = pdfReader.getPage(0)
            gg = pageObj.extractText()
            print(gg)
            utxt = str(gg)
            print(utxt)
            stxt = utxt.encode('latin-1', 'ignore')
            print(stxt)
            f.write(utxt)

UnicodeEncodeError: 'latin-1' codec can't encode character '\u0445' in position 0: ordinal not in range(256)

`\u0445` is `х` (U+0445, *Cyrillic Small Letter Ha*) hence not in `latin1`. — JosefZ, Nov 01 '22 at 14:42
Thank you very much. Just wondering how I can adjust the encoding in my code above so that it does not throw errors for Latin and Cyrillic text? Alternatively, I realise that it is possible to either ignore the offending symbols or replace them with a question mark. How could I use that here? Even skipping over any errors (omitting the given pdf) would be fine. Thank you — Babiqowski, Nov 02 '22 at 02:03
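
A rough sketch of the three options mentioned in the comment above (replace with a question mark, write as UTF-8, or skip the offending pdf). One thing to check first: `encode('latin-1', 'ignore')` cannot itself raise `UnicodeEncodeError`, so the error is probably coming from somewhere else, for example a `print` to a console whose encoding is latin-1, or from inside the library; the full traceback would show which line. The names `dump_texts`, `printable` and `fake_extract` below are made up for illustration, and `fake_extract` only stands in for whatever function opens a pdf and returns its text.

```python
import os
import tempfile
from typing import Callable, Iterable, List


def dump_texts(paths: Iterable[str],
               extract: Callable[[str], str],
               out_path: str) -> List[str]:
    """Write extract(path) for every path to out_path as UTF-8.

    Returns the paths that were skipped because extraction raised
    UnicodeEncodeError.
    """
    skipped = []
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            try:
                text = extract(path)
            except UnicodeEncodeError:
                skipped.append(path)  # omit this file and carry on
                continue
            out.write(text + "\n")
    return skipped


def printable(text: str, encoding: str = "latin-1") -> str:
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, "replace").decode(encoding)


def fake_extract(path: str) -> str:
    """Hypothetical stand-in for the real pdf text extraction."""
    if path == "bad.pdf":
        # simulate the failure reported in the question
        raise UnicodeEncodeError("latin-1", "\u0445", 0, 1,
                                 "ordinal not in range(256)")
    return "\u0445 caf\u00e9"  # Cyrillic ha followed by Latin text


out_path = os.path.join(tempfile.mkdtemp(), "quotes.txt")
skipped = dump_texts(["ok.pdf", "bad.pdf"], fake_extract, out_path)

with open(out_path, encoding="utf-8") as f:
    written = f.read()

print(skipped)
print(printable(written))  # safe even on a latin-1 console
```

UTF-8 can represent both Latin and Cyrillic characters, so writing the file with `encoding='utf-8'` should not need any error handler; `printable` is only needed where the text has to pass through a latin-1 output. Passing `"ignore"` instead of `"replace"` to `encode` drops the characters rather than substituting `?`.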

0 Answers