How to read a Persian PDF and scrape its contents?

Question

I am trying to read this Persian PDF, but the extracted text is not decoded correctly. I also tried UTF-16 and UTF-32, but neither produced readable output. I want to get the Persian dates inside the table. I tried other libraries as well, but none of them extracted usable text. A year after asking this question, I still have not found a good solution for reading Persian PDF files.

import io

import requests
# Uses the pre-3.0 PyPDF2 API (PdfFileReader, getPage, extractText)
from PyPDF2 import PdfFileReader

urlpdf = "https://www.codal.ir/Reports/DownloadFile.aspx?id=LG5QhAhMbfl2DrQQQaQQQ%2bkR9nMQ%3d%3d"
# verify=False skips TLS certificate verification
response = requests.get(urlpdf, verify=False, timeout=5)
with io.BytesIO(response.content) as f:
    pdf = PdfFileReader(f)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()
    txt = f"""
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    # Print the PDF metadata
    print(txt)
    # Zero-based index of the page to extract (0 is the first page)
    numpage = 0
    page = pdf.getPage(numpage)
    page_content = page.extractText() + "\n"
    # Save the extracted page text to a UTF-8 file, then print it
    with open("extract.txt", "w", encoding="UTF-8") as g:
        g.write(page_content)
    print(page_content)

What other pdf scraping tools did you use? `pdfminer`, `camelot`, or another one? — K.Cl, Apr 06 '21 at 16:05
I have tried pdfminer but not camelot. Could you do it with these libraries? — yasharov, May 23 '21 at 13:34
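
Since the stated goal is the Persian dates in the table, the post-processing step might look something like the sketch below once some extractor returns text. This is not a fix for the decoding problem itself. It assumes the extracted string contains real Unicode characters (possibly as Arabic presentation forms), that dates are written as year/month/day with "/" separators, and that the digits appear in logical order rather than reversed. The name `find_persian_dates` and the date pattern are hypothetical, chosen only for illustration.

```python
import re
import unicodedata

# Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to ASCII.
DIGIT_MAP = {}
for i in range(10):
    DIGIT_MAP[0x06F0 + i] = str(i)
    DIGIT_MAP[0x0660 + i] = str(i)

# Assumed layout: a four-digit Solar Hijri year (13xx or 14xx), then month and day.
DATE_RE = re.compile(r"\b(1[34]\d{2})/(\d{1,2})/(\d{1,2})\b")


def find_persian_dates(text):
    """Return dates found in `text` as zero-padded 'YYYY/MM/DD' ASCII strings."""
    # NFKC folds Arabic presentation forms into their base letters;
    # translate() then rewrites Persian digits as ASCII digits.
    normalized = unicodedata.normalize("NFKC", text).translate(DIGIT_MAP)
    dates = []
    for year, month, day in DATE_RE.findall(normalized):
        # Drop matches whose month or day is out of range.
        if 1 <= int(month) <= 12 and 1 <= int(day) <= 31:
            dates.append(f"{year}/{int(month):02d}/{int(day):02d}")
    return dates


if __name__ == "__main__":
    sample = "\u06f1\u06f3\u06f9\u06f9/\u06f1\u06f2/\u06f3\u06f0"  # Persian digits
    print(find_persian_dates(sample))
```

If the extractor returns glyph codes that do not map to Unicode at all, which garbled output can indicate, normalization will not help and the sketch above will find nothing.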
