
I want to extract text from a Chinese PDF whose pages are images rather than copyable text. First, I save each page of the PDF as an image with the following code:

    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(fi)  # fi is the input PDF
    num_pages = len(pdf)
    for page_number in range(num_pages):
        page = pdf.get_page(page_number)
        # Render the page at 16x scale and convert it to a PIL image
        pil_image = page.render(scale=16, rotation=0, crop=(0, 0, 0, 0)).to_pil()
        pil_image.save(f"image_{page_number}.png")

The PDF has three pages, so this produces three images. Then I use the following code to get the text from them:

    from easyocr import Reader

    reader_ch = Reader(['ch_sim'])
    text2 = reader_ch.readtext_batched(imag_list, detail=0, paragraph=True, batch_size=16)

(imag_list is a list of the paths of the images.)

I am running the code on a server with 3 GPUs. This is the result I get:

    Traceback (most recent call last):
      File "path_to_python.py", line 267, in <module>
        text2 = reader_ch.readtext_batched(imag_list,detail = 0, paragraph=True,batch_size=16)
      File ".local/lib/python3.9/site-packages/easyocr/easyocr.py", line 553, in readtext_batched
        img, img_cv_grey = reformat_input_batched(image, n_width, n_height)
      File ".local/lib/python3.9/site-packages/easyocr/utils.py", line 791, in reformat_input_batched
        img, img_cv_grey = np.array(img), np.array(img_cv_grey)
    ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
    -> Cannot close object, library is destroyed. This may cause a memory leak!
    -> Cannot close object, library is destroyed. This may cause a memory leak!
    -> Cannot close object, library is destroyed. This may cause a memory leak!
    -> Cannot close object, library is destroyed. This may cause a memory leak!
    -> Cannot close object, library is destroyed. This may cause a memory leak!

How can I resolve this issue? (Note that all of the pages/images appear to have the same size, so I don't know where the "inhomogeneous shape" comes from.)
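
One way to double-check the "same size" assumption is to read the pixel dimensions directly from the saved PNG files. Since the images are rendered at scale 16, even a small difference in page size would give a different pixel size. The following is only a minimal sketch using the standard library; the helper names `png_size` and `group_by_size` are hypothetical and are not part of EasyOCR or pypdfium2, and it assumes the files are the PNGs written by the loop above:

```python
import struct
from collections import defaultdict

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type "IHDR",
    # 16-19: width, 20-23: height (both big-endian unsigned 32-bit ints).
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} does not look like a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def group_by_size(paths):
    """Map each (width, height) to the list of files that have that size."""
    groups = defaultdict(list)
    for path in paths:
        groups[png_size(path)].append(path)
    return dict(groups)


# Hypothetical usage with the list from the question:
#     print(group_by_size(imag_list))
```

If `group_by_size(imag_list)` returns more than one key, the images do not all share the same dimensions, which would be consistent with NumPy being unable to stack them into a single array.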

Aqqqq
