Paddle OCR Issue when passing pdf file for text detection

Question

Hi, I am facing an issue when passing a PDF file to paddleocr.

My code is:

!paddleocr --image_dir /content/SER-1678793239.pdf --use_angle_cls true --use_gpu false

The issue I am facing is:

AttributeError: 'Document' object has no attribute 'pageCount'

It works fine for image files, though.

I tried several things, such as changing the PDF file name and the number of pages, but nothing worked.

score 0 · Answer 1 · answered Jul 14 '23 at 17:53

You can edit the installed file directly at C:\Python3.10.0\Lib\site-packages\paddleocr\ppocr\utils\utility.py (the exact site-packages location depends on your installation).

From line 93:

# Excerpt: fitz, Image, cv2, np and the imgs list come from the surrounding function.
with fitz.open(img_path) as pdf:
    for pg in range(0, pdf.page_count):
        page = pdf[pg]
        # render at 2x zoom for better OCR quality
        mat = fitz.Matrix(2, 2)
        pm = page.get_pixmap(matrix=mat, alpha=False)

        # if width or height > 2000 pixels, don't enlarge the image
        if pm.width > 2000 or pm.height > 2000:
            pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

        img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        imgs.append(img)
    return imgs, False, True

I changed the camelCase names to their snake_case equivalents:

pageCount -> page_count, getPixmap -> get_pixmap

You can also refer to this link : https://github.com/PaddlePaddle/PaddleOCR/discussions/8972
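
If editing the installed file is not an option, the same two renames can in principle be bridged from your own code by adding the old camelCase names back as aliases before PaddleOCR runs. The following is only a sketch, not something taken from PaddleOCR or PyMuPDF: add_alias and patch_pymupdf are hypothetical helper names, and it assumes the installed PyMuPDF defines page_count and get_pixmap as ordinary attributes of fitz.Document and fitz.Page.

```python
def add_alias(cls, old_name, new_name):
    """Make cls.old_name refer to cls.new_name when only the new name exists.

    Returns True if an alias was added, False if nothing needed doing.
    """
    if hasattr(cls, old_name) or not hasattr(cls, new_name):
        return False
    # Reading the attribute from the class (not an instance) yields the
    # underlying property or function, so the alias behaves identically.
    setattr(cls, old_name, getattr(cls, new_name))
    return True


def patch_pymupdf():
    """Restore the camelCase names from the error message on PyMuPDF classes.

    Assumption: the installed PyMuPDF defines page_count and get_pixmap as
    ordinary attributes of fitz.Document and fitz.Page.
    """
    import fitz  # PyMuPDF; imported here so add_alias stays usable without it

    add_alias(fitz.Document, "pageCount", "page_count")
    add_alias(fitz.Page, "getPixmap", "get_pixmap")
```

patch_pymupdf() would have to be called in the same Python process that runs PaddleOCR, before the PDF is read. It cannot help the paddleocr command-line call from the question, because that command runs in a separate process.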

Thanks for your response, but I don't want to edit library files directly. It's considered bad practice, since a library update can wipe out the customizations or other code can break. - Asif, Jul 18 '23 at 13:25
The issue above is caused by a version mismatch: those renames were not carried over into PaddleOCR. I think they will fix it in upcoming commits, as it has been discussed in the PaddleOCR discussions. This edit should not break anything, because we only change the method names to match the upgrade. Downgrading the version might affect PaddleOCR's PDF functionality. - Innoviki, Jul 21 '23 at 12:34

Asif · Accepted Answer · 2023-08-28T13:27:41.740

I solved the issue by uninstalling the pymupdf library (which had been installed automatically with paddleocr) using the command below:

!pip uninstall pymupdf

Then I installed a specific version, pymupdf==1.19.0, and the issue was resolved:

!pip install --ignore-installed pymupdf==1.19.0

Now it's working fine!
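
To confirm that the pinned version is the one actually installed, a small check like the following can help. It is only a sketch: version_tuple, installed_pymupdf_version and matches_pin are hypothetical helper names, and it assumes the package is registered under the distribution name PyMuPDF.

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn a version string such as '1.19.0' into (1, 19, 0).

    Parsing stops at the first dot-separated part that does not start with a digit.
    """
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def installed_pymupdf_version():
    """Return the installed PyMuPDF version string, or None if it is not installed."""
    try:
        return metadata.version("PyMuPDF")
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned="1.19.0"):
    """True when the installed version equals the pinned one."""
    return installed is not None and version_tuple(installed) == version_tuple(pinned)


if __name__ == "__main__":
    found = installed_pymupdf_version()
    print("PyMuPDF version:", found, "| matches pin:", matches_pin(found))
```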

Note: The ! prefix tells the notebook to run the line as a shell command rather than as Python code, so if you are running these commands outside a notebook, remove the leading !.
